olivernn / lunr.js

A bit like Solr, but much smaller and not as bright
http://lunrjs.com
MIT License

Strange behaviour with word "general" #377

Open tknuth opened 5 years ago

tknuth commented 5 years ago

I am going a little crazy because I do not understand the following behaviour. I have a couple of objects like these:

{
  category: "general",
  title: "a title"
}

When I search for "general", I get results only as long as I type up to "gener". Typing "genera" or "general" removes all matching results.

The word "general" is not part of the stopwordfilter.js, as far as I understand. So what is causing this? Changing it to "generol" works just fine. It seems to be this specific phrase, but I could not find any hint in the repo.

lunr(function() {
  this.ref("path");
  this.field("category");
  this.field("title");
  docs.forEach(function(docObj) {
    this.add(docObj);
  }, this);
});
idx(tree)
  .query(function(q) {
    q.term(query.toLowerCase(), {
      wildcard: lunr.Query.wildcard.LEADING | lunr.Query.wildcard.TRAILING
    });
  })

tknuth commented 5 years ago

The same occurs with "introduction" ... it works until the "T" in "introducTion". Am I missing a basic concept?

hoelzro commented 5 years ago

@tknuth When you say you changed "general" to "generol", is this in the document, or in the query?

The reason for this comes down to the interaction between wildcards and stemming. Judging from a little test I ran, the stemmer removes the "al" suffix from words, so "general" is stored in the inverted index as "gener". When you use wildcards with your input, any query up to and including "gener" is a substring of the stored token and therefore matches, as you observed. Once you query with "genera", lunr starts looking for tokens that contain "genera". Since "gener" doesn't contain that string, it doesn't contribute towards a match.
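
To make the mechanism concrete, here is a minimal sketch in plain JavaScript. It is a simplified model, not lunr's actual implementation: `toyStem` is a hypothetical stand-in that only strips a trailing "al", and `matchesWithWildcards` models a query with leading and trailing wildcards as a plain substring test against the stored tokens.

```javascript
// Hypothetical stand-in for a stemmer: strips only a trailing "al".
// lunr's real stemmer applies many more rules.
function toyStem(word) {
  return word.endsWith("al") ? word.slice(0, -2) : word;
}

// Tokens as they would be stored after stemming at index time.
const storedTokens = ["general", "title"].map(toyStem); // ["gener", "title"]

// Models a query with leading and trailing wildcards ("*input*"):
// the raw input has to appear inside a stored token.
function matchesWithWildcards(input) {
  return storedTokens.some((token) => token.includes(input));
}

const outcomes = ["gen", "gener", "genera", "general"].map((input) => [
  input,
  matchesWithWildcards(input),
]);
console.log(outcomes);
// [["gen", true], ["gener", true], ["genera", false], ["general", false]]
```

In this model, the input stops matching as soon as it is longer than the stemmed token, which mirrors the behaviour described in the original report.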

Let's back up a bit and talk about things at a higher level - what are you trying to use lunr to do? It might be possible with some tweaking to the index builder, or it could be that lunr isn't a good fit for your use case. Let's have a discussion to see if lunr works for you!

tknuth commented 5 years ago

@hoelzro Thank you so much for your effort! So it was just a lack of understanding of how lunr works internally.

I removed the stemmer:

this.pipeline.remove(lunr.stemmer);
this.searchPipeline.remove(lunr.stemmer);

Now lunr works as expected, which makes me happy.

However, there is obviously value in what the stemmer does. So I would ideally want to be able to match words exactly in addition to the standard behaviour of lunr.

In other words, I could create two lunr search indices, one with a stemmer and one without. Then I could merge the result lists into one. Would that have negative side effects besides reduced performance?

Regarding your higher-level question: I would like to use lunr for search on a web page, and the problem arose when I wanted to allow the user to search for parts of the table of contents as well. I know that some of my users like copying and pasting terms, so it is crucial that they can search for complete words. That's why the issue came up.

To sum up:

  1. My problem is solved
  2. However, a combined solution would be even better. Do you have any comments on my suggested approach, or another idea for how to achieve that?

Thank you so much for your time and effort. I appreciate that!

hoelzro commented 5 years ago

@tknuth I think your approach of having two separate indices would work, but as you point out, you could run into performance issues. Another potential issue is the ranking of search results - I'm not exactly sure how that would shake out when merging result lists. @olivernn and I were having a discussion about this issue on one of my projects using lunr here - I plan on experimenting with different approaches in the near future. I'll follow up here if I feel like I find a good solution!
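
As a rough illustration of the merging step, the sketch below combines two result lists by `ref` and keeps the higher score for documents found by both indices. The `mergeResults` helper and the sample data are hypothetical, and "keep the best score" is only one possible policy: scores from two different indices are not necessarily comparable, which is exactly the ranking concern raised above. The only assumption about lunr is that each result carries a `ref` and a `score`.

```javascript
// Hypothetical helper: merge two lunr-style result lists ({ ref, score }).
// For a document present in both lists, the entry with the higher score wins.
function mergeResults(stemmedResults, exactResults) {
  const bestByRef = new Map();
  for (const result of [...stemmedResults, ...exactResults]) {
    const existing = bestByRef.get(result.ref);
    if (!existing || result.score > existing.score) {
      bestByRef.set(result.ref, result);
    }
  }
  // Highest score first, like a single lunr result list.
  return [...bestByRef.values()].sort((a, b) => b.score - a.score);
}

// Made-up sample data standing in for the output of two separate indices.
const merged = mergeResults(
  [{ ref: "a", score: 1.2 }, { ref: "b", score: 0.4 }],
  [{ ref: "b", score: 0.9 }, { ref: "c", score: 0.7 }]
);
console.log(merged.map((r) => r.ref)); // ["a", "b", "c"]
```

Other policies, such as summing the scores or always ranking exact matches first, would fit into the same loop.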

chris-miaskowski commented 1 month ago

I've seen this error as well, and the solution that is often mentioned is to remove the stemmer. I found a different solution that keeps both the stemmed and the wildcard results without the need for another index.

var idx = lunr(function () {
  this.ref('id')
  this.field('text')
  this.pipeline.remove(lunr.stemmer)
  // replace original stemmer with one that returns two tokens
  this.pipeline.add((token) => {
    const clone = token.clone()
    return [lunr.stemmer(token), clone]
  })
  this.metadataWhitelist = ['position']

  autocompleteWithIds.forEach((doc) => {
    this.add(doc)
  })
})

// const result = idx.search('*billing*')
const result = idx.query((q) => {
  // for each word add two queries: one with wildcard, one w/o
  q.term('billed*', { wildcard: lunr.Query.wildcard.TRAILING | lunr.Query.wildcard.LEADING })
  q.term('billed')
})

This stems the plain search term, so lunr looks for "bill" as well as for the wildcard term billed*. If you indexed "billing", searches for "billing", "bill", "billed", etc. will all match.
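
The "two queries per word" step can be generalised to arbitrary user input along these lines. This is only a sketch: `addWordClauses` is a hypothetical helper, and `fakeQuery` is a stand-in that records calls so the example runs without lunr. With a real index, the helper would be called inside `idx.query(...)` with `lunr.Query.wildcard.LEADING | lunr.Query.wildcard.TRAILING` as the flags.

```javascript
// Hypothetical helper: for each whitespace-separated word, add one wildcard
// clause and one plain clause, following the approach in the comment above.
function addWordClauses(q, input, wildcardFlags) {
  input
    .toLowerCase()
    .split(/\s+/)
    .filter(Boolean)
    .forEach((word) => {
      q.term(word, { wildcard: wildcardFlags }); // wildcard clause
      q.term(word); // plain clause, which the search pipeline can stem
    });
}

// Stand-in for lunr's query builder; it only records what was requested.
const recorded = [];
const fakeQuery = {
  term: (term, options) => recorded.push({ term, options }),
};

// Placeholder value; with lunr, pass
// lunr.Query.wildcard.LEADING | lunr.Query.wildcard.TRAILING instead.
const flags = 3;

addWordClauses(fakeQuery, "Billed  invoices", flags);
console.log(recorded.map((clause) => clause.term));
// ["billed", "billed", "invoices", "invoices"]
```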