olivernn / lunr.js

A bit like Solr, but much smaller and not as bright
http://lunrjs.com
MIT License
8.91k stars 546 forks source link

Matching results disappear when expending the term #295

Closed gsavin closed 6 years ago

gsavin commented 7 years ago

Hi Oliver,

I am working with lunr since a couple of week (which I have to say is awesome in term of easy to use, and features). I am having a strange behavior who seems to be similar to something already described in this issue #68

I have entries some containing phone, others containing iphone. If I search for phon, both phone and iphone entries are returned. If I search for phone, only phone entries are returned.

I am using leading and trailing wildcard on term :

q.term(term, {
    boost: 100
})

q.term(term, {
     boost: 10,
     usePipeline: false,
     wildcard: lunr.Query.wildcard.LEADING | lunr.Query.wildcard.TRAILING
})

I tried by adding only with leading wildcard, but results are the same. It can be easily reproduce with this code :

const lunr = require('lunr')
const data = [{name: 'phone'}, {name: 'iphone'}]
const index = lunr(function () {
  this.ref('name')
  this.field('name')
  data.forEach(item => this.add(item))
})

index.search('phone')
// [ { ref: 'phone', score: 1, matchData: { metadata: [Object] } } ]
index.search('iphone')
// [ { ref: 'iphone', score: 1, matchData: { metadata: [Object] } } ]
index.search('*phone')
// [ { ref: 'phone', score: 1, matchData: { metadata: [Object] } } ]
index.search('*phon')
// [ { ref: 'iphone', score: 1, matchData: { metadata: [Object] } } ]

Am I missing something, or is this a bug ? If it's one, how can I help to solve it ?

Thanks a lot.

olivernn commented 7 years ago

I've created a fiddle with your reproduction, at the very least those results are not what I would expect, specifically the third one "*phone".

My guess is this is to do with stemming, though I'd have to do a bit more digging to be sure.

gsavin commented 7 years ago

Thanks. I will try to dig in it too, but I discovered stemming just a few days ago....

olivernn commented 7 years ago

Ok, I've taken a closer look at this, and it is due to stemming.

If you look at index.invertedIndex you will see what Lunr has indexed, it has two entries: iphon and phone. So, it is stemming "iphone" to "iphon", dropping the "e", but not doing so with "phone", this is weird, but algorithmic stemming is not perfect. I would certainly have expected those two terms to be stemmed identically.

Knowing that, its now easier to understand the results you are seeing: "phone" can't match "iphone" because its actually indexed as "iphon". You're 4th search is close but doesn't have a wildcard at the end, making it `phon*` would give you the results you expect.

You can remove the stemmer, though this will likely increase the index size, and reduce re-call for most searches:

lunr(function () {
  this.pipeline.remove(lunr.stemmer)
  // define the index
})

Alternatively you could put a pipeline function before the stemmer that indexes "iphone" as "iphone" and "phone".

Finally, you could try and figure out why the stemmer is doing what it is, though I warn you it is a lot of regex in the implementation, there is also a description of the rules it is following which might help.

If it was me, I'd go for option 2. Let me know how you get along with it.

gsavin commented 7 years ago

Hi @olivernn

Thanks a lot for the time you spent on this. I understand that stemming is causing the trouble. Second solution sounds the best one, but I am just afraid there are others terms like iphone that I am not able to see...

That's why I am using leading and trailing wildcard, but if*phon* is working fine, *phone* is not working. I guess the trailing wildcard is understood as at least one character.

I will read the links you gave me and try to use a pipeline function.

mike1808 commented 7 years ago

I have the same issue with trailing wildcards and stemmer. For example if you try to index sophie using default index and try to search sophie* it will return empty results.

Here is fiddle with my example. Workaround for that is include term without wildcard in the query, here is fiddle to demonstrate that.

gsavin commented 7 years ago

I had a look into the stemmer code, and found out why he is doing this. There is this regex : var re_5 = /^(.+?)e$/ which is removing the trailing e of tokens. The same regex will apply for sophie @mike1808 So it's fine the search is going through the pipeline, but since using leading or trailing wildcard is disabling it, lunr is not able to solve *phone properly... Is there a way to run the stemmer on the unwildcarded part of the query and then add the wildcards back ? Something like : '*term' -> '*' + stemmer('term') -> '*stemmedTerm'.

gsavin commented 7 years ago

Just an update. I was able to do what I was saying with :

const results = index.query(function (q) {
  const stemmed = index.pipeline.runString('phone')

  stemmed.forEach(term => {
    q.term(term, {
      boost: 10,
      usePipeline: false,
      wildcard: lunr.Query.wildcard.LEADING
    })
  })
})

But.... If iphone is turned as iphon by the stemmer, phone is not turned as phon, because it does not match the replace condition of step 5 if (re.test(stem) || (re2.test(stem) && !(re3.test(stem)))). Such a pain in the butt..

olivernn commented 7 years ago

@gsavin yeah, stemming is a bit of a black box, I didn't write the current stemmer, so its even more of a black box. I did think about re-implementing it, if nothing else then to understand the rules a bit more, but time is always in short supply!

@mike1808 regarding the behaviour of the wildcard, I think that is a bug, or at least incorrectly implemented. I took inspiration for the syntax from Lucene and its * matches 0 or more, where it looks like Lunr implements it as 1 or more. I'll take a look at what would be involved in fixing this.

myalgo commented 7 years ago

@olivernn, I have the same issue as @mike1808 described. Thanks in advance for looking into it.

olivernn commented 7 years ago

@mike1808 actually, scratch that, I just wrote a test and it does match 0 or more characters, you're right that it does need stemming to match.

olivernn commented 7 years ago

@myalgo @mike1808 the test I just added.

myalgo commented 7 years ago

@olivernn, why does * match 1 or more in Lunr? 0 or more sounds more appropriate. * seems moreover like + here

mdobeck commented 6 years ago

Is the proposed solution the best option right now?

lunr(function () {
  this.pipeline.remove(lunr.stemmer)
  // define the index
})

Or is there going to be a future change to wildcard behavior? Right now I do lose search results on many queries when I include the full query term and a wildcard. Like: Mike's `sophie` example.

olivernn commented 6 years ago

@mdobeck that is the right way to remove the stemmer when building the index, there is also a search pipeline this.searchPipeline, that also includes stemming.

I'm going to close this now as I don't think there is anything left to do, feel free to comment and I'll re-open if required.