Closed gsavin closed 6 years ago
Thanks. I will try to dig in it too, but I discovered stemming just a few days ago....
Ok, I've taken a closer look at this, and it is due to stemming.
If you look at index.invertedIndex
you will see what Lunr has indexed, it has two entries: iphon
and phone
. So, it is stemming "iphone" to "iphon", dropping the "e", but not doing so with "phone", this is weird, but algorithmic stemming is not perfect. I would certainly have expected those two terms to be stemmed identically.
Knowing that, its now easier to understand the results you are seeing: "phone" can't match "iphone" because its actually indexed as "iphon". You're 4th search is close but doesn't have a wildcard at the end, making it `phon*` would give you the results you expect.
You can remove the stemmer, though this will likely increase the index size, and reduce re-call for most searches:
lunr(function () {
this.pipeline.remove(lunr.stemmer)
// define the index
})
Alternatively you could put a pipeline function before the stemmer that indexes "iphone" as "iphone" and "phone".
Finally, you could try and figure out why the stemmer is doing what it is, though I warn you it is a lot of regex in the implementation, there is also a description of the rules it is following which might help.
If it was me, I'd go for option 2. Let me know how you get along with it.
Hi @olivernn
Thanks a lot for the time you spent on this. I understand that stemming is causing the trouble. Second solution sounds the best one, but I am just afraid there are others terms like iphone
that I am not able to see...
That's why I am using leading and trailing wildcard, but if*phon*
is working fine, *phone*
is not working. I guess the trailing wildcard is understood as at least one character.
I will read the links you gave me and try to use a pipeline function.
I have the same issue with trailing wildcards and stemmer. For example if you try to index sophie
using default index and try to search sophie*
it will return empty results.
Here is fiddle with my example. Workaround for that is include term without wildcard in the query, here is fiddle to demonstrate that.
I had a look into the stemmer code, and found out why he is doing this. There is this regex : var re_5 = /^(.+?)e$/
which is removing the trailing e
of tokens. The same regex will apply for sophie
@mike1808
So it's fine the search is going through the pipeline, but since using leading or trailing wildcard is disabling it, lunr is not able to solve *phone
properly...
Is there a way to run the stemmer on the unwildcarded part of the query and then add the wildcards back ? Something like : '*term' -> '*' + stemmer('term') -> '*stemmedTerm'
.
Just an update. I was able to do what I was saying with :
const results = index.query(function (q) {
const stemmed = index.pipeline.runString('phone')
stemmed.forEach(term => {
q.term(term, {
boost: 10,
usePipeline: false,
wildcard: lunr.Query.wildcard.LEADING
})
})
})
But.... If iphone
is turned as iphon
by the stemmer, phone
is not turned as phon
, because it does not match the replace condition of step 5 if (re.test(stem) || (re2.test(stem) && !(re3.test(stem))))
. Such a pain in the butt..
@gsavin yeah, stemming is a bit of a black box, I didn't write the current stemmer, so its even more of a black box. I did think about re-implementing it, if nothing else then to understand the rules a bit more, but time is always in short supply!
@mike1808 regarding the behaviour of the wildcard, I think that is a bug, or at least incorrectly implemented. I took inspiration for the syntax from Lucene and its *
matches 0 or more, where it looks like Lunr implements it as 1 or more. I'll take a look at what would be involved in fixing this.
@olivernn, I have the same issue as @mike1808 described. Thanks in advance for looking into it.
@mike1808 actually, scratch that, I just wrote a test and it does match 0 or more characters, you're right that it does need stemming to match.
@olivernn, why does *
match 1 or more in Lunr? 0 or more sounds more appropriate.
*
seems moreover like +
here
Is the proposed solution the best option right now?
lunr(function () {
this.pipeline.remove(lunr.stemmer)
// define the index
})
Or is there going to be a future change to wildcard behavior? Right now I do lose search results on many queries when I include the full query term and a wildcard. Like: Mike's `sophie` example.
@mdobeck that is the right way to remove the stemmer when building the index, there is also a search pipeline this.searchPipeline
, that also includes stemming.
I'm going to close this now as I don't think there is anything left to do, feel free to comment and I'll re-open if required.
Hi Oliver,
I am working with lunr since a couple of week (which I have to say is awesome in term of easy to use, and features). I am having a strange behavior who seems to be similar to something already described in this issue #68
I have entries some containing
phone
, others containingiphone
. If I search forphon
, bothphone
andiphone
entries are returned. If I search forphone
, onlyphone
entries are returned.I am using leading and trailing wildcard on term :
I tried by adding only with leading wildcard, but results are the same. It can be easily reproduce with this code :
Am I missing something, or is this a bug ? If it's one, how can I help to solve it ?
Thanks a lot.