Closed TomFoyster closed 6 years ago
@TomFoyster You're absolutely right - it is an issue with stemming. If you try stemming each of those variants with this.pipeline.runString from within the builder function, you get these results:
assist assist
assista assista
assistan assistan
assistant assist
I'm guessing the stemmer recognizes the -ant suffix and strips it out. What exactly do you want to accomplish? Do you want any partial string to match?
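The -ant guess above is consistent with how Porter-style stemmers behave: they drop certain suffixes only when the remaining stem is "long enough". The block below is a toy sketch of that single rule, not lunr's actual implementation; `isVowel`, `measure`, and `stripAnt` are hypothetical names, and the vowel test is deliberately simplified (it ignores the special handling of "y").

```javascript
// Toy sketch of ONE Porter-style rule: strip "-ant" when the remaining
// stem contains more than one vowel-consonant group.
// Assumption: simplified vowel test; this is not lunr's stemmer.
const isVowel = (ch) => 'aeiou'.includes(ch)

// Count vowel -> consonant transitions, i.e. the number of VC groups.
function measure(stem) {
  let m = 0
  for (let i = 1; i < stem.length; i++) {
    if (isVowel(stem[i - 1]) && !isVowel(stem[i])) m++
  }
  return m
}

// Apply only the "-ant" rule; every other word is returned unchanged.
function stripAnt(word) {
  if (!word.endsWith('ant')) return word
  const stem = word.slice(0, -3)
  return measure(stem) > 1 ? stem : word
}

for (const w of ['assist', 'assista', 'assistan', 'assistant']) {
  console.log(w, stripAnt(w))
}
// assist assist
// assista assista
// assistan assistan
// assistant assist
```

Only the complete word ends in -ant, so only it is reduced to assist; the partial words assista and assistan pass through unchanged, which matches the results listed above.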
@hoelzro Thanks for your reply.
Yes, essentially we want any partial to match - I think I'm struggling to get my head around stemming - this is legacy code written by a contractor that we're now needing to support.
ass, assi, assis, and assist all return a match on assistant - so we need it to follow the whole way through the word.
This is all part of an autocomplete system, so currently, as the user types, results appear, then disappear, and then reappear when they finish typing the word.
Massive thanks for your help, it's greatly appreciated.
@hoelzro I've made some progress, but I'm not quite there.
I've updated the fiddle to better show the issue;
https://jsfiddle.net/mfzx2gq9/5/
I've experimented with some code I found elsewhere, which has reduced the issue down to a single query returning no results - assistan. This can be solved by upping the editDistance value to 6, which increases the fuzziness of the search. That, though, would lead to some poor matches and has a noticeable negative impact on the search time, even in this very small example.
If my understanding is correct, and based on the query methods in my example:
ass, assi, and assis are returned as they partially match (with no pipeline).
assist and assistant match because, once stemmed, they equal the stem of assistant, which is assist.
assista matches as it's caught by the third, fuzzy rule.
assistan falls outside all of these, as it isn't stemmed at all and the search isn't fuzzy enough.
I think, while Lunr is obviously very powerful - it should never have been chosen for this function - unless someone can show me a query method that will work in the way I need?
@TomFoyster Since this is part of an autocomplete system, it sounds like you need prefix search, and I would agree with your assessment that lunr probably isn't a good fit for this task. If you don't need stemming itself (e.g. normalizing jumped, jumps, and jumping to jump), you could turn stemming off in the pipeline, and that might give you better results. I find wildcards and stemming create a strange situation: assistan* won't match assistant, because the latter is stemmed down to assist. If you're looking for fuzzy searches, you may have better luck with a library like http://fusejs.io - I haven't used it myself, but it seems to be a more suitable fit. I don't know how tightly integrated lunr is into your application, though!
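To make the wildcard/stemming interaction concrete, here is a minimal plain-JavaScript sketch that does not use lunr at all. It assumes a trailing-wildcard query behaves like a prefix test against the stored terms and that the typed prefix itself is not stemmed; `prefixMatches` and the two arrays are hypothetical stand-ins for an index built with and without the stemmer.

```javascript
// The stem "assist" is the one reported earlier in this thread for "Assistant".
const stemmedIndex = ['assist']      // term stored when the pipeline stems
const unstemmedIndex = ['assistant'] // term stored when the stemmer is removed

// Hypothetical helper: emulate a trailing-wildcard query such as "assistan*".
function prefixMatches(indexTerms, prefix) {
  return indexTerms.filter((term) => term.startsWith(prefix))
}

const typed = ['ass', 'assi', 'assis', 'assist', 'assista', 'assistan', 'assistant']
for (const prefix of typed) {
  const withStemming = prefixMatches(stemmedIndex, prefix).length > 0
  const withoutStemming = prefixMatches(unstemmedIndex, prefix).length > 0
  // Columns: typed prefix, match with stemming, match without stemming
  console.log(prefix.padEnd(10), withStemming, withoutStemming)
}
```

In this sketch, every prefix longer than assist fails against the stemmed term but still matches the unstemmed one, which is why turning stemming off tends to suit an autocomplete better.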
As @hoelzro has suggested, removing the stemming is probably the right approach here; I've updated the fiddle to show how.
Hopefully, without stemming, you should get results that make more sense in an autocomplete. Autocomplete wasn't the original intended use case for Lunr, but I would hope it is at least possible to get reasonable results with the right configuration.
"This can be solved by upping the editDistance value to 6"
Yeah, that is going to be slow! That will result in 125,549 lookups against the index:
lunr.TokenSet.fromFuzzyString("assistan", 6).toArray().length
If speed is a concern, you can drop the leading wildcard from the search; it will be a tradeoff, though, between speed and result accuracy/recall.
I can put together a guide on the website about setting up queries/indexes for use in an autocomplete search that might help others in the future.
Borrowing this issue: I have the same problem, but I can't quite figure out how to turn off the stemmer. :)
EDIT: Found it :)
this.pipeline.remove(lunr.stemmer)
this.searchPipeline.remove(lunr.stemmer)
Hi all, I'm having some issues with inconsistent results in Lunr.
Presuming I have 1 object in my store,
{ label: Assistant }
A search for assist yields one result, however a search for assista yields none. In fact, there's no result for assistan either, but assistant returns the expected result.
I've created a JSFiddle that replicates the issue:
https://jsfiddle.net/mfzx2gq9/1/
There are similar issues with the words Nursing and Accommodation too.
What's going on? Is this an issue with stemming? If so, how can it be resolved?