olivernn / lunr.js

A bit like Solr, but much smaller and not as bright
http://lunrjs.com
MIT License

Inconsistent Results #328

Closed TomFoyster closed 6 years ago

TomFoyster commented 6 years ago

Hi all, I'm having some issues with inconsistent results in Lunr.

Presuming I have 1 object in my store, { label: "Assistant" }

A search for assist yields one result; however, a search for assista yields none. In fact, there's no result for assistan either, but assistant returns the expected result.

I've created a JSFiddle that replicates the issue;

https://jsfiddle.net/mfzx2gq9/1/

There are similar issues with the words Nursing and Accommodation too.

What's going on? Is this an issue with stemming? If so how can it be resolved?

hoelzro commented 6 years ago

@TomFoyster You're absolutely right - it is an issue with stemming. If you try stemming each of those variants with this.pipeline.runString from within the builder function, you get these results:

assist    assist
assista   assista
assistan  assistan
assistant assist

I'm guessing the stemmer recognizes the -ant suffix and strips it out. What exactly do you want to accomplish? Do you want any partial string to match?
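To make the table above concrete, here is a minimal sketch of the mechanism in plain JavaScript. It does not use lunr at all: `toyStem` is a hypothetical stand-in that only strips a trailing "ant", whereas the real `lunr.stemmer` applies many more suffix rules. The names `indexedTerms` and `hasMatch` are likewise made up for illustration.

```javascript
// Toy stand-in for lunr.stemmer (hypothetical): it only strips a trailing "ant".
// The real stemmer applies many more suffix rules.
function toyStem(token) {
  return token.length > 3 && token.endsWith("ant") ? token.slice(0, -3) : token;
}

// Index time: the label is lowercased and stemmed before it is stored.
const indexedTerms = new Set(
  ["Assistant"].map((label) => toyStem(label.toLowerCase()))
);

// Search time: the query goes through the same stemmer,
// then has to equal a stored term exactly.
function hasMatch(query) {
  return indexedTerms.has(toyStem(query.toLowerCase()));
}

// Pair each partial input with whether it finds the document.
const outcomes = ["assist", "assista", "assistan", "assistant"].map((q) => [
  q,
  hasMatch(q),
]);
console.log(outcomes);
```

Under these assumptions only assist and assistant find the document, because they are the only inputs whose stemmed form equals the stored term. The intermediate inputs pass through the stemmer unchanged and match nothing.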

TomFoyster commented 6 years ago

@hoelzro Thanks for your reply.

Yes, essentially we want any partial to match - I think I'm struggling to get my head around stemming - this is legacy code written by a contractor that we're now needing to support.

ass, assi, assis, and assist all return a match on assistant - so we need it to follow the whole way through the word.

This is all part of an autocomplete system, so currently as the user types results appear and then disappear, and then reappear again when they finish typing the word.

Massive thanks for your help, it's greatly appreciated.

TomFoyster commented 6 years ago

@hoelzro I've made some progress, but I'm not quite there.

I've updated the fiddle to better show the issue;

https://jsfiddle.net/mfzx2gq9/5/

I've experimented with some code I've found elsewhere, which has reduced the issue down to a single instance returning no results - assistan. This can be solved by upping the editDistance value to 6 - increasing the fuzziness of the search. This, though, would lead to some poor matches and has a noticeable negative impact on the search time, even in this very small example.

If my understanding is correct, and based on the query methods in my example:

ass, assi, and assis are returned because they partially match (with no pipeline). assist and assistant match because, once stemmed, they equal the stem of assistant, which is assist. assista matches because it's caught by the third, fuzzy rule.

assistan falls outside of all of these, though, as it isn't stemmed at all and the search isn't fuzzy enough.

I think, while Lunr is obviously very powerful - it should never have been chosen for this function - unless someone can show me a query method that will work in the way I need?

hoelzro commented 6 years ago

@TomFoyster Since this is part of an autocomplete system, it sounds like you need prefix search, and I would agree with your assessment that lunr probably isn't a good fit for this task. If you don't need stemming itself (e.g. normalizing jumped, jumps, and jumping to jump), you could turn stemming off in the pipeline, and that might give you better results. I find wildcards and stemming kind of create a strange situation, since assistan* won't match assistant, then, because the latter is stemmed down to assist. If you're looking for fuzzy searches, you may have better luck with a library like http://fusejs.io - I haven't used it myself, but it seems to be a more suitable fit. I don't know how tightly integrated lunr is into your application, though!
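As a rough illustration of the prefix-search idea, the sketch below compares a prefix lookup against terms stored as written with a lookup against the stemmed term. It is plain JavaScript, not lunr; `prefixSearch`, `unstemmedTerms` and `stemmedTerms` are hypothetical names, and the only stem used is the one reported earlier in this thread (assistant becomes assist).

```javascript
// Return every term that starts with what the user has typed so far.
function prefixSearch(terms, typed) {
  const prefix = typed.trim().toLowerCase();
  if (prefix === "") return []; // nothing typed yet, so suggest nothing
  return terms.filter((term) => term.toLowerCase().startsWith(prefix));
}

// Terms stored as written (no stemming), using the labels from this thread.
const unstemmedTerms = ["assistant", "nursing", "accommodation"];
// What an index holds for "assistant" when stemming is on, per the thread.
const stemmedTerms = ["assist"];

// Simulate the user typing the word step by step.
const typedSoFar = ["ass", "assist", "assista", "assistan", "assistant"];
const unstemmedHits = typedSoFar.map(
  (typed) => prefixSearch(unstemmedTerms, typed).length
);
const stemmedHits = typedSoFar.map(
  (typed) => prefixSearch(stemmedTerms, typed).length
);
console.log({ unstemmedHits, stemmedHits });
```

With unstemmed terms every step keeps its single hit. Against the stemmed term the hits stop as soon as the typed text grows longer than assist, which is the same effect as assistan* failing to match.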

olivernn commented 6 years ago

As @hoelzro has suggested, removing the stemming is probably the right approach here; I've updated the fiddle to show how.

Hopefully without stemming you should get results that make more sense in an autocomplete. Autocomplete wasn't the original intended use case for Lunr, but I would hope it would at least be possible to get reasonable results with the right configuration.

> This can be solved by upping the editDistance value to 6

Yeah, that is going to be slow! That will result in 125,549 lookups against the index:

lunr.TokenSet.fromFuzzyString("assistan", 6).toArray().length

If speed is a concern, you can drop the leading wildcard from the search. It will be a tradeoff, though, between speed and result accuracy/recall.

I can put together a guide on the website about setting up queries/indexes for use in an autocomplete search that might help others in the future.

Frexuz commented 5 years ago

Borrowing this issue, as I have the same problem. But I can't quite figure out how to turn off the stemmer? :)

EDIT: Found it :)

this.pipeline.remove(lunr.stemmer)
this.searchPipeline.remove(lunr.stemmer)