olivernn / lunr.js

A bit like Solr, but much smaller and not as bright
http://lunrjs.com
MIT License
8.89k stars 548 forks source link

Fuzzy search edit distance #410

Closed LukasBreitwieser closed 4 years ago

LukasBreitwieser commented 5 years ago

Hi Oliver,

Great project! I played a bit around with fuzzy search and observed the following behavior:

Screenshot from 2019-07-10 09-56-05

Initially, I thought it is a bug, since admiral and addition have a Levenshtein distance of 5. However, I was looking a bit further and came across the Soundex distance. According to [1] admiral and addition have a Soundex distance of 2.

Are you using Soundex or is this a bug? If you are using Soundex, is there a way to use Levenshtein distance instead?

Lukas

[1] http://www.ripelacunae.net/projects/levenshtein/

olivernn commented 5 years ago

I think this is due to stemming.

By default, and this is how the demo is setup, all words are stemmed before entering the index. As a result neither "admiral" nor "addition" are stored as is in the index, instead their stemmed form is:

It is these words that are then used when calculating the edit distance, which is why "addition" matches a search for "admiral~2".

Stemming has been the cause of much confusion in the past (just take a look through some of the closed issues) but I think this is the first time I've seen it combine with an edit distance search to produce 'wrong' results.

Disabling the stemmer would stop returning 'wrong' results like this, though it would probably lead to other unexpected results in other searches.

So, lunr doesn't use soundex and I'm not sure it's technically a bug, more like one of the many compromises present in implementing full text search.