olivernn / lunr.js

A bit like Solr, but much smaller and not as bright
http://lunrjs.com
MIT License
8.94k stars 548 forks source link

Missing results within edit distance in fuzzy search #390

Closed lucaong closed 5 years ago

lucaong commented 5 years ago

Hello, the issue https://github.com/olivernn/lunr.js/issues/375 seems to be still occurring in some cases. Fuzzy search, even with stemming disabled, misses some words within the given edit distance. In particular, it seems to miss terms that are equal to the search term plus a suffix as long as the maximum edit distance.

In other words, searching for abc~2 correctly matches abcx, but misses abcxx. Searching for abc~3 correctly matches abcx and abcxx, but misses abcxxx.

Here is a fiddle reproducing the issue on Lunr v3.2.5 (latest release at the moment of posting): https://jsfiddle.net/3ajf72Ly/1/

I know that https://github.com/olivernn/lunr.js/pull/382 addressed this already, and it did fix some cases, but not all.

This is also visible from the expansion of lunr.TokenSet.fromFuzzyString("abc", 2).toArray(), where abc** is missing:

**abc, **bc, **c, *a*bc, *a*c, *ab, *ab*, *ab*c, *abc, *abc*,
*ac, *acb, *b, *b*, *b*c, *bac, *bc, *bc*, *c, *cb, a*, a**, a**bc,
a**c, a*b, a*b*, a*b*c, a*bc, a*bc*, a*c, a*c*, a*cb, ab, ab*,
ab**, ab**c, ab*c, ab*c*, abc, abc*, ac, ac*, ac*b, acb, acb*, b,
b*, b*ac, b*c, ba, ba*, ba*c, bac, bac*, bc, bc*, bca

Thanks again for the great library!

olivernn commented 5 years ago

That's annoying! Thanks for reporting, I'll put together some tests and then figure out how to get a fix out.

lucaong commented 5 years ago

Thanks a lot! Here is one simple test reproducing it: https://github.com/lucaong/lunr.js/commit/24f4223ab2915566dac09e4b4979d8780a597761

I can send a PR if you wish, but for now it would only include the test, as I did not get the time to delve into the code and devise a fix yet.

olivernn commented 5 years ago

Thanks, yeah I had something similar as a test too.

I've got something that fixes that exact bug, but I'm convinced there are other bugs in that code with a similar fix, I'm just trying to find the right test to expose it.

If you look at the changes from the previous diff the fix is to unconditionally add a final frame to the stack. That same pattern is in every other fuzzy operation (insertion, deletion, substitution etc). Indeed, it is the same fix in a different location that fixes what you've reported.

Of course, I could be completely wrong, I'm having a hard time trying to wrap my head around that code at the moment!

I'll spend a bit more time trying to understand the bug in a wider context, otherwise I'll put out a patch in the next couple of days with the more specific fix to unblock you.

lucaong commented 5 years ago

Hi @olivernn Yes I understand. No need to rush for a fix, better to take the due time to properly figure it out.

There is one way I could maybe help. When working on this kind of algorithmic code, where it’s easy to overlook corner cases, I often find it useful to have one or two property based tests, where I assert some invariant on many examples of randomly generated data.

I could help writing one such test for the fuzzy search if that is useful. I have something like that on my own full text search library, and it helped me discover a number of cases: https://github.com/lucaong/minisearch/blob/master/src/SearchableMap/SearchableMap.test.js#L256

I probably won’t make it this weekend, but I might have some time soon if it’s useful.

olivernn commented 5 years ago

I've just pushed a fix for this in 2.3.6. Thanks again for taking the time to investigate and report the bug!