leeoniya / uFuzzy

A tiny, efficient fuzzy search that doesn't suck
MIT License
2.57k stars 47 forks source link

Slow match and crash for single-error strings #65

Open jquense opened 2 months ago

jquense commented 2 months ago

I am not sure what specifically is causing the regex to take such a long time to match, and the example is somewhat contrived, but this did happen to us when trying to get match info from a small list of google places. The problem in this case was easily avoided by using googles metadata but I can imagine that this can happen elsewhere:

https://codepen.io/jquense/pen/GRaKqwv

leeoniya commented 2 months ago

this is something that has come up multiple times, but i'm not sure whether the lib should address this, or the user should.

for needles that look like copy-pasted things, just falling back to exact substring match is probably the way to go. there's already a similar guard behavior to prevent outOfOrder from exploding like this when you have > 5 terms.

maybe there should be a default threshold of 25 chars or something (whatever is reasonable to type by hand). generally i'm not a huge fan of search behavior changing silently, but it's probably better than the current situation.

jquense commented 1 month ago

I can appreciate that this is a hard thing to have a default for. Is the recommendation just to bail out of using ufuzzy in the case of long needles? Even if the fallback isn't built in it may be nice for ufuzzy to provide a mode that produces info and ranges for a simple substring token match as a convenience so that you can still pass stuff through the sort and highlight paths?

leeoniya commented 1 month ago

Is the recommendation just to bail out of using ufuzzy in the case of long needles?

my recommendation is to use another uFuzzy instance configured with intraMode: 0, this way you get case-insensitive substring matching and highlighting.

however, uFuzzy has no way of making a term optional. they're all required. so even in intraMode: 0 your example would return 0 results since your needle includes , USA but the haystack does not. to get matching like this you need to use a library that builds an index and can account for term frequency / similarity (something like a trigram index); uFuzzy is just a clever regexp compiler :)