leeoniya / uFuzzy

A tiny, efficient fuzzy search that doesn't suck
MIT License
2.65k stars 48 forks source link

Can't tollerate error in first character in SingleError mode #39

Closed arctica closed 1 year ago

arctica commented 1 year ago

It seems like errors in the first letter of a word are not handled even when SingleError mode is used.

For example Direball does not return results containing Fireball.

Test: https://leeoniya.github.io/uFuzzy/demos/compare.html?libs=uFuzzy&intraMode=1&intraIns=1&intraChars=[a-z\d%20]&outOfOrder&search=Direball

I found the intraSlice option and tried with [0, Infinite] but that didn't do the trick.

leeoniya commented 1 year ago

hmm, [0, Infinity] should do it :thinking:

leeoniya commented 1 year ago

this appears to be working?

https://jsfiddle.net/j7xkba1v/2/

let haystack = ["Fireball"];

let needle = "Direball";

let uf = new uFuzzy({
  intraMode: 1,
  intraIns:  1,
  intraSub:  1,
  intraTrn:  1,
  intraDel:  1,
  intraSlice: [0, Infinity],
});

let out = uf.search(haystack, needle);

console.log(out);
leeoniya commented 1 year ago

maybe you forgot to set the additional tolerances? (intraMode: 1 does not auto-enable them).

this is kind of a footgun tho, maybe uFuzzy should auto-set them all when intraMode: 1 and require opting out rather than opting in.

leeoniya commented 1 year ago

that being said, you probably dont want unconditional intraSlice: [0, Infinity], since typing a single letter will bring back the entire haystack. you most likely want these tolerance rules to be dynamic based on term length, which is what intraRules is for:

https://github.com/leeoniya/uFuzzy/blob/9486cf58cd937bc15082de217febfbf793bd589a/dist/uFuzzy.d.ts#L147-L154

the default intraRules already have some small-word guards that prevent a bad experience, so you may want to copy that fn and provide your intraSlice logic in addition to this:

https://github.com/leeoniya/uFuzzy/blob/9486cf58cd937bc15082de217febfbf793bd589a/src/uFuzzy.js#L150-L188

arctica commented 1 year ago

Thanks for the fast reply. I've noticed that the reason why my test with intraSlice: [0, Infinity] didn't work is because the string in the hackstack contained more than the needle word (think Fireball Test. It seems like it doesn't work as a partial/prefix match anymore.

So searching for Dire will match when the haystack is Fire but wont match if the haystack is Fireball. In the Fiddle you provided you can see this by changing the needle to Dire instead of Direball. If we instead search for Fire it will match Fireball just fine. Weirdly though the needles Direb or Direba and so on do return a match.

Thank you for pointing out the intraRules option though, that will be handy.

leeoniya commented 1 year ago

didn't work is because the string in the hackstack contained more than the needle word (think Fireball Test. It seems like it doesn't work as a partial/prefix match anymore.

if you want only prefix matching you should set interLft: 2.

replacing Fireball with Fireball Test in the haystack still works just fine for "Direball" needle. but as i explained above there is additional intraRules logic to make the matching more strict for needle terms <= 4 chars, which is probably why Dire does not work. maybe that default logic needs to be tweaked, as it currently assumes intraSlice is the default [1, Infinity]. (PRs are welcome).

you can replace intraRules with your own function that has no special treatment of small terms and follows the same tolerance rules exactly for all terms. i'm not sure you'll be happy with the result of this, though...i wasn't. do you really want:

arctica commented 1 year ago

Oh sorry. I didn't understand the infraRules part correctly before. I see now that it has special logic for terms shorter than 5 characters which is why Dire was the cut-off point while Direb still worked.

Having run tests on my production haystack of roughly 25k entries I am definitely seeing considerably better results by overriding infraRules for short terms which would otherwise result in no matches at all. One example I saw was cork not matching a name starting with Cook County without the override. It does have side effects like changing the results that would have matches with and without this setting but in those cases it was sometimes hard to tell which one should be preferred.

I should point out though that in my application it is much worse to show no results than to show something that is not exactly a good match. When the haystack is big enough, it seems even short search terms will hit enough good results. On the flip side if results are omitted due to a single character error then that has a much higher negative impact on the user. We do start the search only if there are at least 2 characters. We then take the top N results and present them to the user.

So with the following options I'm seeing pretty decent results.

const opts = {
  intraMode: 1,
  intraChars: `[a-z\d' ]`,

  intraRules: () => {
    return {
      intraIns: 1,
      intraSub: 1,
      intraTrn: 1,
      intraDel: 1,
      intraSlice: [0, Infinity]
    };
  }
}

For longer search terms allowing more fuzzyness (e.g. 2 errors) could improve the results even further. But I think uFuzzy does not support that.

leeoniya commented 1 year ago

One example I saw was cork not matching a name starting with Cook County without the override.

i've never misspelled cook as cork personally since o and r arent very close on a QWERTY layout. but i also used to live in Cook County, so i might be biased :rofl:

For longer search terms allowing more fuzzyness (e.g. 2 errors) could improve the results even further. But I think uFuzzy does not support that.

yeah more than 1 error is not really feasible since uFuzzy effectively constructs all possible regexp permutations of a single error for each term. those regexps can get pretty large already. any additional fuzziness would be too absurd for the current design 😅 . im happy with the tradeoff and outcome. people who need something different have quite a few other options to choose from.

arctica commented 1 year ago

Makes sense. It's pretty amazing and surprising that the regexp approach works as good as it does :)

The cork example might have not been the best but you can take cooj if you will. That's right next on the keyboard and just a single error (not even at the start) away but results in no matches. Cork could have been a typo introduced by a spelling auto correct I guess. Sometimes they are hurting rather than helping :)

This plus further testing where I saw results getting sometimes a bit too far from reasonable for very short terms have lead me to an approach that so far looks like a pretty good tradeoff:

I first do a search without the intraRules override and only when there are fewer than X (e.g. 10) results do I do another round but this time with the intraRules override. This has the benefit of returning better matches for short terms in the usual case and has noticably better performance (up to 10x faster) while still not missing matches that need a bit more fuzziness.

but i also used to live in Cook County, so i might be biased rofl

haha, what a funny coincidence :)