leeoniya / uFuzzy

A tiny, efficient fuzzy search that doesn't suck
MIT License

uFuzzy: distinguish between C, C++ and C# in haystack #30

Closed by stargazer33 1 year ago

stargazer33 commented 1 year ago

I have a haystack containing programming/tech terms with special symbols:

C#
C++
C
.NET
C/C++ 
$105,000
#LI-Remote

(I'm searching tech job ads)

What uFuzzy settings are needed to match these terms? I tried to tweak the intraChars but without success - I see that uFuzzy can not distinguish between C, C++ and C#

leeoniya commented 1 year ago

what you're showing doesn't look like a job for a fuzzy matcher but for exact matches with a tag index. e.g. you cannot fuzzy match $105,000 with $95k, even though logically they're close.

if you're looking for exact terms/tags that include punctuation, i'd recommend doing exact matches to pre-filter the haystack before using ufuzzy on stuff like job descriptions. you'll benefit a lot from building up a tags index for known things like comp ranges, langs, and tech stack items.

stargazer33 commented 1 year ago

you cannot fuzzy match $105,000 with $95k

you are right! My idea was to distinguish between 105,000 and $105,000, nothing else :-)

More generally speaking: how to search for strings containing characters like #, ., $, +?

So, are you suggesting something like this, right? (pseudocode):

if (needle contains one or more characters  `#`, `.`, `$`, `+` ){
  do exact match without uFuzzy
}
else {
  uFuzzy.search(haystack, needle)
  ...
}

And I can imagine that somebody needs an exact search for one term and a fuzzy search for another. Take a needle like C# europ and a haystack containing things like C++, C#, Europe, European.

What COULD be a solution: to indicate "I want exact match for this term" using quotes. Something like: "C#" europ. I think Google supports something like this, so this syntax is known to end-users...
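To make the quoting idea concrete, here is a minimal sketch of how a needle could be split into quoted exact terms and a fuzzy remainder before anything is handed to uFuzzy. This is not how uFuzzy parses needles; parseNeedle is a hypothetical helper, and unbalanced quotes are simply left in the fuzzy part.

```javascript
// Hypothetical helper, not part of uFuzzy: splits a needle such as
// '"C#" europ' into quoted exact terms and the remaining fuzzy text.
function parseNeedle(needle) {
  const exact = [];
  // Pull out every "quoted" run and remember its contents.
  const rest = needle.replace(/"([^"]+)"/g, (_, term) => {
    exact.push(term);
    return " ";
  });
  // Whatever is left is treated as the fuzzy part of the query.
  return { exact, fuzzy: rest.trim().replace(/\s+/g, " ") };
}
```

The exact terms can then be used for plain substring filtering, and only the fuzzy part needs to go through the fuzzy matcher.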

leeoniya commented 1 year ago

So, are you suggesting something like this, right? (pseudocode):

i mean let cSharpHaystack = haystack.filter(value => value.includes('C#')). then use cSharpHaystack for uFuzzy search.
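A rough sketch of that pre-filter step follows. One detail worth handling: a search over the filtered array reports positions in that smaller array, so they have to be mapped back if you need positions in the original haystack. The fuzzyFilter parameter is an assumption made to keep the sketch dependency-free: it stands for any function that takes (haystack, needle) and returns an array of matching indices, for example a thin wrapper around a uFuzzy instance.

```javascript
// Collect the positions of items that contain the exact term.
function prefilter(haystack, exact) {
  const idxs = [];
  for (let i = 0; i < haystack.length; i++) {
    if (haystack[i].includes(exact)) idxs.push(i);
  }
  return idxs;
}

// Run a fuzzy search only over the pre-filtered items.
// `fuzzyFilter` is a hypothetical callback: (haystack, needle) => index[].
function searchWithin(haystack, exact, fuzzyNeedle, fuzzyFilter) {
  const keptIdxs = prefilter(haystack, exact);
  const subset = keptIdxs.map((i) => haystack[i]);
  // Guard against a matcher that returns null when it has nothing to report.
  const subIdxs = fuzzyFilter(subset, fuzzyNeedle) ?? [];
  // Translate positions in `subset` back to positions in `haystack`.
  return subIdxs.map((j) => keptIdxs[j]);
}
```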

I tried to tweak the intraChars but without success - I see that uFuzzy can not distinguish between C, C++ and C#

you need to adjust interSplit. this is what segments needle terms on punctuation and whitespace. there will probably be issues with this because the remaining terms are used to construct a regex without escaping, so something like C++ will likely break things since + is a special regex char. it's probably a good idea for uFuzzy to add escaping here: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions#escaping
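For illustration, here is the escaping pattern from the linked MDN guide applied to the terms in this issue. This is a standalone sketch, not uFuzzy's internal code; literalMatcher and isValidPattern are hypothetical helper names.

```javascript
// Escape characters that have special meaning in a regular expression
// (the replacement pattern given in the MDN guide).
function escapeRegExp(str) {
  return str.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build a case-insensitive matcher that treats `term` as literal text.
function literalMatcher(term) {
  return new RegExp(escapeRegExp(term), "i");
}

// Report whether a string compiles as a regex source at all.
function isValidPattern(src) {
  try {
    new RegExp(src);
    return true;
  } catch {
    return false;
  }
}
```

Unescaped, C++ is not even a valid pattern (isValidPattern("C++") is false because the second + has nothing to repeat), and .NET would also match strings like ANET because . matches any character. After escaping, both behave as literal text.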

What COULD be a solution: to indicate "I want exact match for this term" using quotes.

certainly more features can be added, but it's a slippery slope. i already added substring exclusions because they are super useful. what if you wanted (C# || C++) europ?

again, for stuff that is not fuzzy and known in advance to exist in the haystack, you really should build up an efficient index to pre-filter by tags and not rely on a fuzzy string search.
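As a sketch of what such a tag index could look like, assuming the tags are known up front: map each tag to the set of haystack positions that contain it, then answer something like (C# || C++) with a set union before running the fuzzy part on the survivors. The function names and the tag list are made up for the example. Matching is by plain substring, so a bare C tag would need token-boundary handling that is not shown here.

```javascript
// Build a Map from each known tag to the Set of haystack positions containing it.
function buildTagIndex(haystack, tags) {
  const index = new Map();
  for (const tag of tags) index.set(tag, new Set());
  haystack.forEach((text, i) => {
    for (const tag of tags) {
      if (text.includes(tag)) index.get(tag).add(i);
    }
  });
  return index;
}

// Positions matching ANY of the given tags (logical OR), in ascending order.
function lookupAny(index, tags) {
  const hits = new Set();
  for (const tag of tags) {
    // Unknown tags contribute nothing.
    for (const i of index.get(tag) ?? []) hits.add(i);
  }
  return [...hits].sort((a, b) => a - b);
}
```

The index is built once per haystack, so each query only pays for a few set lookups before the fuzzy search runs on a much smaller list.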

leeoniya commented 1 year ago

also things with an optional prefix won't work well (105,000 and $105,000), since by default uFuzzy requires first char to match. you can tweak even more options to bypass this, but you'll just end up with more crap. basically, don't use uFuzzy for stuff you know to exist exactly in the corpus. make an index or prefilter the haystack using a regex or exact substring match.

leeoniya commented 1 year ago

i'll consider adding support for quoted exact terms. it's not an easy addition if you want highlighting and result ranking to work in a logical manner.

stargazer33 commented 1 year ago

...things with an optional prefix won't work well (105,000 and $105,000)...

Let us forget about optional prefixes with special chars like $. From my point of view this is not the most important feature. The end-user search description should just mention something like "by default uFuzzy requires first char to match."

stargazer33 commented 1 year ago

i'll consider adding support for quoted exact terms. it's not an easy addition if you want highlighting and result ranking to work in a logical manner.

Great! Absolutely great! @leeoniya I promise to test the "quoted exact terms" once it becomes available. (Unfortunately I can not really help with the implementation, I am a Java developer ...)

leeoniya commented 1 year ago

https://leeoniya.github.io/uFuzzy/demos/compare.html?libs=uFuzzy&lists=test&search=%22C%2b%2b%22

the exact matches are still case-insensitive (like google's exact matches), because the regexps are internally all case-insensitive.

you can try it out before the next release by doing a github dependency in package.json:

"@leeoniya/ufuzzy": "leeoniya/uFuzzy#8bcbb552104f20aadfabe42a304b39c60ea85338"

or just download and use the file(s) from /dist directly.

stargazer33 commented 1 year ago

@leeoniya - I used it and it looks like a working solution! I can not detect any bugs for now.

Many thanks for the great and quick implementation!