jeancroy / FuzzySearch

Fast autocomplete suggestion engine using approximate string matching
MIT License

which params should I play with? #41

Open halukkaramete opened 2 years ago

halukkaramete commented 2 years ago

Hi Jean,

Please take a look at the 2 screenshots

The first screenshot is showing the search results of "Black Seed and Honey". The second screenshot shows the same data set when just the "honey" is searched.

[Screenshot 1: Screen Shot 2022-05-10 at 1 34 35 PM]
[Screenshot 2: Screen Shot 2022-05-10 at 1 35 31 PM]

When "honey" is searched as part of "black seed and honey", almost all of the honey entries get crushed and pushed down. When the word "black" is a full match I cannot criticize it, because both "black" and "honey" are 5 letters... But the following entries should not end up with more points than the entries in the 2nd screenshot I posted above.

[Screenshot: Screen Shot 2022-05-10 at 1 46 21 PM]

What do I need to change so that the entry above is valued less than a full-word match on "honey"?

Also, as the examples below demonstrate, these are very poor matches. Which param in the options should be set to which value so that the algorithm does not treat these as great matches and gives no score to such scattered matches?

[Screenshot: Screen Shot 2022-05-10 at 1 52 11 PM]

As you can easily tell, these have nothing to do with "Black Seed and Honey", and entries like the ones below (with perfect, bingo "honey" matches) should definitely rank above them.

[Screenshot: Screen Shot 2022-05-10 at 1 54 34 PM]

Sometimes I think of taking the full result set of the fuzzy search and running it through a loop of my own, to weed out the scattered matches and move the bingo words to the top by revaluing them with a higher score.

But if we can already do this with your params, then why get into a second loop and a second scoring process, right?

jeancroy commented 2 years ago

I think the issue you are seeing is that once there is a perfect match (the query is in the document), we stop there.

This library has no concept of themes, so it cannot tell that this paragraph has more "honey" than the other one.

It would not be crazy to do some pre-processing (i.e. extract themes or keywords) and search those instead of the full text (or with a bonus). It's also very possible that you need your own layer of logic that runs once matches pass a minimum quality score. That minimum quality can be set high, because what you want is to order documents that all have a perfect match (as far as this library is concerned).
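
A minimal sketch of what such an extra layer of logic could look like, assuming the results are plain strings that already passed the fuzzy search and arrive in its order. Every name here (`STOP_WORDS`, `tokenize`, `wholeWordHits`, `rerank`, `minHits`) is hypothetical and not part of FuzzySearch, and counting whole-word hits is only one possible stand-in for "theme":

```javascript
// Hypothetical post-processing layer; nothing here is FuzzySearch API.
// Connector words that should not count as "bingo" matches (illustrative list).
const STOP_WORDS = new Set(["and", "or", "the", "of", "a", "an", "with"]);

// Split text into lowercase ASCII word tokens (other scripts are ignored here).
function tokenize(text) {
  return text.toLowerCase().match(/[a-z0-9]+/g) || [];
}

// Count how many distinct meaningful query words occur as whole words in doc.
function wholeWordHits(query, doc) {
  const docWords = new Set(tokenize(doc));
  const queryWords = new Set(
    tokenize(query).filter((word) => !STOP_WORDS.has(word))
  );
  let hits = 0;
  for (const word of queryWords) {
    if (docWords.has(word)) hits += 1;
  }
  return hits;
}

// Re-order results so entries with more whole-word hits come first.
// Ties keep the order they arrived in; entries below minHits are dropped.
function rerank(query, results, minHits = 0) {
  return results
    .map((doc, index) => ({ doc, index, hits: wholeWordHits(query, doc) }))
    .filter((entry) => entry.hits >= minHits)
    .sort((a, b) => b.hits - a.hits || a.index - b.index)
    .map((entry) => entry.doc);
}

// Usage with made-up entries (not the data from the screenshots):
const sample = [
  "Blank sheet and hockey",
  "Raw honey jar",
  "Black seed oil with honey",
];
console.log(rerank("Black Seed and Honey", sample));
console.log(rerank("Black Seed and Honey", sample, 1));
```

With `minHits` left at 0 nothing is removed and scattered matches only sink to the bottom; raising it to 1 drops entries that contain none of the query words as whole words. Whether to drop or merely demote depends on how much you trust the fuzzy pass.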