lucaong / minisearch

Tiny and powerful JavaScript full-text search engine for browser and Node
https://lucaong.github.io/minisearch/
MIT License

Score enhancements #52

Closed: normgh closed this issue 4 years ago

normgh commented 4 years ago

Hi Luca,

Congratulations, and thank you, for writing such great code and creating such an excellent client side search solution.

Minisearch is awesome!

I'm very keen to implement it in an application that searches approx 6000 food products, and I'm hoping that you may be able to give me some advice on the best way to improve some of the score results that I'm getting on my data.

My customers search by product codes, or product descriptions and/or product brands so those are the 3 fields that I'm searching on.

I'm experimenting with fuzzy settings of around .5 to catch spelling issues on words like broccoli, for which I'm using test cases of 'br', 'bro', 'broc', 'broco', 'brocol', 'brocoli' etc

One of my other main test cases is 'cheese' eg 'çh', 'che', 'chee', 'chees', 'çheese'

I'm using the following boost settings - product code (2.1) product description (2) product brand (1.5)

I've put together a Google sheet to show 4 examples of where I would like to get different score results. The sheet is at https://docs.google.com/spreadsheets/d/1gKS2nbeF4TivgRcXDDdc6LmLc6-Q0dRksnUSWNvIbZo/edit?usp=sharing

I can provide a json file of the full product data if that helps.

Thanks again for creating and sharing minisearch.

Regards Norm Archibald.

lucaong commented 4 years ago

Hi @normgh , thanks for your kind words!

In general, your case sounds feasible. For the "broccoli" case, I would use a lower fuzziness (0.2 or 0.25 should be enough for misspellings like "brocoli" or "brocolli"; 0.5 is quite high and might degrade performance and give you false positives), and enable prefix search to account for searches like 'br', 'bro', 'broc', etc.

For the 'cheese' case, if you want 'çheese' to match, you should use the processTerm option to perform some normalization (like replacing ç with c).

Does this answer your question?
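
As a rough illustration, the suggestions above might translate into search options like the following. This is only a sketch: the field names `code`, `description`, and `brand` are hypothetical stand-ins for the three fields described in the question, and the boost values are the ones quoted there. `boost`, `prefix`, and `fuzzy` are MiniSearch search options.

```javascript
// Hypothetical field names; boost values are those quoted in the question.
const searchOptions = {
  boost: { code: 2.1, description: 2, brand: 1.5 },
  prefix: true, // 'br', 'bro', 'broc' match indexed terms that start with them
  fuzzy: 0.2 // small tolerance for misspellings, relative to the term length
}

// With the library loaded, these could be set as defaults at construction, e.g.
//   new MiniSearch({ fields: ['code', 'description', 'brand'], searchOptions })
console.log(searchOptions)
```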

normgh commented 4 years ago

Hi Luca, thanks for the quick reply. Prefix search was, and still is enabled. I adjusted the fuzzy setting to .2 and also tried other values between .2 and .5, however I am only getting a score match on 'broco' when I have fuzzy set to .5 or above.

lucaong commented 4 years ago

@normgh that's true, but matching broco with broccoli would in general involve a "fuzzy prefix" search, which is not available. Such a feature would be very inefficient and lead to many false positives (broc would also match biochemist, bocadillo, brother, roche, procedure, etc.). You can increase the fuzziness to 0.5 to match this specific case, but it still would not work for longer words, say, apro for appropriation.

Some of those cases of common misspellings can be dealt with through normalization. For example, in your case, you could use processTerm to normalize terms upon indexing (and search), for instance to remove double consonants (so that broccoli would be indexed as brocoli). This strategy is similar to what stemming does for common inflections, and it is also what I suggest for normalizing characters like ç, ü, or ł. Since normalization depends heavily on the specific use case, MiniSearch by default merely normalizes casing, and lets you provide your own normalization and stemming if needed by setting a custom processTerm.

In MiniSearch, fuzzy match and prefix search are two distinct strategies:

The two can be combined, but that means that both strategies are executed in parallel, not that the prefix search can be fuzzy.
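
The behaviour described in this comment can be modelled with a small standalone sketch. It is not MiniSearch's implementation: the vocabulary is made up, the edit distance is plain Levenshtein, and the rule that a fractional fuzzy value is multiplied by the query term length and rounded to the nearest integer is an assumption. Under those assumptions it reproduces the observation that 'broco' only reaches 'broccoli' at a fuzziness of 0.5, at which point unrelated terms match too.

```javascript
// Toy vocabulary standing in for the indexed terms (hypothetical data).
const vocabulary = ['broccoli', 'bran', 'box', 'brother', 'appropriation']

// Classic Levenshtein distance: insertions, deletions, substitutions.
function editDistance (a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j)
  for (let i = 1; i <= a.length; i++) {
    const curr = [i]
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost)
    }
    prev = curr
  }
  return prev[b.length]
}

// Assumption: fractional fuzziness is scaled by the query term length.
const maxDistance = (term, fuzzy) => Math.round(term.length * fuzzy)

// Strategy 1: exact prefix match.
const prefixMatches = (query) => vocabulary.filter(t => t.startsWith(query))

// Strategy 2: fuzzy match against whole terms.
const fuzzyMatches = (query, fuzzy) =>
  vocabulary.filter(t => editDistance(query, t) <= maxDistance(query, fuzzy))

// Combining them yields the union of both result sets, not a "fuzzy prefix".
const combined = (query, fuzzy) =>
  [...new Set([...prefixMatches(query), ...fuzzyMatches(query, fuzzy)])]

console.log(combined('broc', 0.2)) // [ 'broccoli' ] (prefix match)
console.log(combined('broco', 0.2)) // [] (not a prefix; distance 3 > 1)
console.log(combined('broco', 0.5)) // [ 'broccoli', 'bran', 'box' ]
console.log(combined('apro', 0.5)) // [] (distance to 'appropriation' is 9)
```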

normgh commented 4 years ago

Thanks for the great explanation. It's very helpful in building my understanding of how minisearch works. Eventually I hope to understand the whole process. While I'm getting up to speed I hope you are ok with me asking a few more questions, and possibly the occasional 'dumb' one or two?

On the query term 'chi', does minisearch give a score for every instance of the characters that are found in the result terms? I.e. will 'çhow chow' get a score for both instances of the 'çh' characters? If this is the case, then I'm thinking that if changes were made so that the additional character instances were not scored, then searching on 'chi' would return 'chicken' ahead of 'chow chow', and such a scoring change may also assist in my 'broco' example?

normgh commented 4 years ago

Hi Luca, I found that, for a single query term, the scores of all matching terms in a document are added together. This means that when searching for 'broc' with a fuzzy setting of .5, the product 'Oil Rice Bran Bag N Box' ranked higher than 'brocolli', because 'Oil Rice Bran Bag N Box' received a score for each matched term, when IMO it should only get the score of its highest-scoring matched term. I've made a quick alteration to my code to do this, and I'm now getting results that are closer to the ones I want.
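
The difference between the two scoring approaches can be sketched as follows. This is not the alteration described above (which is not shown in the thread), and the scores are invented for illustration; only the product names come from the comment. The helper `rank` and its `combine` argument are hypothetical.

```javascript
// Hypothetical per-term match scores for the query 'broc' with a high
// fuzziness. The numbers are invented, not produced by MiniSearch.
const matches = [
  { doc: 'Broccoli', term: 'broccoli', score: 1.2 },
  { doc: 'Oil Rice Bran Bag N Box', term: 'bran', score: 0.7 },
  { doc: 'Oil Rice Bran Bag N Box', term: 'box', score: 0.7 }
]

// Fold the scores of all matched terms of each document with `combine`,
// then return the document names ordered by descending score.
function rank (matches, combine) {
  const byDoc = new Map()
  for (const { doc, score } of matches) {
    byDoc.set(doc, byDoc.has(doc) ? combine(byDoc.get(doc), score) : score)
  }
  return [...byDoc.entries()].sort((a, b) => b[1] - a[1]).map(([doc]) => doc)
}

// Summing: two weak fuzzy matches (0.7 + 0.7) outrank one good match (1.2).
const summed = rank(matches, (a, b) => a + b)

// Keeping only the best matched term per document: 1.2 beats 0.7.
const best = rank(matches, (a, b) => Math.max(a, b))

console.log(summed[0]) // Oil Rice Bran Bag N Box
console.log(best[0]) // Broccoli
```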

lucaong commented 4 years ago

Hi @normgh , yes, that's correct. Great that you are getting closer to what you need.

In general, having a high fuzziness (and 0.5 is quite high) leads to more false positives. My recommendation would be to deal with common misspellings through normalization, and use a small fuzziness on top of it (like 0.2 or 0.25) to catch the remaining ones. In your case, you could normalize terms before indexing by removing double consonants and replacing some common non-ASCII characters with ASCII equivalents (ü -> u, ç -> c, etc.). If you do so, broc will not match Oil Rice Bran Bag N Box, and will instead match Broccoli (and broco will match Broccoli if you use prefix search, thanks to the normalization).

A possible processTerm function that would provide a starting point for such normalization is:

const processTerm = (term, _fieldName) =>
  term
    .toLowerCase() // normalize case
    .replace(/([a-z])\1+/g, '$1') // collapse repeated letters ("cc" -> "c")
    .replace(/ç/g, 'c') // replace every ç, not just the first; ...etc.

Of course, consider this a starting point and adapt/optimize it to your needs.
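
To see what this kind of normalization does, here is a small standalone check. It uses a variant of the function above with global replacements, and a made-up list of terms. The prefix test with `startsWith` only mimics the idea of prefix search; it is not how MiniSearch looks up terms.

```javascript
// Variant of the normalization above, using global regular expressions.
const processTerm = (term, _fieldName) =>
  term
    .toLowerCase()
    .replace(/([a-z])\1+/g, '$1')
    .replace(/ç/g, 'c')

// Hypothetical product terms, normalized as they would be before indexing.
const indexed = ['Broccoli', 'Brocolli', 'Cheese'].map(t => processTerm(t))
console.log(indexed) // [ 'brocoli', 'brocoli', 'chese' ]

// The query has to go through the same normalization as the indexed terms.
const query = processTerm('broco')
console.log(indexed.filter(t => t.startsWith(query))) // [ 'brocoli', 'brocoli' ]

// 'çheese' and 'chee' normalize to 'chese' and 'che', so both still line up.
console.log(processTerm('çheese'), processTerm('chee')) // chese che
```

One side effect worth noting: because repeated letters are collapsed everywhere, legitimately distinct terms that differ only by a doubled letter end up as the same indexed term.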

mashpie commented 4 years ago

My 2 cents on normalization: https://www.npmjs.com/package/diacritics helps nicely...
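
For comparison, a dependency-free alternative is sketched below. It does not use the package linked above: it relies on the built-in `String.prototype.normalize` to decompose accented characters and then strips the combining marks. The helper name `stripDiacritics` is hypothetical. Characters without a canonical decomposition, such as ł, pass through unchanged, which is one reason a lookup-table approach can be preferable.

```javascript
// Decompose characters (NFD), then drop combining marks (U+0300 to U+036F).
const stripDiacritics = (s) =>
  s.normalize('NFD').replace(/[\u0300-\u036f]/g, '')

console.log(stripDiacritics('çheese')) // cheese
console.log(stripDiacritics('über')) // uber
console.log(stripDiacritics('łódź')) // łodz ('ł' has no decomposition)
```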