Closed: normgh closed this issue 4 years ago
Hi @normgh , thanks for your kind words!
In general, your case sounds feasible. For the "broccoli" case, I would use a lower fuzziness (0.2 or 0.25 should be enough for misspellings like "brocoli" or "brocolli"; 0.5 is quite high and might degrade performance and give you false positives), and enable prefix search to account for searches like `br`, `bro`, `broc`, etc.
For the 'cheese' case, if you want 'çheese' to match, you should use the `processTerm` option to perform some normalization (like replacing `ç` with `c`).
Does this answer your question?
Hi Luca, thanks for the quick reply. Prefix search was, and still is, enabled. I adjusted the fuzzy setting to .2 and also tried other values between .2 and .5; however, I am only getting a score match on 'broco' when I have fuzzy set to .5 or above.
@normgh that's true, but matching `broco` with `broccoli` would in general involve a "fuzzy prefix" search, which is not available. Such a feature would be very inefficient and lead to many false positives (`broc` would also match `biochemist`, `bocadillo`, `brother`, `roche`, `procedure`, etc.). You can increase the fuzziness to 0.5 to match this specific case, but it still would not work for longer words, say, `apro` for `appropriation`.
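A quick way to see why such a high fuzziness is needed here: the edit distance from 'broco' to 'broccoli' is 3, well beyond what a fuzziness of 0.2 (roughly, a maximum edit distance proportional to the term length) would allow. A toy Levenshtein distance sketch, purely for illustration (this is not MiniSearch's internal implementation):

```javascript
// Classic dynamic-programming Levenshtein distance between two strings.
const levenshtein = (a, b) => {
  // dp[i][j] = edit distance between a[0..i) and b[0..j)
  const dp = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0))
  )
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1, // deletion
        dp[i][j - 1] + 1, // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1) // substitution / match
      )
    }
  }
  return dp[a.length][b.length]
}

console.log(levenshtein('broco', 'broccoli'))   // 3
console.log(levenshtein('brocoli', 'broccoli')) // 1
```

Note how `brocoli` is only one edit away from `broccoli`, which is why a small fuzziness catches the plain misspelling but not the truncated `broco`.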
Some of those cases of common misspelling can be dealt with through normalization. For example, in your case, you could use `processTerm` to normalize terms upon indexing (and search), for example to remove double consonants (so that `broccoli` would be indexed as `brocoli`). This strategy is similar to what stemming does for common inflections, and it is also what I suggest for normalizing characters like `ç`, `ü`, or `ł`. Since normalization is heavily dependent on the specific use case, `MiniSearch` by default merely normalizes casing, and lets you provide your own normalization and stemming if needed by setting a custom `processTerm`.
In `MiniSearch`, fuzzy match and prefix search are two distinct strategies:

- prefix search matches terms that begin with the query term, exactly (no typo tolerance);
- fuzzy search matches whole terms within a maximum edit distance from the query term.

The two can be combined, but that means that both strategies are executed in parallel, not that the prefix search can be fuzzy.
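In code, combining the two looks like the following fragment (a sketch based on the MiniSearch search options; `miniSearch` is assumed to be an already-built index):

```javascript
// Prefix and fuzzy run as independent strategies, and their result
// sets are merged. The prefix match itself is always exact.
miniSearch.search('broc', {
  prefix: true, // exact prefix: matches 'broccoli', 'brocolli', ...
  fuzzy: 0.2    // whole-term fuzzy: edit distance up to ~20% of term length
})
```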
Thanks for the great explanation. It's very helpful in building my understanding of how minisearch works. Eventually I hope to understand the whole process. While I'm getting up to speed I hope you are ok with me asking a few more questions, and possibly the occasional 'dumb' one or two?
On the query term 'chi' does minisearch give a score for every instance of the characters that are found in the result terms? I.e., will 'çhow chow' get a score for both instances of the 'çh' characters? If this is the case, then I'm thinking that if changes were made so that the additional character instances were not scored, then searching on 'chi' would return 'chicken' ahead of 'chow chow', and such a scoring change may also assist in my 'broco' example?
Hi Luca, I found that, for each single query term, the scores of all matching terms are added together. This means that, when searching the term 'broc' with a fuzziness of .5, the product 'Oil Rice Bran Bag N Box' was ranking higher than 'brocolli', as 'Oil Rice Bran Bag N Box' was getting a score for each matched term, when IMO it should only get the score from the highest-scored matched term. I've made a quick alteration to my code to do this, and I'm now getting results that are closer to the ones I want.
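To make the difference concrete, here is a toy sketch of the two aggregation strategies (my own illustration with made-up scores, not MiniSearch code): with a high fuzziness, one document can match several indexed terms for a single query term, and summing those per-term scores favors many weak matches over one strong one.

```javascript
// Made-up per-term match scores for a single query term ('broc'):
const termScores = {
  'Oil Rice Bran Bag N Box': [4, 3, 3], // three weak fuzzy matches
  'Brocolli': [9]                       // one strong match
}

// Sum all matched-term scores (many weak matches accumulate):
const sumScore = (scores) => scores.reduce((a, b) => a + b, 0)
// Keep only the best matched-term score:
const maxScore = (scores) => Math.max(...scores)

console.log(sumScore(termScores['Oil Rice Bran Bag N Box'])) // 10 (beats 9)
console.log(maxScore(termScores['Oil Rice Bran Bag N Box'])) // 4 (loses to 9)
console.log(maxScore(termScores['Brocolli']))                // 9
```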
Hi @normgh , yes, that's correct. Great that you are getting closer to what you need.
In general, having a high fuzziness (and 0.5 is quite high) leads to more false positives. My recommendation would be to deal with common misspellings with normalization, and use a small fuzziness on top of it (like 0.2 or 0.25) to catch the remaining ones. In your case, you could normalize terms before indexing by removing double consonants and replacing some common non-ASCII characters with ASCII equivalents (`ü` -> `u`, `ç` -> `c`, etc.). If you do so, `broc` will not match `Oil Rice Bran Bag N Box`, and will instead match `Broccoli` (and `broco` will match `Broccoli` if you use prefix search, thanks to the normalization).
A possible `processTerm` function that would provide a starting point for such normalization is:

```javascript
const processTerm = (term, _fieldName) =>
  term
    .toLowerCase()               // normalize case
    .replace(/([a-z])\1/g, '$1') // collapse double characters
    .replace(/ç/g, 'c')          // ...etc. (a regex with /g replaces all occurrences)
```

Of course, consider this a starting point and adapt/optimize it to your needs.
My 2 cents on normalization: https://www.npmjs.com/package/diacritics helps nicely...
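As an alternative to pulling in a package, plain Unicode normalization covers many of the same cases (a sketch; note it handles combining-mark characters like `ç` and `ü`, but not letters such as `ł` that have no canonical decomposition):

```javascript
// NFD splits accented characters into a base letter plus combining marks
// (e.g. 'ç' -> 'c' + U+0327); the regex then strips the combining marks.
const stripDiacritics = (term) =>
  term.normalize('NFD').replace(/[\u0300-\u036f]/g, '')

console.log(stripDiacritics('çheese')) // 'cheese'
console.log(stripDiacritics('über'))   // 'uber'
```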
Hi Luca,
Congratulations, and thank you, for writing such great code and creating such an excellent client side search solution.
Minisearch is awesome!
I'm very keen to implement it in an application that searches approx. 6000 food products, and I'm hoping that you may be able to give me some advice on the best way to improve some of the score results that I'm getting on my data.
My customers search by product codes, or product descriptions and/or product brands so those are the 3 fields that I'm searching on.
I'm experimenting with fuzzy settings of around .5 to catch spelling issues on words like broccoli, for which I'm using test cases of 'br', 'bro', 'broc', 'broco', 'brocol', 'brocoli' etc
One of my other main test cases is 'cheese' eg 'çh', 'che', 'chee', 'chees', 'çheese'
I'm using the following boost settings: product code (2.1), product description (2), product brand (1.5).
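For reference, those boosts would map onto MiniSearch search options roughly like this (a sketch; the field names `code`, `description`, and `brand` are my guesses for the schema, not the actual field names):

```javascript
// Hypothetical search options mirroring the boost values above.
const searchOptions = {
  boost: { code: 2.1, description: 2, brand: 1.5 },
  prefix: true
}
```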
I've put together a Google sheet to show 4 examples of where I would like to get different score results. The sheet is at https://docs.google.com/spreadsheets/d/1gKS2nbeF4TivgRcXDDdc6LmLc6-Q0dRksnUSWNvIbZo/edit?usp=sharing
I can provide a json file of the full product data if that helps.
Thanks again for creating and sharing minisearch.
Regards Norm Archibald.