lucaong / minisearch

Tiny and powerful JavaScript full-text search engine for browser and Node
https://lucaong.github.io/minisearch/
MIT License
4.64k stars · 133 forks

Fuzzy score issues #263

Closed Lespoir closed 1 month ago

Lespoir commented 2 months ago

Hi, we found an issue when searching an exact match on similar documents:

search options:

 miniSearch.search('butter', { fuzzy: 0.2 })

When the documents are not similar, like:

butter
buttery noddles
another thing
another thing 2

When searching for butter, the butter document has the highest score.

But when the documents are similar, like:

butter
buttery noddles
butter salted
butter unsalted

when searching for butter, the buttery noddles document has the highest score.

This doesn't make sense, because butter is an exact match.

I attached a POC with this behaviour. These are the results:

When similar results
[
  { food_name: 'Buttery Noodles', score: 0.6865758707427645 },
  { food_name: 'Butter', score: 0.6047966440700245 },
  { food_name: 'Butter Salted', score: 0.5165637119112676 },
  { food_name: 'Butter Unsalted', score: 0.5165637119112676 }
]
When non similar results
[
  { food_name: 'Butter', score: 2.0415191029874573 },
  { food_name: 'Buttery Noodles', score: 0.6865758707427645 }
]

poc.tar.gz

jacobthecoder commented 2 months ago

Hi @lucaong - I'm experiencing the same problem on my project. Any chance you could take a look? Thanks so much and sorry to bother.

lucaong commented 2 months ago

Hi @Lespoir and @jacobthecoder , thanks for your report. I will look into it. I did not try the POC yet, but I will now. If, like you say, the exact match scores lower than the fuzzy match, this should indeed be considered a bug.

lucaong commented 2 months ago

I can reproduce this, thanks for the POC.

Here is also a quick example in the Node repl:

Welcome to Node.js v19.4.0.
Type ".help" for more information.
> let MiniSearch = require('./dist/cjs/index.cjs')
undefined
> let m = new MiniSearch({ fields: ['text'], storeFields: ['text'] })
undefined
> const docs = [{id: 1, text: 'butter'}, {id: 2, text: 'buttery noddles'}, {id: 3, text: 'butter salted'}, {id: 4, text: 'butter unsalted'}]
undefined
> m.addAll(docs)
undefined
> m.search('butter')
[
  {
    id: 1,
    score: 0.6047966440700245,
    terms: [ 'butter' ],
    queryTerms: [ 'butter' ],
    match: { butter: [Array] },
    text: 'butter'
  },
  {
    id: 3,
    score: 0.5165637119112676,
    terms: [ 'butter' ],
    queryTerms: [ 'butter' ],
    match: { butter: [Array] },
    text: 'butter salted'
  },
  {
    id: 4,
    score: 0.5165637119112676,
    terms: [ 'butter' ],
    queryTerms: [ 'butter' ],
    match: { butter: [Array] },
    text: 'butter unsalted'
  }
]
> m.search('butter', { fuzzy: 0.2 })
[
  {
    id: 2,
    score: 0.6865758707427645,
    terms: [ 'buttery' ],
    queryTerms: [ 'butter' ],
    match: { buttery: [Array] },
    text: 'buttery noddles'
  },
  {
    id: 1,
    score: 0.6047966440700245,
    terms: [ 'butter' ],
    queryTerms: [ 'butter' ],
    match: { butter: [Array] },
    text: 'butter'
  },
  {
    id: 3,
    score: 0.5165637119112676,
    terms: [ 'butter' ],
    queryTerms: [ 'butter' ],
    match: { butter: [Array] },
    text: 'butter salted'
  },
  {
    id: 4,
    score: 0.5165637119112676,
    terms: [ 'butter' ],
    queryTerms: [ 'butter' ],
    match: { butter: [Array] },
    text: 'butter unsalted'
  }
]
>

I will work on the issue.

lucaong commented 2 months ago

@jacobthecoder @Lespoir after looking into it, I now understand what is happening, and it is not a bug. I'll try to explain.

When calculating the score, the BM25 algorithm considers the inverse document frequency, which reflects how common a term is within the document corpus. A more common term is discounted, while a less common (more specific) term gets a higher score, all else being equal. This is also the case in simpler algorithms such as Tf-Idf.

In the example above, we only have 4 documents, and the term butter appears in 3 of them. It is therefore a common term. On the contrary, the term buttery only appears once, making it a lot more specific. This gives a premium to the result matching buttery.
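To make the effect concrete, here is a small sketch of the standard BM25 inverse document frequency applied to the four-document example above (MiniSearch's internal formula may differ in its details; the numbers are only illustrative):

```javascript
// Standard BM25 inverse document frequency: rarer terms score higher.
// NOTE: illustrative sketch; MiniSearch's internals may differ in detail.
function idf(totalDocs, docsContainingTerm) {
  return Math.log(1 + (totalDocs - docsContainingTerm + 0.5) / (docsContainingTerm + 0.5))
}

const N = 4                      // four documents in the corpus
const idfButter = idf(N, 3)      // "butter" appears in 3 of 4 docs
const idfButtery = idf(N, 1)     // "buttery" appears in 1 of 4 docs

console.log(idfButter.toFixed(4))   // → 0.3567
console.log(idfButtery.toFixed(4))  // → 1.2040
```

With only 4 documents, buttery ends up with an IDF more than three times larger than butter, which is enough to outweigh the fuzzy-match discount.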

All else being equal, MiniSearch correctly scores exact matches higher than fuzzy matches, but in this case all else is not equal: the document frequency of butter and buttery is very different (the document frequency of butter being 3 times larger).

Note that the effect is large here because of the small numbers involved: only 4 documents. This would fade away with more documents in the corpus. For example, if we had even just 6 documents instead of 4, the additional ones containing neither butter nor buttery, things would work as expected, with document 2 scoring below the others:

const ms = new MiniSearch({ fields: ['text'] })
const documents = [
  { id: 1, text: 'butter' },
  { id: 2, text: 'buttery noodles' },
  { id: 3, text: 'butter salted' },
  { id: 4, text: 'butter not salted' },
  { id: 5, text: 'chocolate milk' },
  { id: 6, text: 'mozzarella cheese' }
]

ms.addAll(documents)

ms.search('butter', { fuzzy: 0.2 })
// =>
//    [
//        {
//          id: 1,
//          score: 1.2032723527697928,
//          terms: [ 'butter' ],
//          queryTerms: [ 'butter' ],
//          match: { butter: [Array] }
//        },
//        {
//          id: 3,
//          score: 1.0397207708399179,
//          terms: [ 'butter' ],
//          queryTerms: [ 'butter' ],
//          match: { butter: [Array] }
//        },
//        {
//          id: 4,
//          score: 0.9286055739562626,
//          terms: [ 'butter' ],
//          queryTerms: [ 'butter' ],
//          match: { butter: [Array] }
//        },
//        {
//          id: 2,
//          score: 0.9098253523094098,
//          terms: [ 'buttery' ],
//          queryTerms: [ 'butter' ],
//          match: { buttery: [Array] }
//        }
//    ]

If in your case this situation arises on real data, one possibility is to reduce the weight of fuzzy matches even further, by passing the weights search option:

// Default is: `weights: { fuzzy: 0.45, prefix: 0.375 }`
miniSearch.search('butter', { fuzzy: 0.2, weights: { fuzzy: 0.3, prefix: 0.2 } })

// The above option can also be passed to the constructor, to set a new default
// and avoid having to pass it to each search:
new MiniSearch({
  // ...other options
  searchOptions: {
    weights: { fuzzy: 0.3, prefix: 0.2 }
  }
})

That said, in most practical cases this is unlikely to be an issue on real data.

I hope this helps!

Lespoir commented 2 months ago

Hi @lucaong, thanks for the context. I reduced the weights and it worked better. However, when I tried with real data the problem persists when requesting 20 results where many documents contain butter and only a few contain buttery.

Butter is an exact match, so we expected it to rank first, but that's not the case.

I tried different weights with no success. I had to go down to 0.03 to make it work. That is a very low weight, and even then the scores are really close.

I attached a POC with the data and the experimentation results.

poc.tar.gz

rolftimmermans commented 2 months ago

Hi @Lespoir, I'm curious about the problem you report because I opened #135 which is the initial implementation of the revised scoring algorithm in MiniSearch.

I'd like to do some further investigation. Is it possible for you to share your actual dataset and your test queries that demonstrate the problem? Not a proof of concept, but the complete dataset that you're using that causes you to run into this problem? Thanks!

Lespoir commented 2 months ago

Hi @rolftimmermans, the latest POC contains the real data we are using in our app. We are getting 20 records from an external source and then using MiniSearch to search and sort them.

rolftimmermans commented 2 months ago

Thanks @Lespoir. So, just to be clear, this is your production data? Is it the entire dataset?

{ "id": "1", "food_name": "Buttery Noodles" },
{ "id": "2", "food_name": "Butter" },
{ "id": "3", "food_name": "Butter Salted" },
{ "id": "4", "food_name": "Butter Unsalted" },
{ "id": "5", "food_name": "Peanut Butter" },
{ "id": "6", "food_name": "Cocoa Butter" },
{ "id": "7", "food_name": "Almond Butter" },
{ "id": "8", "food_name": "Cashew Butter" },
{ "id": "9", "food_name": "Butter Cookies" },
{ "id": "10", "food_name": "Unsalted Butter" },
{ "id": "11", "food_name": "Whipped Butter" },
{ "id": "12", "food_name": "Butter Whipped Salted" },
{ "id": "13", "food_name": "Butter Whipped Unsalted" },
{ "id": "14", "food_name": "Toast with Butter" },
{ "id": "15", "food_name": "Peanut Butter Sandwich" },
{ "id": "16", "food_name": "Ghee Clarified Butter" },
{ "id": "17", "food_name": "Bread with Butter" },
{ "id": "18", "food_name": "Powdered Peanut Butter" },
{ "id": "19", "food_name": "Peanut Butter Toast" },
{ "id": "20", "food_name": "Creamy Peanut Butter" }

Lespoir commented 2 months ago

I'd say yes, @rolftimmermans, though it is not our entire dataset: we use another service to extract these 20 records from a larger dataset, and then use those 20 records in MiniSearch.

rolftimmermans commented 2 months ago

Could you tell me more about how this extraction works? Is it possible to provide the entire dataset?

Lespoir commented 2 months ago

@rolftimmermans the entire dataset contains many thousands of records. I don't think we need it to solve this issue.

Basically we have a really big dataset, we use some tools to get a small dataset from it, and that small dataset is the one we use in MiniSearch.

So for the scope of this issue the entire dataset is just those 20 records. All the extraction from the larger dataset doesn't involve MiniSearch.

rolftimmermans commented 2 months ago

Hi @Lespoir, thanks. I understand the basic idea, but if you are unable to provide the complete dataset, could you please provide a detailed technical & functional description of how the extraction works?

The reason I am asking: I am thinking about improvements that could be made to the scoring implementation. But to understand your problem completely, the nature of the complete dataset and exactly how you use MiniSearch are critical to my understanding of your situation and of how MiniSearch might be improved to suit it.

Lespoir commented 2 months ago

Hi @rolftimmermans, sure!

We query multiple data sources (from different APIs) and use MiniSearch as a way to stitch the data from the multiple data sources together.

Our dataset is small because at search time we only have 20 items per data source (usually fewer than 100 items in total).

All the items in the result are similar because the responses of the APIs are already search results.

rolftimmermans commented 2 months ago

Thank you @Lespoir. Can you elaborate on which functionality MiniSearch provides in your situation? Would you only use it for its ability to rank the search results? Do the results from various APIs include all data from each matched document or only a portion of it?

Lespoir commented 2 months ago

@rolftimmermans some answers:

lucaong commented 2 months ago

I see @Lespoir, so these 20 documents are already the result of a search, explaining why they are so similar. The problem is that ranking algorithms like BM25 use the document frequency and term frequency as input, but in this case the real document frequency is hidden from MiniSearch: in the original corpus "butter" might have a low relative frequency (say, 19 documents out of 1000s), but MiniSearch only sees the 20 already filtered documents, thus wrongly estimates a frequency of 19 out of 20.

I am interested to hear whether @rolftimmermans has some ideas. What might fit this particular use case would be to have access to the ranking functionality alone, and to provide frequency data to it explicitly (assuming you can get that from your API). This might or might not fit within MiniSearch's scope. If yes, I am happy to help.

rolftimmermans commented 2 months ago

Unfortunately @lucaong is right, you need access to all documents in order to correctly rank documents. The reason is that term frequency is an extremely important signal to determine relevance (excellently explained above by @lucaong). You need all documents to index that.

If your document set contains only matching results, ranking algorithms are pretty much useless because the search term occurs everywhere. It will probably be worse with multiple search terms, because a ranking algorithm then also won't be able to correctly assign relative importance to each term.

In your specific example it appears the issue is that "buttery" is considered a different word than "butter". Normally this is solved by assigning lower relative weights to fuzzy matches. If "butter" occurs in almost all documents, it is considered essentially a stop word and discounted entirely. The skewing of search results is therefore entirely due to the fact that MiniSearch has no access to the source data.

It's not impossible to develop a distributed ranking system, where searching happens in multiple different software systems and results are combined in one place before showing the results. But that is not what MiniSearch is.

One issue that your question raises is: "butter" and "buttery" are treated as different words. If they were seen as the same word then the exact match would rise to the top. This is a consequence of MiniSearch's goals of being compact, fast and versatile. Another approach is to apply stemming (which would transform "buttery" -> "butter"), which you could plug in to MiniSearch via the processTerm() function, but you'd need to bring your own algorithm.
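As a sketch of the stemming route: processTerm is MiniSearch's documented hook for term processing, and even a deliberately naive, made-up rule like the one below would collapse "buttery" into "butter". A real application should use a proper stemming algorithm (e.g. a Porter stemmer implementation) instead of this toy rule:

```javascript
// Deliberately naive stemmer for illustration only: it knows a single,
// made-up rule. Real stemming needs a proper algorithm (e.g. Porter).
function naiveStem(term) {
  const t = term.toLowerCase()
  // Strip a trailing "y" from longer words, so "buttery" -> "butter".
  if (t.length > 4 && t.endsWith('y')) return t.slice(0, -1)
  return t
}

// Plugged into MiniSearch via the processTerm option, both indexing and
// querying would then operate on the same stem:
//
//   new MiniSearch({ fields: ['food_name'], processTerm: t => naiveStem(t) })

console.log(naiveStem('buttery'))  // → "butter"
console.log(naiveStem('Butter'))   // → "butter"
```

With both terms mapped to the same stem, the exact match no longer competes against a rarer variant of itself.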

I've tried to investigate if there is a way to score frequencies based on the query terms rather than the words in the document (provided they match with exact/fuzzy/prefix matching, of course). But my (preliminary) conclusion is that this is not possible to index ahead of time and therefore out of reach in terms of computational requirements.

lucaong commented 2 months ago

> I've tried to investigate if there is a way to score frequencies based on the query terms rather than the words in the document (provided they match with exact/fuzzy/prefix matching, of course)

@rolftimmermans I was initially thinking in the same direction, but I concluded like you did that it's not feasible, and also I suspect that formally the current approach is the correct one. When using prefix/fuzzy match, it's really the frequency of the "derived" term that matters for ranking, not the query term. For example, when fuzzy match resolves a typo, the frequency of the (mistyped) query term would likely be zero in the corpus.

@Lespoir I believe your options are:

If you have control over the APIs, there would possibly be other options, like exposing term frequency data in the API results and implementing ranking on the client side. But that would be a different project, possibly taking inspiration from the BM25 implementation inside MiniSearch.
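As a rough sketch of that last idea, assuming the APIs could expose global term statistics (all numbers, names, and parameters below are hypothetical), a client-side BM25 score fed with the true corpus-wide document frequencies would no longer discount the exact match:

```javascript
// Sketch of client-side BM25 scoring using *global* corpus statistics
// supplied by an API (hypothetical numbers; k1 and b are the usual
// BM25 tuning parameters).
function bm25(termFreq, docLength, avgDocLength, totalDocs, docFreq, k1 = 1.2, b = 0.7) {
  const idf = Math.log(1 + (totalDocs - docFreq + 0.5) / (docFreq + 0.5))
  const tf = (termFreq * (k1 + 1)) /
    (termFreq + k1 * (1 - b + b * (docLength / avgDocLength)))
  return idf * tf
}

// Suppose the full corpus has 5000 documents, with "butter" in 19 of them
// and "buttery" in only 2. Scoring the exact match against the fuzzy match
// (the fuzzy result discounted by MiniSearch's default fuzzy weight, 0.45):
const exactScore = bm25(1, 1, 2.2, 5000, 19)        // doc: "butter"
const fuzzyScore = 0.45 * bm25(1, 2, 2.2, 5000, 2)  // doc: "buttery noodles"

console.log(exactScore > fuzzyScore)  // → true: the exact match ranks first
```

With realistic global frequencies, both terms are rare, so the IDF gap shrinks and the fuzzy-weight discount dominates, which is the behaviour the reporter expects.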

Lespoir commented 2 months ago

@rolftimmermans @lucaong, thank you for taking the time to review the issue and provide a detailed explanation. We will analyse our use case and explore our options.

lucaong commented 1 month ago

I will close the issue as the main questions should be answered, but @Lespoir feel free to comment further if you have more questions.