lucaong / minisearch

Tiny and powerful JavaScript full-text search engine for browser and Node
https://lucaong.github.io/minisearch/
MIT License

Strange scoring #116

Closed: Tobbe closed this issue 2 years ago

Tobbe commented 2 years ago

Hi

Thanks for a great (and fast!) search library.

I'm trying to tweak the scoring, but I'm a bit confused about the results I'm seeing. I made a CodePen (sorry about the Swedish in the example data)

https://codepen.io/tobbe_lundberg/pen/Jjrgebz

The top result is great, but I don't understand why results 2 and 3 come before number 4. I've boosted 'name', so I was expecting any result with a match in the name to come before results with no match in the name. Can you explain what's going on here?
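For reference, my setup is roughly like the following sketch (the field names match the CodePen data; `documents`, the `id` field, and the boost value of 2 are placeholders rather than the exact values from the pen):

```js
import MiniSearch from 'minisearch'

const miniSearch = new MiniSearch({
  // fields to index for full-text search
  fields: ['name', 'long_desc'],
  // fields to return with the search results
  storeFields: ['name']
})

// `documents` stands in for the array of items from the CodePen;
// each item is assumed to have a unique `id` field
miniSearch.addAll(documents)

// boost matches in `name` relative to `long_desc`
const results = miniSearch.search('Rattnav', { boost: { name: 2 } })
```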

Thanks a lot!

lucaong commented 2 years ago

Hi @Tobbe, thanks for the kind words!

Indeed, the scoring is a bit surprising in your example. It seems that the description field is scored much higher than the name field. If one boosts name by a large amount, say 100, the results look more like what you would expect. I am not sure of the reason yet; I will give it a detailed look later.

Tobbe commented 2 years ago

Thanks for getting back to me @lucaong

I never thought to try with a boost that high. I'll play around a bit more with higher boosts to see where that gets me 🙂 Looking forward to any/all insights you can provide later, when you've had a chance to give this a proper look 👍

lucaong commented 2 years ago

At first glance, I think the issue might be that the term Rattnav is overall much less frequent in the descriptions than in the names, and therefore the TF-IDF scoring assigns more relevance to the few descriptions that do contain this term. Hence the need for such a large boost (which looks atypical to me too).

As an example, imagine we had 10 documents in total. Descriptions are all 10 terms long, and one document contains our term in the description. Names are two terms long, and 5 documents contain the term in the name. I will also use a simplified version of TF-IDF, just for the sake of a simpler example. The scoring of a document containing the term in the description would be:

term frequency: 1/10, document frequency: 1/100, simplified TF-IDF: 10

For a document containing the term in the name:

term frequency: 1/2, document frequency: 5/20, simplified TF-IDF: 2

So in this simplified example one would have to boost the name by at least 5 to make results containing the term in the name as relevant as the ones containing it in the description.

I still want to perform the real calculations to determine if in your case we are seeing the expected results, or if there is an underlying issue skewing them more than expected.

Tobbe commented 2 years ago

Some of my descriptions can be up to around 500 terms. A typical description is probably 20 - 50 words, and a typical name is probably 2 - 10 words. Maybe I just do need a really high boost :)

lucaong commented 2 years ago

Sorry, my calculations above are not correct, as I used a wrong definition of document frequency. I must still be sleepy 😄 I will post correct calculations later, but I think the main conclusions hold.

lucaong commented 2 years ago

OK, I analyzed your case in more detail, and now I can confidently say that the result is expected. The main reason for the scoring you observe is that there are 25 documents in total, out of which 23 contain the term “Rattnav” in the name field. Since almost all documents contain the term in the name, the calculated TF-IDF score is quite low (because none of the documents “stand out”).

Conversely, finding the term in the long_desc field increases the score by a lot more, because only 3 documents out of 25 contain the term in the long_desc.

More specifically, the IDF part of the score is calculated as log(n/d), where n is the total number of documents and d is the number of documents containing the term. Plugging in the numbers, it is log(25/23) ≈ 0.08 for a document containing the term in name, and log(25/3) ≈ 2.12 for a document containing the term in long_desc. All other things being equal, a document containing the term only in long_desc scores over 25 times higher than a document containing it only in name.
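As a quick sanity check of those numbers (this just evaluates the log(n/d) formula above, it is not MiniSearch's internal code):

```js
// IDF as described above: log(n / d)
const idf = (n, d) => Math.log(n / d)

idf(25, 23) // ≈ 0.08, "Rattnav" appears in the name of 23 out of 25 documents
idf(25, 3)  // ≈ 2.12, "Rattnav" appears in the long_desc of only 3 documents
// ratio ≈ 25x in favour of a match in long_desc
```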

I hope this helps clarify the issue. Your case is a bit atypical, because almost all documents contain the search term in name. If you added, for example, 100 more documents that do not contain the term, the results would end up scored more as you expect.
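Continuing the same back-of-the-envelope calculation (hypothetically, with 100 extra documents containing the term in neither field):

```js
Math.log(125 / 23) // ≈ 1.69 for a match in name
Math.log(125 / 3)  // ≈ 3.73 for a match in long_desc
// the gap shrinks from roughly 26x to roughly 2.2x
```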

Let me know if more information is needed.

Tobbe commented 2 years ago

Thanks a lot Luca for the detailed explanation.

For the CodePen I just took a sample of my real data. In my actual code I add 7680 items to MiniSearch, and when searching for "Rattnav" I get 34 hits. So for my actual implementation I "only" had to boost name to 10 to get the results I wanted 🙂