haifengl / smile

Statistical Machine Intelligence & Learning Engine
https://haifengl.github.io
Other
5.99k stars 1.12k forks source link

Negative scores in BM25 #682

Closed schneijan closed 3 years ago

schneijan commented 3 years ago

Describe the bug It's not really a bug, rather a suggestion for an improvement. Currently, the inverse document frequency (IDF) in BM25 is calculated using Math.log((N - n + 0.5) / (n + 0.5)). However, this way negative IDF values are possible, for example when the term occurs in every single document of the corpus. Since the other part of the BM25 weighing scheme is multiplied with the IDF, also the overall BM25 score can become negative. This is typically undesired and the reason why other people recommend to use Math.log((N - n + 0.5) / (n + 0.5) + 1) instead for calculating the IDF. Wikipedia states this IDF formula and other relevant NLP platforms like Lucene seem to use similar approaches as well for calculating BM25. For this reason, I would suggest to change the IDF formula within the BM25 class accordingly.

Expected behavior The calculated BM25 score should always be greater than or equal to zero.

Actual behavior The calculated BM25 score may become negative if the respective term occurs in a high proportion of the documents within the corpus.

Code snippet

int freq = 3;
int docSize = 100;
int avgDocSize = 150;
int N = 10000000;
int n = 9999900;
double result = new BM25().score(freq, docSize, avgDocSize, N, n);
System.out.printf("BM25 score: %f%n", result);

Prints:

BM25 score: -30,982883

In PR #683 the suggested changes are implemented.

Thanks for having a look!

haifengl commented 3 years ago

Merged. Thanks!