Describe the bug
It's not really a bug, rather a suggestion for an improvement. Currently, the inverse document frequency (IDF) in BM25 is calculated using Math.log((N - n + 0.5) / (n + 0.5)). However, this way negative IDF values are possible, for example when the term occurs in every single document of the corpus. Since the other part of the BM25 weighing scheme is multiplied with the IDF, also the overall BM25 score can become negative. This is typically undesired and the reason why other people recommend to use Math.log((N - n + 0.5) / (n + 0.5) + 1) instead for calculating the IDF. Wikipedia states this IDF formula and other relevant NLP platforms like Lucene seem to use similar approaches as well for calculating BM25. For this reason, I would suggest to change the IDF formula within the BM25 class accordingly.
Expected behavior
The calculated BM25 score should always be greater than or equal to zero.
Actual behavior
The calculated BM25 score may become negative if the respective term occurs in a high proportion of the documents within the corpus.
Code snippet
int freq = 3;
int docSize = 100;
int avgDocSize = 150;
int N = 10000000;
int n = 9999900;
double result = new BM25().score(freq, docSize, avgDocSize, N, n);
System.out.printf("BM25 score: %f%n", result);
Describe the bug It's not really a bug, rather a suggestion for an improvement. Currently, the inverse document frequency (IDF) in BM25 is calculated using
Math.log((N - n + 0.5) / (n + 0.5))
. However, this way negative IDF values are possible, for example when the term occurs in every single document of the corpus. Since the other part of the BM25 weighing scheme is multiplied with the IDF, also the overall BM25 score can become negative. This is typically undesired and the reason why other people recommend to useMath.log((N - n + 0.5) / (n + 0.5) + 1)
instead for calculating the IDF. Wikipedia states this IDF formula and other relevant NLP platforms like Lucene seem to use similar approaches as well for calculating BM25. For this reason, I would suggest to change the IDF formula within the BM25 class accordingly.Expected behavior The calculated BM25 score should always be greater than or equal to zero.
Actual behavior The calculated BM25 score may become negative if the respective term occurs in a high proportion of the documents within the corpus.
Code snippet
Prints:
In PR #683 the suggested changes are implemented.
Thanks for having a look!