Open BlackbitDevs opened 3 months ago
The algorithm we are using here is BM25. You can learn more about it here
The $dlWeight variable represents the b coefficient, which controls how strongly document length is normalized in the ranking. In practice, the optimal range for this coefficient is typically between 0.5 and 0.8.
In your extreme case, the spam document would rank better, but not by much. In practice, a document in which a word occurs more often is more relevant. After running many searches against documents, we decided to keep the coefficient at 0.5.
Therefore, changing hit_count to float does not make sense, since it holds the number of times a token occurs in a document, which can only be an integer.
Ok, will read about BM25 later.
But how can ((1 - $dlWeight) + $dlWeight) in
$denom = $tfWeight * ((1 - $dlWeight) + $dlWeight) + $tf;
make any difference, as this is always 1? In BM25 it calculates (1 - b + b * something), but this something is missing in TNT Search's formula or the brackets are wrong.
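For comparison, here is the per-term BM25 score as given in Wikipedia's Okapi BM25 article, followed by what the denominator in the quoted line reduces to. This is a sketch under one assumption: that $tfWeight plays the role of k_1 (that $dlWeight is b is stated above). Here f(q, D) corresponds to hit_count, |D| is the document length, and avgdl is the average document length.

```latex
% Standard BM25 contribution of one query term q to document D
\mathrm{score}(D, q) = \mathrm{IDF}(q) \cdot
  \frac{f(q, D)\,(k_1 + 1)}
       {f(q, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}

% Denominator of the quoted line: the factor |D| / avgdl is absent, so
k_1 \left((1 - b) + b\right) + f(q, D) = k_1 + f(q, D)
```

With the length ratio dropped, b cancels out entirely, so changing $dlWeight cannot affect the ranking.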
Sorry for the late reply! You are correct; the expression always equals 1. The formula has been modified over the years, and this was overlooked. I remember that we changed the BM25 formula to ignore the document length, which is why a parameter is missing.
The score gets calculated in https://github.com/teamtnt/tntsearch/blob/a763e66ca1bdebf1fab8f9ed10ec4bdbe2682bb0/src/TNTSearch.php#L113-L122. The problem here is that $document['hit_count'] returns the absolute number of times a term occurs in the document. As a result, a spam document that simply contains every word 10 times gets a higher score than a document in which some terms occur more often than others.
We search for "example": idf = log( 3 / 2 ) = 0.1761 (base-10 logarithm) -> the term "example" occurs in 2 of 3 documents.
tf as it is currently calculated in TNT Search:
So the spam document has a higher score than document 2, although the spam document simply contains all words with the same frequency. Document 2's most frequent word, on the other hand, is "example", so IMHO it should get the higher score.
According to Wikipedia's calculation term frequency is
(There are also some other definitions for the denominator there but I can't find the one used in TNT Search)
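For reference, this is the relative-frequency definition of term frequency from Wikipedia's tf-idf article, which is presumably the one meant here. f_{t,d} is the raw count of term t in document d (the value hit_count holds), and the sum runs over all terms t' in d.

```latex
\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}}
```

Under this definition, a document that repeats every word equally often gains nothing from the repetition, because numerator and denominator grow by the same factor.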
So with Wikipedia's definition of term frequency, there would be the following results: