teamtnt / tntsearch

A fully featured full text search engine written in PHP
https://tnt.studio/solving-the-search-problem-with-laravel-and-tntsearch
MIT License
3.1k stars 291 forks source link

Use relative term frequency instead of absolute #317

Open BlackbitDevs opened 3 months ago

BlackbitDevs commented 3 months ago

In https://github.com/teamtnt/tntsearch/blob/a763e66ca1bdebf1fab8f9ed10ec4bdbe2682bb0/src/TNTSearch.php#L113-L122

the score gets calculated. The problem here is that $document['hit_count'] returns the absolute number how often a term is contained in the document. This leads to the problem that for a spam document which simply contains all words 10 times, the score will be higher than for a document which has some terms more often than others.

Example: Let's take the example from https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Example_of_tf%E2%80%93idf but with a little changed hit counts We have 4 documents: Document 1 Term Hit count
this 1
is 1
a 2
sample 1
Document 2 Term Hit count
this 1
is 1
another 2
example 3
Spam document Term Hit count
this 10
is 10
a 10
sample 10
another 10
example 10

We search for "example": idf = log( 3 / 2 ) = 0.1761 -> the term "example" occurs in 2 / 3 documents.

tf how it is currently calculated in TNT search:

// document 1 does not get found because it does not contain term "example"

// document 2
$tf    = $document['hit_count'];  // = 3
$num   = ($tfWeight + 1) * $tf;  // = (1 + 1) * 3 = 6
$denom = $tfWeight 
    * ((1 - $dlWeight) + $dlWeight) // this calculation is already weird to me because (1 - $dlWeight) + $dlWeight) = 1 -> so what is this good for?
    + $tf;                                     // = 1 * ((1 - 0.5) + 0.5) + 3 = 4
$score             = $idf * ($num / $denom); // = 0.1761 * (6/4) = 0.26415

// spam document
$tf    = $document['hit_count'];  // = 10
$num   = ($tfWeight + 1) * $tf;  // = (1 + 1) * 10 = 20
$denom = $tfWeight 
    * ((1 - $dlWeight) + $dlWeight)  // this calculation is already weird to me because (1 - $dlWeight) + $dlWeight) = 1 -> so what is this good for?
    + $tf;                                     // = 1 * ((1 - 0.5) + 0.5) + 10 = 11
$score             = $idf * ($num / $denom); // = 0.1761 * (20/11) = 0.32

So the spam document has a higher score than document 2, although this document simply contains all words with the same frequency. On the other hand, document 2's most occurring word is "example", so imho this should get higher score.

According to Wikipedia's calculation term frequency is

Term frequency, tf(t,d), is the relative frequency of term t within document d, where ft,d [the numerator] is the raw count of a term in a document, i.e., the number of times that term t occurs in document d. Note the denominator is simply the total number of terms in document d (counting each occurrence of the same term separately).

(There are also some other definitions for the denominator there but I can't find the one used in TNT Search)

So with Wikipedia's definition of term frequency, there would be the following results:

tf(Document 2, 'example') = hit count of term / sum(hit_count) = 3 / 7 = 0.4286
tf(Spam document, 'example') = 10 / 60 = 0.1666

score(Document 2) = 0.1761 * 0.4286 = 0.0755
score(Spam document) = 0.1761 * 0.1666 = 0.0293
nticaric commented 3 months ago

The algorithm we are using here is BM25. You can learn more about it here

The $dlWeight variable represents the b coefficient which serves for ranking. In practice, the optimal range for this coefficient is typically between 0.5 and 0.8

In your extreme case, the spam document would rank better, but not to much. In practice a document where a word occurs more times is more relevant. By doing a lot of search against documents, we decided to keep the coefficient 0.5

Therefore, changing hit_count to float does not make sense, since this tells us the number how many times a token occurs in a document which can only be an integer

BlackbitDevs commented 3 months ago

Ok, will read about BM25 later.

But how can ((1 - $dlWeight) + $dlWeight) in

$denom = $tfWeight 
    * ((1 - $dlWeight) + $dlWeight)
    + $tf;

make any difference as this is always 1? In BM25 it calculates (1 - b + b * something) - but this something is missing in TNT Search's formula or the brackets are wrong.

nticaric commented 2 months ago

Sorry for the late reply! You are correct; the expression always equals 1. The formula has been modified over the years, and this was overlooked. I remember that we changed the BM25 formula to ignore the document length, which is why a parameter is missing.