terrier-org / terrier-core

Terrier IR Platform
http://terrier.org/

Negative scores from DirichletLM #248

Open ojwb opened 3 months ago

ojwb commented 3 months ago

In modules/core/src/main/java/org/terrier/matching/models/DirichletLM.java a comment says:

This model is formulated such that all scores are > 0.

and also:

This has one parameter, mu > 0.

The score is calculated by:

        public double score(double tf, double docLength) {
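                // WeightingModelLibrary.log is log base 2; super.termFrequency is the
                // term's frequency in the whole collection, and numberOfTokens is the
                // total number of tokens across all documents.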
                return WeightingModelLibrary.log(1 + (tf/(mu * (super.termFrequency / numberOfTokens))) ) + WeightingModelLibrary.log(mu/(docLength+mu));
        }

As above, mu is positive, and docLength should be >= 0, so mu/(docLength+mu) will be <= 1 and the second log will be <= 0. The first log is always >= 0, but it seems to me that given a large enough value for docLength the negative second log can outweigh it, and the calculated score ends up negative.

The possible values of the variables in the equation are interrelated (e.g. numberOfTokens is the sum of docLength over all documents), so let's try to find an actual example where we get a negative score.

Rearranging: the score is negative exactly when log(1 + tf*numberOfTokens/(mu*super.termFrequency)) < log((docLength+mu)/mu), i.e. when 1 + tf*numberOfTokens/(mu*super.termFrequency) < (docLength+mu)/mu. Multiplying both sides by mu and cancelling, this happens when:

docLength > tf * numberOfTokens / super.termFrequency

Say we have 10 documents, one of length 11 and the rest of length 1 (so numberOfTokens is 20). One term occurs once in every document (so its super.termFrequency is 10), and 10 other distinct terms each occur once in the long document. This satisfies the criterion derived above: 11 > 1 * 20 / 10.

Checking the value calculated by the formula in score() for the common term in the long document (taking mu = 2500, and noting WeightingModelLibrary.log is log base 2), we do indeed get a negative score:

$ python -c 'from math import *; print(log2(1 + (1/(2500 * (10 / 20))) )+log2(2500/(11+2500)))'
-0.005180239105680202
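
For anyone who wants to experiment, here's the same check as a small standalone script (a sketch, not Terrier code; it hard-codes mu = 2500 and uses log base 2 to match WeightingModelLibrary.log):

    from math import log2

    mu = 2500
    number_of_tokens = 20   # total tokens in the toy collection
    coll_freq = 10          # super.termFrequency for the common term

    def score(tf, doc_length):
        # transcription of DirichletLM.score() using log base 2
        return (log2(1 + tf / (mu * (coll_freq / number_of_tokens)))
                + log2(mu / (doc_length + mu)))

    print(score(1, 1))    # common term in a short document: positive (~0.00058)
    print(score(1, 11))   # common term in the long document: negative (~-0.00518)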

Have I misunderstood something here?

ojwb commented 3 months ago

Looking at equation 6 in the Zhai & Lafferty 2004 paper linked from the source file, I think the formula Terrier is using isn't quite right either.

As the paper notes, "the last term on the righthand side is independent of the document d, and thus can be ignored in ranking".

The problem is the second term, n log αd, where n is the number of terms in the query and αd = mu/(docLength+mu). Terrier instead adds log αd to the score computed for each matching term, so it effectively multiplies log αd by the number of matching query terms, which differs between documents that don't match all of the query terms.
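
To see why this difference can't be ignored in ranking, here's a toy comparison (made-up numbers, not Terrier code: a two-term query, log base 2, mu = 2500, and invented collection statistics):

    from math import log2

    mu = 2500
    N = 1000                         # numberOfTokens (invented)
    p = {'a': 10 / N, 'b': 10 / N}   # p(term|collection) per query term (invented)

    def term_part(tf, term):
        # per-term part of equation 6; always >= 0
        return log2(1 + tf / (mu * p[term]))

    def log_alpha(doc_length):
        # log αd = log(mu / (docLength + mu)); always <= 0
        return log2(mu / (doc_length + mu))

    def eq6(tfs, doc_length, n=2):
        # equation 6: log αd counted once per *query* term
        return sum(term_part(tf, t) for t, tf in tfs.items()) + n * log_alpha(doc_length)

    def terrier(tfs, doc_length):
        # Terrier's formulation: log αd added once per *matching* term
        return sum(term_part(tf, t) + log_alpha(doc_length) for t, tf in tfs.items())

    # a document matching only one of the two query terms scores higher under
    # Terrier's formulation by -log_alpha(doc_length), a gap which varies with
    # document length:
    for doc_length in (10, 1000):
        print(terrier({'a': 1}, doc_length) - eq6({'a': 1}, doc_length))

The two formulations agree exactly for documents matching every query term; for partial matches the gap is (n - matched terms) * |log αd|, which grows with document length.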

cmacdonald commented 1 month ago

Hi Olly (@ojwb). Good to hear from you, and sorry for the delay. I missed this coming in.

So I wasn't involved in the Dirichlet formulation in Terrier. I suspect that the original authors were aiming for an implementation that always gave a positive score, as Terrier at that point (and perhaps other systems?) assumed that scores should be positive (e.g. for pruning).

Happy to test other formulations on our test collections.

ojwb commented 3 weeks ago

Hi Craig, hope you're well.

Meanwhile I've dug further into this and (I think) now understand things better. For background: I was trying to fix Xapian's LM weighting implementation, which was also wrong, and was looking at Terrier's implementation as well as the original papers for guidance.

I've now fixed Xapian's implementation. The trick used to avoid negative per-term weight contributions is to add an extra component to each term's contribution such that it's always positive, then to subtract the sum of these extra components so the formula gives the same result. We arrange for this sum to depend only on the document, making it a term-independent weight contribution, w(doc) in this formula:

weight(doc) = sum_over_term(w(term,doc)) + w(doc)

We can also add a contribution to every weight that is fixed for a given query and document collection (and so won't affect the ranking) such that w(doc) >= 0 (for pruning).
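
Concretely for Dirichlet smoothing the decomposition looks something like this (a sketch under my reading of equation 6, not Xapian's actual code; it assumes log base 2 and that the length of the longest document in the collection is known):

    from math import log2

    mu = 2500

    def w_term(tf, coll_freq, number_of_tokens):
        # per-term contribution: log2(1 + tf / (mu * p(term|collection)));
        # always >= 0
        return log2(1 + tf / (mu * (coll_freq / number_of_tokens)))

    def w_doc(doc_length, query_len, max_doc_length):
        # term-independent contribution: query_len * log αd plus the constant
        # query_len * log2((max_doc_length + mu) / mu), so it's always >= 0
        return query_len * log2((max_doc_length + mu) / (doc_length + mu))

    # weight(doc) = sum of w_term over matching query terms + w_doc, which is
    # equation 6 shifted by a constant that doesn't affect the ranking

The shift by query_len * log2((max_doc_length + mu) / mu) is the smallest constant that keeps w_doc >= 0 for every document, with equality for the longest document.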

However, this approach doesn't seem to work within Terrier's existing framework, which doesn't appear to support a term-independent weight component. (LM with Jelinek-Mercer smoothing is the only LM model I've looked at which doesn't need such a component.)