usnistgov / trec_eval

Evaluation software used in the Text Retrieval Conference

Tie-Breaking in runs by docno causes unexpected behaviour #22

Closed najtin closed 4 years ago

najtin commented 4 years ago

internally ranks will be assigned by sorting by the sim field with ties broken deterministically (using docno).

Why are ties broken using docno? This seems odd, because it changes the results of runs that are equivalent except for document names.

Example: qrel.txt

1 Q0 a 0
1 Q0 b 1
1 Q0 c 0

run1.txt

1 0 b 1 1.0 run1
1 0 a 2 1.0 run1

run2.txt

1 0 b 1 1.0 run2
1 0 c 2 1.0 run2

These runs are almost identical. They retrieved documents with the same sim-score, relevance and rank. The only difference is the name. Let's take a look at the evaluation.

trec_eval -m ndcg qrel.txt run1.txt

prints ndcg all 1.0000 but

trec_eval -m ndcg qrel.txt run2.txt

prints ndcg all 0.6309. Here I would expect ndcg all 1.0000. I would expect ties to be broken by rank; only if that does not break the tie should docno be used.
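
To make the numbers concrete, here is a minimal sketch (illustrative only, not trec_eval's actual code) of what seems to be happening, assuming the re-ranking sorts by sim descending and breaks ties by docno in reverse lexicographic order, with a simplified ndcg helper:

```python
from math import log2

qrels = {"a": 0, "b": 1, "c": 0}      # docno -> relevance for topic 1
run1  = [("b", 1.0), ("a", 1.0)]      # (docno, sim) pairs as submitted
run2  = [("b", 1.0), ("c", 1.0)]

def ndcg(run, qrels):
    # Assumed tie-break: sim descending, then docno descending,
    # ignoring the submitted rank field entirely.
    ranked = sorted(run, key=lambda d: (d[1], d[0]), reverse=True)
    dcg   = sum(qrels[doc] / log2(i + 1) for i, (doc, _) in enumerate(ranked, 1))
    ideal = sorted(qrels.values(), reverse=True)[:len(ranked)]
    idcg  = sum(rel / log2(i + 1) for i, rel in enumerate(ideal, 1))
    return dcg / idcg

print(ndcg(run1, qrels))  # 1.0    -> 'b' stays at rank 1 ('b' > 'a')
print(ndcg(run2, qrels))  # 0.6309 -> 'c' > 'b', so the relevant doc drops to rank 2
```

With run2 the tie-break moves the only relevant document to rank 2, giving 1/log2(3) ≈ 0.6309.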

isoboroff commented 4 years ago

Ties are broken by docno because it's basically random with respect to the system. (There are some test collections where that's less true, but leave it for now.) If your system has a reason why those documents should be ordered a certain way, it should give them different scores.

najtin commented 4 years ago

Okay, I think I don't fully understand what the rank column does. I expected it to indicate the ranking of the retrieved documents, but if I understand correctly it is intentionally ignored. Does it serve any other purpose then?

najtin commented 4 years ago

If the run/result were sorted by (sim, rank), there would be no problem, would there?

Right now trec_eval assumes that equal sim scores mean the system is indifferent about the order of these documents. Consequently, on a tie any order would be reasonable from the system's point of view.

Since the resolution of the tie is already deterministic, trec_eval could just as well use the rank and not lose any property it had before (see the sketch below). I only stumbled on this because I had a difference of 3 percentage points between two evaluations of 'real' data, where the data was identical apart from the documents having different ids. I might put some research into how much this has affected evaluations in the past.
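
A small, purely illustrative sketch of that proposal: swapping the assumed (sim desc, docno desc) key for (sim desc, rank asc) would preserve run2's submitted order and therefore its score:

```python
# Each entry is (docno, rank, sim) as it appears in the run file for topic 1.
run2 = [("b", 1, 1.0), ("c", 2, 1.0)]

# Assumed current behaviour: sim descending, then docno descending.
by_docno = sorted(run2, key=lambda d: (d[2], d[0]), reverse=True)
# Proposed alternative: sim descending, then submitted rank ascending.
by_rank  = sorted(run2, key=lambda d: (-d[2], d[1]))

print([d[0] for d in by_docno])  # ['c', 'b'] -> relevant doc pushed to rank 2
print([d[0] for d in by_rank])   # ['b', 'c'] -> submitted order kept, ndcg stays 1.0
```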

najtin commented 4 years ago

This paper by Guillaume Cabanac et al. is really interesting and takes a deeper look at the issue. They ran experiments to measure the impact of the tie ordering.

This minimal example illustrates the issue addressed in the paper: IRS scores depend not only on their ability to retrieve relevant documents, but also on document names in case of ties. Relying on docno field for breaking ties here implies that the Wall Street Journal collection (WSJ documents) is more relevant than the Associated Press collection (AP documents) for whatever the topic, which is definitely wrong. This rationale introduces an uncontrolled parameter in the evaluation regarding all rank-based measures, skewing comparisons unfairly.

They proposed two tie-breaking strategies, where rel is the relevance of the document:

realistic reordering: qid asc, sim desc, rel asc, docno desc
optimistic reordering: qid asc, sim desc, rel desc, docno desc
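
A hedged sketch of those two reorderings (the sort keys follow the description quoted here; the reorder helper and field layout are illustrative, not the paper's or trec_eval's code):

```python
# Each run entry: (qid, docno, rank, sim); qrels maps (qid, docno) -> relevance.
def reorder(run, qrels, optimistic=False):
    # Apply the keys in reverse priority order, relying on Python's stable sort.
    out = sorted(run, key=lambda e: e[1], reverse=True)              # docno desc
    out = sorted(out, key=lambda e: qrels.get((e[0], e[1]), 0),
                 reverse=optimistic)                                 # rel asc (realistic) / desc (optimistic)
    out = sorted(out, key=lambda e: e[3], reverse=True)              # sim desc
    return sorted(out, key=lambda e: e[0])                           # qid asc

qrels = {(1, "a"): 0, (1, "b"): 1, (1, "c"): 0}
run2  = [(1, "b", 1, 1.0), (1, "c", 2, 1.0)]
print([e[1] for e in reorder(run2, qrels)])                   # ['c', 'b']  realistic (worst case)
print([e[1] for e in reorder(run2, qrels, optimistic=True)])  # ['b', 'c']  optimistic (best case)
```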

In Sect. 5.2 we showed that IRS scores are influenced by luck. This is an issue when evaluating several IRSs. Comparing them according to evaluation measures may be unfair, as some may just have been luckier than others. In order to foster fairer evaluations, it may be worth supplying trec_eval with an additional parameter allowing reordering strategy selection: Realistic, Conventional and Optimistic.

They also pointed out why using the rank for tie-breaking, as I suggested, is a bad idea:

Alternatively, relying on the initial ranks (from run) implies the same issue: IRS designers may have untied their run by assigning random ranks, as they were not able to compute a discriminative sim for those documents. As a result, random-based and initial rank-based approaches do not solve the tie-breaking issue.

isoboroff commented 4 years ago

I do not agree with that paper. If your system assigns the same score to two documents, why should the evaluation software assume that any one random shuffle of those two documents is any better or worse than any other?

I recognize that people sometimes get caught assuming that the rank field is meaningful. It's not. However, it's also a historical artifact that persists through 29 years of TREC. For all that time, it's been true that ties are broken by reverse lexicographic docno and not by the rank field. That's not an argument that it's right, but it is an argument that it isn't going to change.

If your system knows some reason that those documents should be sorted in a particular order, then it should assign scores to reflect that.