joaopalotti / trectools

A simple toolkit to process TREC files in Python.
https://pypi.python.org/pypi/trectools
BSD 3-Clause "New" or "Revised" License

Score ties in run files lead to non-stable pools & evaluation #26

Closed by lgienapp 2 years ago

lgienapp commented 3 years ago

Problem Description

I recently used trectools to evaluate run files that had score ties, i.e. multiple documents assigned the same score by the retrieval system.

Right now, when reading a run file, the documents are sorted by score. This is a problem because the default pandas sorting algorithm is quicksort, which is not stable. Therefore, the order of document IDs among tied scores, and thus the top-X documents used in multiple places throughout trectools, is not guaranteed to be the same every time. In my case, this meant a top-5 evaluation had incomplete coverage against a qrel file created from the top-5 pool of the very same run files.
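To make the failure mode concrete, here is a minimal sketch with invented data: three documents tied at the same score. With the default (unstable) quicksort the tied rows may come back in any order, while a stable algorithm like mergesort preserves the original row order:

```python
import pandas as pd

# Hypothetical run fragment: three documents tied at score 1.0 for one query.
run = pd.DataFrame({
    "query": ["q1", "q1", "q1"],
    "docid": ["doc_a", "doc_b", "doc_c"],
    "score": [1.0, 1.0, 1.0],
})

# Default single-column sort uses quicksort, which is NOT stable:
# the relative order of the tied documents is not guaranteed.
unstable = run.sort_values("score", ascending=False)

# A stable algorithm (mergesort) preserves the original row order on ties,
# so any top-X cutoff is reproducible.
stable = run.sort_values("score", ascending=False, kind="mergesort")
print(stable["docid"].tolist())
```

With mergesort the tied documents always come out as `['doc_a', 'doc_b', 'doc_c']`; with quicksort that order is merely one of several possible outcomes.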

Proposed Solutions

As pandas does not guarantee a stable sort when sorting by multiple columns, the only solution I see is to add rank as a third sorting key, which preserves the original order in case of score ties: `self.run_data.sort_values(["query","score","rank"], inplace=True, ascending=[True,False,True])`
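A minimal sketch of the proposed fix on a toy DataFrame (column names follow trectools' run format; the data itself is invented). Ties in score fall back to the rank recorded in the run file, so the ordering no longer depends on the sort algorithm:

```python
import pandas as pd

# Toy run with a three-way score tie; rows arrive out of rank order.
run_data = pd.DataFrame({
    "query": ["q1", "q1", "q1"],
    "docid": ["doc_c", "doc_a", "doc_b"],
    "rank":  [3, 1, 2],
    "score": [1.0, 1.0, 1.0],
})

# "rank" as a third sort key makes the tie order deterministic.
run_data.sort_values(["query", "score", "rank"],
                     inplace=True, ascending=[True, False, True])
print(run_data["docid"].tolist())  # ['doc_a', 'doc_b', 'doc_c']
```

The resulting order is fully determined by the run file's own rank column, so repeated reads of the same file always produce the same top-X set.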

joaopalotti commented 3 years ago

Hi Lukas, sorry for the late reply, and thanks for identifying this issue. Very interesting problem that you are reporting. Could you share with us an example of this issue happening? I wonder if we should also change the sorting functions used in trec_eval.py.

cheers

lgienapp commented 3 years ago

I created a mock-up gist to illustrate the behavior in the extreme case. It creates dummy runs with all scores equal to 1. When creating a top-X pool and subsequently evaluating at the same depth X, .get_unjudged() reports incomplete coverage in most cases.

joaopalotti commented 2 years ago

Hi @lgienapp, sorry again for the huge delay in getting back to you, and thanks for the example, it was very useful. Interestingly, using your suggestion did not solve the problem; I still get different coverages every time I run the code. Using "docid" as the tie-breaker instead does solve it, and it aligns with the original design of trec_eval.
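For illustration, a sketch of the docid-based tie-break on invented toy data. The exact sort direction for ties is an assumption here (trec_eval is commonly described as breaking score ties by document ID in descending order); the point is that the order now depends only on the run's content, not on the sorting algorithm:

```python
import pandas as pd

# Toy run with tied scores; tie order becomes a function of docid alone.
run_data = pd.DataFrame({
    "query": ["q1", "q1", "q1"],
    "docid": ["doc_b", "doc_a", "doc_c"],
    "score": [1.0, 1.0, 1.0],
})

# Break ties by docid (descending here, assumed to match trec_eval's
# reverse-lexicographic convention) so repeated sorts are deterministic.
run_data.sort_values(["query", "score", "docid"],
                     inplace=True, ascending=[True, False, False])
print(run_data["docid"].tolist())  # ['doc_c', 'doc_b', 'doc_a']
```

Unlike the rank-based tie-break, this also gives identical orderings for two run files that list the same tied documents in different row orders.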

I am going to close this issue with a small PR; feel free to re-open it in case the problem is not fixed or you have another suggestion!

Thanks!