joaopalotti / trectools

A simple toolkit to process TREC files in Python.
https://pypi.python.org/pypi/trectools
BSD 3-Clause "New" or "Revised" License

MAP function does not consider properly ranking order #14

Closed by tonifuc3m 4 years ago

tonifuc3m commented 4 years ago

I think the get_map() method in trec_eval.py does not take ranking order into account properly. In line 337, the run_data dataframe is sorted by query, score and docid, so the rank column is ignored:

trecformat = self.run.run_data.sort_values(["query", "score", "docid"], ascending=[True,False,False]).reset_index()

Then, a ranking column is artificially created in lines 346-347:

topX["rank"] = 1
topX["rank"] = topX.groupby("query")["rank"].cumsum()

joaopalotti commented 4 years ago

Hi there,

Thanks for using trectools!

I believe there is nothing wrong with the code that you shared. However, the documentation is quite poor at the moment. So let me try to explain what is going on there.

Line 337 is there to force trectools to use the same sorting as the original TREC Eval program. Note that TREC Eval ignores the ranking column and instead sorts the documents by their scores and docids (you can find it somewhere in its code: https://github.com/usnistgov/trec_eval).

Here we leave the choice to the user. You can force get_map to sort as TREC Eval does with trec_eval=True (this is the default), or not (lines 339-340, which do not sort and instead respect whatever initial document order you used). Note that, either way, we create a topX with only three columns: ["query","docid","score"], although score is no longer used and could be removed. Note that topX has no column named rank.
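To make the two options concrete, here is a minimal sketch of the behaviour described above. It is not the actual trectools code: the helper name order_run and the toy data are assumptions made up for illustration, and only the sort_values call is taken from the line quoted in this issue.

```python
import pandas as pd

# Toy run in which the "rank" column deliberately disagrees with the scores.
run_data = pd.DataFrame({
    "query": ["q1", "q1", "q1"],
    "docid": ["d1", "d2", "d3"],
    "rank":  [1, 2, 3],
    "score": [0.2, 0.9, 0.9],
})

def order_run(run_data, trec_eval=True):
    """Hypothetical helper (not part of trectools): choose the document order."""
    if trec_eval:
        # Ignore "rank": sort by query, then by score and docid, both descending.
        ordered = run_data.sort_values(
            ["query", "score", "docid"], ascending=[True, False, False]
        ).reset_index(drop=True)
    else:
        # Keep whatever order the rows already have.
        ordered = run_data
    # Only these three columns survive, so the original "rank" is dropped.
    return ordered[["query", "docid", "score"]]

sorted_docids = order_run(run_data, trec_eval=True)["docid"].tolist()
original_docids = order_run(run_data, trec_eval=False)["docid"].tolist()
print(sorted_docids)    # ['d3', 'd2', 'd1']
print(original_docids)  # ['d1', 'd2', 'd3']
```

In this toy run, d2 and d3 tie on score, so the descending docid sort puts d3 first, even though the rank column listed d1 first.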

Lines 346-347, as you pointed out, create an artificial rank column, because MAP uses the document rank in its formula. However, note that this rank column is not the same as the original rank column from self.run.run_data, which is no longer used at this point in the code.
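As a rough illustration of why a positional rank is enough, the sketch below rebuilds the rank column with the two lines quoted from the issue and then computes MAP with the textbook average precision formula. The relevance judgments, the toy data and the average_precision helper are assumptions for this example, not the trectools implementation.

```python
import pandas as pd

# Documents already in their final order: three columns and no "rank" yet.
topX = pd.DataFrame({
    "query": ["q1", "q1", "q1", "q2", "q2"],
    "docid": ["d3", "d2", "d1", "d7", "d8"],
    "score": [0.9, 0.9, 0.2, 0.5, 0.4],
})

# Positional rank: a column of ones, cumulatively summed within each query.
topX["rank"] = 1
topX["rank"] = topX.groupby("query")["rank"].cumsum()
# The ranks are now 1, 2, 3 for q1 and 1, 2 for q2.

# Assumed relevance judgments: the set of relevant docids for each query.
relevant = {"q1": {"d3", "d1"}, "q2": {"d8"}}

def average_precision(group, relevant_docs):
    """Textbook AP: mean of precision@rank taken at each relevant document."""
    hits = 0
    precision_sum = 0.0
    for docid, rank in zip(group["docid"], group["rank"]):
        if docid in relevant_docs:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_docs) if relevant_docs else 0.0

ap = {q: average_precision(g, relevant[q]) for q, g in topX.groupby("query")}
mean_ap = sum(ap.values()) / len(ap)
print(ap, mean_ap)
```

The formula only needs each document's position within its query, which is exactly what the cumulative sum provides, regardless of what the run file's own rank column contained.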

Let me know if that is clear, and whether you can find any instance in which get_map() returns a value different from the original TREC Eval program.

Thanks,

Joao