Closed tonifuc3m closed 4 years ago
Hi there,
Thanks for using trectools!
I believe there is nothing wrong with the code you shared. However, the documentation is quite sparse at the moment, so let me try to explain what is going on there.
Line 337 is there to force trectools to use the same sorting as the original TREC Eval program. Note that TREC Eval ignores the ranking column and instead sorts the documents by their scores and docids (you can find this somewhere in their code: https://github.com/usnistgov/trec_eval).
Here we leave the option to the user: you can force get_map to sort as TREC Eval does with trec_eval=True
(this is the default) or not (lines 339-340, which skip the sort and respect whatever initial document order you used).
Note that, either way, we create a `topX` with only 3 cols: `["query", "docid", "score"]`, although `score` is not used anymore and could be removed. Note that `topX` has no col named `rank`.
Lines 346-347, as you pointed out, create an artificial col `rank`, because MAP uses the document rank in its formula. However, note that this col `rank` is not the same as the original col `rank` from `self.run.run_data`, which we do not use anymore at this point in the code.
Let me know if that is clear and/or you can find any instance in which get_map() returns a value that is different from the original TREC program.
Thanks,
Joao
I think the method get_map() within trec_eval.py does not take the ranking order into account properly. In line 337 the run_data dataframe is sorted by the values of query, score and docid (it does not take rank into account):
```python
trecformat = self.run.run_data.sort_values(["query", "score", "docid"], ascending=[True, False, False]).reset_index()
```
Then, the ranking column is artificially created in lines 346-347:
```python
topX["rank"] = 1
topX["rank"] = topX.groupby("query")["rank"].cumsum()
```