joaopalotti / trectools

A simple toolkit to process TREC files in Python.
https://pypi.python.org/pypi/trectools
BSD 3-Clause "New" or "Revised" License

Average over the queries in the intersection of relevance judgements and results #24

Closed: Yu-Shi closed this issue 2 years ago

Yu-Shi commented 3 years ago

Hello, I've recently been using trectools and found that its evaluation results do not match trec_eval's. In trec_eval, metrics are always averaged over the queries in the intersection of the relevance judgements and the results. In trectools, however, they are averaged over all queries in the results, and some of those may not be judged; such queries should be ignored rather than contribute a value of 0.
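
To make the behaviour I expect concrete, here is a minimal sketch in plain pandas (hypothetical per-query scores, not trectools internals):

    import pandas as pd

    # hypothetical per-query scores computed from a run; "q3" has no relevance judgements
    per_query = pd.Series({"q1": 0.50, "q2": 0.30, "q3": 0.00})
    judged_queries = {"q1", "q2"}  # queries that appear in the qrels

    # trec_eval-style average: only over queries present in both the run and the qrels
    mean_judged = per_query[per_query.index.isin(judged_queries)].mean()  # (0.50 + 0.30) / 2 = 0.40

    # averaging over every query in the run lets the unjudged q3 drag the mean down
    mean_all = per_query.mean()  # 0.80 / 3 = 0.2667

    print(mean_judged, mean_all)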

joaopalotti commented 3 years ago

Hi Yu-Shi,

Thank you for your message and for using our package. That is a good point! Would you be able to provide a minimal example of the situation you are reporting?

Thanks,

Joao

Yu-Shi commented 3 years ago

Hello,

Thanks for your response!

You can use these runs: https://github.com/thunlp/ConversationQueryRewriter/tree/master/results together with the judgements here: https://trec.nist.gov/data/cast/2019qrels.txt. The evaluation metric is NDCG@3. If you run both trec_eval and trectools, you will see that they give different results.
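
For reference, this is roughly how I run the trectools side (a sketch assuming the TrecRun/TrecQrel/TrecEval classes and the depth argument of get_ndcg from the package documentation; file paths are placeholders):

    from trectools import TrecRun, TrecQrel, TrecEval

    # placeholders: one of the linked runs and the CAsT 2019 qrels file
    run = TrecRun("bert_base_run_oracle.trec")
    qrels = TrecQrel("2019qrels.txt")

    print(TrecEval(run, qrels).get_ndcg(depth=3))
    # compare against: trec_eval -m ndcg_cut.3 2019qrels.txt bert_base_run_oracle.trec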

Please note that, due to some errors, the evaluation results reported in the README of ConversationQueryRewriter are slightly lower than the correct ones. I will fix them in the future.

Thanks,

Shi Yu

joaopalotti commented 3 years ago

Hi Shi Yu,

Thanks for reporting this issue. I pushed a fix to a separate branch. Please see here. It solves most of the problems you reported, but there are still a few pending things to do.

Running a few tests locally, I noticed that:

Would you be so kind as to have a look at the remaining issues?

Thank you very much,

Joao

Yu-Shi commented 3 years ago

Hi Joao,

Thank you for the quick fix. I haven't yet had a chance to check everything you mentioned, but I checked NDCG@3 on the file bert_base_run_oracle.trec and found that the updated version still doesn't give the same result as trec_eval: 0.5474 (trec_eval) vs. 0.5488 (trectools).

I took some time to investigate and found that the difference comes from query 69_2. The run looks like this:

...
69_2 Q0 MARCO_38426 1 3.800645589828491 BERT-base
69_2 Q0 MARCO_3640412 2 3.6633083820343018 BERT-base
69_2 Q0 CAR_1e496da6dbe1e1c2d80b30970038b92a57799749 3 3.6544055938720703 BERT-base
69_2 Q0 MARCO_2900384 4 3.6544055938720703 BERT-base
69_2 Q0 MARCO_4056499 5 3.4888365268707275 BERT-base
...

The first three documents in that run (MARCO_38426, MARCO_3640412, and CAR_1e496da6dbe1e1c2d80b30970038b92a57799749) all have a relevance judgement of 4, so trectools assigns an NDCG@3 of 1. However, trec_eval reports 0.7654 for this query. Note that the 3rd and 4th documents have exactly the same score, but MARCO_2900384 is an unjudged document (which should be treated as irrelevant), and I think trec_eval applies some kind of sorting that places MARCO_2900384 in front of CAR_1e496da6dbe1e1c2d80b30970038b92a57799749.
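
To illustrate the tie-break with a standalone snippet (plain pandas, independent of trectools internals; the ordering mirrors what trec_eval appears to do):

    import pandas as pd

    # the five rows of query 69_2 shown above
    run_692 = pd.DataFrame(
        [
            ("69_2", "MARCO_38426", 3.800645589828491),
            ("69_2", "MARCO_3640412", 3.6633083820343018),
            ("69_2", "CAR_1e496da6dbe1e1c2d80b30970038b92a57799749", 3.6544055938720703),
            ("69_2", "MARCO_2900384", 3.6544055938720703),
            ("69_2", "MARCO_4056499", 3.4888365268707275),
        ],
        columns=["query", "docid", "score"],
    )

    # score descending, ties broken by docid descending
    reordered = run_692.sort_values(["query", "score", "docid"], ascending=[True, False, False])
    print(reordered.head(3)["docid"].tolist())
    # the unjudged MARCO_2900384 now enters the top 3 in place of the CAR document,
    # which is why trec_eval reports 0.7654 for this query instead of 1.0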

It seems that trectools is aware of this document sorting problem and has code to do the sorting (e.g. in get_map()):

        if trec_eval:
            trecformat = self.run.run_data.sort_values(["query", "score", "docid"], ascending=[True,False,False]).reset_index()
            topX = trecformat.groupby("query")[["query","docid","score"]].head(depth)
        else:
            topX = self.run.run_data.groupby("query")[["query","docid","score"]].head(depth)

But I can't find such a snippet in get_ndcg(). Is that a bug, or is there a special reason for it?

Thank you again for your attention,

Shi Yu

Yu-Shi commented 3 years ago

Regarding the BPref problem: I set per_query=True and found that some queries, such as 59_6, 61_3, and 67_7, disappear from the result list. They are all judged, so they should appear in the per-query list.
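
Roughly how I checked it (a sketch assuming get_bpref(per_query=True) returns a per-query frame indexed by query id and that TrecQrel exposes its judgements as qrels_data; paths are placeholders):

    from trectools import TrecRun, TrecQrel, TrecEval

    run = TrecRun("bert_base_run_oracle.trec")  # one of the runs above
    qrels = TrecQrel("2019qrels.txt")

    per_query_bpref = TrecEval(run, qrels).get_bpref(per_query=True)

    # judged queries that never show up in the per-query output
    missing = set(qrels.qrels_data["query"].unique()) - set(per_query_bpref.index)
    print(sorted(missing))  # on my side this includes 59_6, 61_3, and 67_7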

Yu-Shi commented 3 years ago

I'm sorry, but I may not have enough time to follow up on this soon. I found a note that may help you correct the implementation of BPref: https://github.com/usnistgov/trec_eval/blob/master/bpref_bug. Thanks a lot!

joaopalotti commented 3 years ago

I checked NDCG@3 on the file bert_base_run_oracle.trec and found that the updated version still doesn't give the same result as trec_eval: 0.5474

Hi Shi Yu,

You are right. There was an alternative tool that calculated NDCG without re-ranking ties, but it is better to use the same approach for all metrics within trec_tools. I will update it as well.

joaopalotti commented 3 years ago

Regarding BPref, you are absolutely right. The line to blame is very likely this one here. Let me know if you are available to help fix it; otherwise, I can work on it in May.

Thank you very much in advance!

Yu-Shi commented 3 years ago

I'm willing to help, but I've been busy with other things recently. Sorry about that.

Looking forward to the updated version! 😊

snovaisg commented 2 years ago

Hi! I recently came across the same issue as OP: topics without qrels shouldn't contribute a value of 0.

Any updates on the fix? I can provide help if you need @joaopalotti

joaopalotti commented 2 years ago

Hi @snovaisg, thanks for your message. I will have some free time this weekend and should be able to work on this, but your help would be highly appreciated. I started a fix on this branch, but some metrics still need work, such as bpref.

snovaisg commented 2 years ago

OK, I will take a look! I also have some free time this weekend to work on this.

joaopalotti commented 2 years ago

Hi @snovaisg, have you made any progress? Thanks!

joaopalotti commented 2 years ago

Hi @snovaisg and @Yu-Shi. I finally had some free time and fixed this problem. Let me know if you have further comments, and thanks for this fruitful discussion and the suggestions for improvement.