Closed: Sewens closed this issue 3 years ago
Oh, I've finally figured out what's going wrong. I was evaluating my ranking model on the Java language only, so I loaded just the Java annotation data from https://github.com/github/CodeSearchNet/blob/master/resources/annotationStore.csv. There are 5 to 10 annotated code snippets for each query, and only a few of them are unrelated. The lack of 0-relevance labels leads to relatively high NDCG scores. To fix this, the annotations for the other five languages are each treated as negative samples in the Java evaluation. With that, I'm closing the issue.
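A minimal sketch of that fix, assuming pandas and the annotationStore.csv column names (Language, Query, GitHubUrl, Relevance, Notes); the local file path and the rule that maps every non-Java row to relevance 0 are my assumptions, not part of the official evaluation script:

```python
import pandas as pd

# Load the full annotation store (all six languages). Column names are
# assumed to follow annotationStore.csv: Language, Query, GitHubUrl, Relevance, Notes.
annotations = pd.read_csv("annotationStore.csv")

# Java annotations keep their human relevance labels (0-3).
java_rows = annotations[annotations["Language"] == "Java"].copy()

# Annotations from the other five languages become 0-relevance candidates
# (assumption: a snippet in another language is never relevant to a Java query).
other_rows = annotations[annotations["Language"] != "Java"].copy()
other_rows["Relevance"] = 0

# Only keep negatives for queries that also appear in the Java subset.
other_rows = other_rows[other_rows["Query"].isin(java_rows["Query"])]

eval_set = pd.concat([java_rows, other_rows], ignore_index=True)
```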
Thanks for looking into this @Sewens!
I've noticed that the official NDCG calculation is here:
https://github.com/github/CodeSearchNet/blob/3f999d599a2383ca5f47f1d7b745316ec7db86d9/src/relevanceeval.py#L75
Based on this, the original paper reports NDCG scores for six languages on the code search task.
I re-implemented a baseline search model based on an MLP and computed the MRR, MAP, and NDCG metrics myself.
The MRR is 0.5128467211800546, MAP is 0.19741363623638755, and NDCG is 0.6274463943748803. Both MRR and MAP seem reasonable, but the NDCG is nearly three times higher than the results reported in the original paper. I don't think this reflects the strength of my baseline model; there must be something wrong with my NDCG implementation.
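For reference, this is a minimal sketch of how MRR and MAP can be computed per query from ranked relevance labels (treating relevance >= 1 as positive is my own assumption); small differences in such choices already shift the absolute numbers:

```python
import numpy as np

def reciprocal_rank(ranked_relevance):
    """1 / rank of the first relevant result, or 0 if nothing relevant is retrieved."""
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0

def average_precision(ranked_relevance):
    """Mean of precision@k over the ranks k that hold a relevant result."""
    hits, precisions = 0, []
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel > 0:
            hits += 1
            precisions.append(hits / rank)
    return float(np.mean(precisions)) if precisions else 0.0

def mrr_and_map(per_query_labels):
    """per_query_labels: one relevance list per query, sorted by model score."""
    mrr = float(np.mean([reciprocal_rank(r) for r in per_query_labels]))
    map_ = float(np.mean([average_precision(r) for r in per_query_labels]))
    return mrr, map_
```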
Here's the function I used for the calculation:
I'm confused about how the original paper calculates the NDCG metric, in particular how the cutoff K for NDCG@K is chosen, since it isn't mentioned in the paper.
Please help.
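Since the function itself isn't reproduced above, here is a minimal NDCG sketch of my own (not the official relevanceeval.py logic) that makes the two sensitive choices explicit: the cutoff K, and whether the ideal DCG is computed over the retrieved list only or over all annotated items for the query:

```python
import numpy as np

def dcg(relevances, k=None):
    """Discounted cumulative gain of a ranked list of graded relevance labels."""
    rels = np.asarray(relevances, dtype=float)[:k]
    if rels.size == 0:
        return 0.0
    discounts = np.log2(np.arange(2, rels.size + 2))  # log2(rank + 1) for ranks 1..n
    return float(np.sum((2.0 ** rels - 1.0) / discounts))

def ndcg(ranked_relevances, all_relevances=None, k=None):
    """NDCG@k for one query.

    ranked_relevances: labels of the retrieved results, in model-score order.
    all_relevances:    labels of all annotated items for the query; if given,
                       the ideal DCG is computed over this pool, so missing a
                       highly relevant item correctly lowers the score.
    """
    ideal_pool = ranked_relevances if all_relevances is None else all_relevances
    ideal = dcg(sorted(ideal_pool, reverse=True), k)
    return dcg(ranked_relevances, k) / ideal if ideal > 0 else 0.0
```

With only a handful of annotated, mostly relevant snippets per query and almost no 0-labeled distractors in the candidate pool, a score like this comes out high almost regardless of the model, which matches the resolution at the top of this issue.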