Closed: Sewens closed this issue 3 years ago
Oh, I've finally figured out what's going wrong. I was evaluating my ranking model on the Java language only, so I loaded just the Java annotation data from https://github.com/github/CodeSearchNet/blob/master/resources/annotationStore.csv. There are 5 to 10 annotated code snippets for each query, and only a few of them are unrelated. The lack of 0-relevance labels leads to relatively high NDCG scores. To fix this, the annotations for the other five languages are each treated as negative samples in the Java evaluation. With that, I'm closing the issue.
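A minimal sketch of that fix, assuming pandas and the annotationStore.csv column names (Language, Query, GitHubUrl, Relevance, Notes); the local file path and the rule that maps every non-Java row to relevance 0 are my assumptions, not part of the official evaluation script:

```python
import pandas as pd

# Load the full annotation store (all six languages). Column names are
# assumed to follow annotationStore.csv: Language, Query, GitHubUrl, Relevance, Notes.
annotations = pd.read_csv("annotationStore.csv")

# Java annotations keep their human relevance labels (0-3).
java_rows = annotations[annotations["Language"] == "Java"].copy()

# Annotations from the other five languages become 0-relevance candidates
# (assumption: a snippet in another language is never relevant to a Java query).
other_rows = annotations[annotations["Language"] != "Java"].copy()
other_rows["Relevance"] = 0

# Only keep negatives for queries that also appear in the Java subset.
other_rows = other_rows[other_rows["Query"].isin(java_rows["Query"])]

eval_set = pd.concat([java_rows, other_rows], ignore_index=True)
```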
Thanks for looking into this @Sewens!
I've noticed that the official NDCG calculation is here:
https://github.com/github/CodeSearchNet/blob/3f999d599a2383ca5f47f1d7b745316ec7db86d9/src/relevanceeval.py#L75
Based on this, the original paper reports NDCG scores for six languages on the code search task.
I re-implemented a baseline search model based on an MLP and computed the MRR, MAP, and NDCG metrics myself.
The MRR is 0.5128467211800546, MAP is 0.19741363623638755, and NDCG is 0.6274463943748803. Both MRR and MAP seem reasonable, but the NDCG is nearly three times higher than the results reported in the original paper. I don't think this reflects the strength of my baseline model; there must be something wrong with my NDCG implementation.
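For reference, this is a minimal sketch of how MRR and MAP can be computed per query from ranked relevance labels (treating relevance >= 1 as positive is my own assumption); small differences in such choices already shift the absolute numbers:

```python
import numpy as np

def reciprocal_rank(ranked_relevance):
    """1 / rank of the first relevant result, or 0 if nothing relevant is retrieved."""
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0

def average_precision(ranked_relevance):
    """Mean of precision@k over the ranks k that hold a relevant result."""
    hits, precisions = 0, []
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel > 0:
            hits += 1
            precisions.append(hits / rank)
    return float(np.mean(precisions)) if precisions else 0.0

def mrr_and_map(per_query_labels):
    """per_query_labels: one relevance list per query, sorted by model score."""
    mrr = float(np.mean([reciprocal_rank(r) for r in per_query_labels]))
    map_ = float(np.mean([average_precision(r) for r in per_query_labels]))
    return mrr, map_
```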
Here's the function I used for the calculation:
I'm confused about how the original paper calculates the NDCG metric, in particular how the cutoff K for NDCG@K is chosen, since it isn't mentioned in the paper.
Please help.
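Since the function itself isn't reproduced above, here is a minimal NDCG sketch of my own (not the official relevanceeval.py logic) that makes the two sensitive choices explicit: the cutoff K, and whether the ideal DCG is computed over the retrieved list only or over all annotated items for the query:

```python
import numpy as np

def dcg(relevances, k=None):
    """Discounted cumulative gain of a ranked list of graded relevance labels."""
    rels = np.asarray(relevances, dtype=float)[:k]
    if rels.size == 0:
        return 0.0
    discounts = np.log2(np.arange(2, rels.size + 2))  # log2(rank + 1) for ranks 1..n
    return float(np.sum((2.0 ** rels - 1.0) / discounts))

def ndcg(ranked_relevances, all_relevances=None, k=None):
    """NDCG@k for one query.

    ranked_relevances: labels of the retrieved results, in model-score order.
    all_relevances:    labels of all annotated items for the query; if given,
                       the ideal DCG is computed over this pool, so missing a
                       highly relevant item correctly lowers the score.
    """
    ideal_pool = ranked_relevances if all_relevances is None else all_relevances
    ideal = dcg(sorted(ideal_pool, reverse=True), k)
    return dcg(ranked_relevances, k) / ideal if ideal > 0 else 0.0
```

With only a handful of annotated, mostly relevant snippets per query and almost no 0-labeled distractors in the candidate pool, a score like this comes out high almost regardless of the model, which matches the resolution at the top of this issue.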