
Testing Performance of Code Search using NL-PL Embedding Models on CAT benchmark #185

Closed pdhung3012 closed 1 year ago

pdhung3012 commented 2 years ago

Hello all. Following the suggestion of Daya Guo (a developer of CodeBERT) in issue https://github.com/microsoft/CodeBERT/issues/174 on how to design code search using the current NL-PL embedding module, I implemented a code search function on the test split of the CAT dataset (https://arxiv.org/abs/2207.05579). You can find the details of the implementation in this repository:

We use 5 models: UniXcoder, GraphCodeBERT, CodeBERT, RoBERTa, and BERT. For UniXcoder, GraphCodeBERT, and CodeBERT, we use two approaches to extract an embedding for each comment and each code snippet:

- Approach 1 (_1): take the [CLS] embedding of the comment/code token sequence, following https://github.com/microsoft/CodeBERT/issues/112.
- Approach 2 (_2): load the model and extract embeddings following the NL-PL embedding tutorial of UniXcoder.
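To make the two approaches concrete, here is a rough sketch using the Hugging Face `transformers` API. It is not the exact code in my repository (which uses UniXcoder's own helper for Approach 2); the model names, max length, and pooling shown here are illustrative assumptions.

```python
import torch
from transformers import AutoTokenizer, AutoModel

device = "cuda" if torch.cuda.is_available() else "cpu"

def embed(texts, model_name, pooling="cls"):
    """Return L2-normalized sentence embeddings for a list of comments or code snippets."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name).to(device).eval()
    enc = tokenizer(texts, padding=True, truncation=True, max_length=256,
                    return_tensors="pt").to(device)
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state           # (batch, seq_len, hidden)
    if pooling == "cls":                                   # Approach 1 (_1): first ([CLS]) token
        emb = hidden[:, 0, :]
    else:                                                  # Approach 2 (_2): mean over non-padding
        mask = enc["attention_mask"].unsqueeze(-1)         # tokens, as in the UniXcoder example
        emb = (hidden * mask).sum(1) / mask.sum(1)
    return torch.nn.functional.normalize(emb, dim=-1)     # normalize for cosine similarity

# e.g. comment_embs = embed(comments, "microsoft/unixcoder-base", pooling="mean")
#      code_embs    = embed(codes,    "microsoft/unixcoder-base", pooling="mean")
```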

I got the following results:

1) 7584 comment-code pairs in the clean version of tlcodesum:

| Embedding Model | Embedding Size (default) | MRR |
| --- | --- | --- |
| unixcoder_2 | 768 | 45.91% |
| unixcoder_1 | 768 | 36.52% |
| graphcodebert_2 | 768 | 8.10% |
| graphcodebert_1 | 768 | 4.18% |
| codebert_2 | 768 | 0.27% |
| codebert_1 | 768 | 0.60% |
| roberta_1 | 1024 | 0.27% |
| bert_1 | 768 | 0.41% |

2) 8714 comment-code pairs in the raw version of tlcodesum:


| Embedding Model | Embedding Size (default) | MRR |
| --- | --- | --- |
| unixcoder_2 | 768 | 38.51% |
| unixcoder_1 | 768 | 35.51% |
| graphcodebert_2 | 768 | 7.15% |
| graphcodebert_1 | 768 | 1.66% |
| codebert_2 | 768 | 0.32% |
| codebert_1 | 768 | 0.84% |
| roberta_1 | 1024 | 0.21% |
| bert_1 | 768 | 0.43% |

It seems that Approach 2 achieves higher accuracy than Approach 1 for UniXcoder and GraphCodeBERT.
What surprises me is that GraphCodeBERT's result is much lower than UniXcoder's.
Do these results make sense?

You can find the code search algorithm, the code for obtaining the embeddings of queries and code, and the detailed code search results in my forked project above.
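For reference, the retrieval and scoring step is conceptually simple: rank all code snippets for each comment by cosine similarity and compute the mean reciprocal rank of the gold snippet. A minimal sketch (assuming the embeddings are already L2-normalized, as in the function above; this is not the exact evaluation code in the repository):

```python
import torch

def mrr(query_emb: torch.Tensor, code_emb: torch.Tensor) -> float:
    """Each query i has code i as its single relevant result.
    Embeddings are assumed L2-normalized, so the dot product equals cosine similarity."""
    scores = query_emb @ code_emb.T                 # (N, N) similarity matrix
    gold = scores.diag().unsqueeze(1)               # (N, 1) score of the correct pair
    ranks = (scores > gold).sum(dim=1) + 1          # how many codes score higher, plus one
    return (1.0 / ranks.float()).mean().item()

# Usage: print(f"MRR: {mrr(comment_embs, code_embs):.2%}")
```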

Thank you
guoday commented 1 year ago

Makes sense, because GraphCodeBERT doesn't use contrastive learning to learn sentence embeddings.
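(For readers: UniXcoder's pre-training includes a cross-modal contrastive objective that pulls the NL and PL embeddings of the same pair together and pushes apart other pairs in the batch, which is why its off-the-shelf embeddings work for retrieval. Below is a rough, illustrative in-batch contrastive (InfoNCE-style) loss; it is not the exact UniXcoder objective, and the temperature value is an assumption.)

```python
import torch
import torch.nn.functional as F

def info_nce_loss(nl_emb: torch.Tensor, pl_emb: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    """Row i of nl_emb and row i of pl_emb form a positive pair;
    every other code in the batch serves as an in-batch negative."""
    nl = F.normalize(nl_emb, dim=-1)
    pl = F.normalize(pl_emb, dim=-1)
    logits = nl @ pl.T / temperature                       # (B, B) scaled cosine similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)                # maximize diagonal, suppress off-diagonal
```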