Hello all. Following the suggestion of Daya Guo (a developer of CodeBERT; see https://github.com/microsoft/CodeBERT/issues/174) on how to design code search using the current NL-PL embedding module, I implemented a code search function on the test split of the CAT dataset (https://arxiv.org/abs/2207.05579). You can find the details of the implementation in this repository:

We use 5 models: UnixCoder, GraphCodeBERT, CodeBERT, RoBERTa, and BERT. For UnixCoder, GraphCodeBERT, and CodeBERT, we use two approaches to extract an embedding for each comment and each code snippet:

Approach 1 (_1): take the [CLS] embedding from the sequence of tokens in the comment/code, following https://github.com/microsoft/CodeBERT/issues/112.
Approach 2 (_2): load the models following the NL-PL embedding tutorial of UnixCoder.
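As a rough sketch of the two extraction approaches, something like the following could be used. This is not the exact code from the repository: it uses a tiny, randomly initialised RoBERTa-style encoder as a stand-in for the pretrained checkpoints so it runs without downloads (the embeddings are meaningless here), and the mean-pooling-plus-normalisation step for approach 2 is my reading of the UnixCoder demo, so treat it as an assumption.

```python
import torch
from transformers import RobertaConfig, RobertaModel

# Tiny, randomly initialised RoBERTa-style encoder as a stand-in for the real
# checkpoints (e.g. microsoft/codebert-base, microsoft/unixcoder-base), so this
# sketch runs without downloads; the resulting vectors are meaningless here.
config = RobertaConfig(vocab_size=1000, hidden_size=64, num_hidden_layers=2,
                       num_attention_heads=2, intermediate_size=128)
model = RobertaModel(config).eval()

# Stand-in for a tokenised comment or code snippet.
input_ids = torch.randint(0, 1000, (1, 16))

with torch.no_grad():
    hidden = model(input_ids).last_hidden_state      # shape (1, 16, 64)

# Approach 1: use the hidden state of the first ([CLS]) token as the embedding.
cls_embedding = hidden[:, 0, :]                      # shape (1, 64)

# Approach 2 (assumed, after the UnixCoder demo): mean-pool over tokens and
# L2-normalise (with no padding, the plain mean equals a masked average), so a
# dot product between two such embeddings is their cosine similarity.
mean_embedding = hidden.mean(dim=1)
norm_embedding = torch.nn.functional.normalize(mean_embedding, p=2, dim=1)
```

With the real checkpoints, `RobertaModel.from_pretrained(...)` and the corresponding tokenizer would replace the random config and the random `input_ids`.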
I got the following results:

1) 7584 pairs of comment-code in the clean version of tlcodesum:

2) 8714 pairs of comment-code in the raw version of tlcodesum:
| Embedding Model | Embedding Size (default) | MRR |
| --- | --- | --- |
| unixcoder_2 | 768 | 38.51% |
| unixcoder_1 | 768 | 35.51% |
| graphcodebert_2 | 768 | 7.15% |
| graphcodebert_1 | 768 | 1.66% |
| codebert_2 | 768 | 0.32% |
| codebert_1 | 768 | 0.84% |
| roberta_1 | 1,024 | 0.21% |
| bert_1 | 768 | 0.43% |
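For reference, MRR here treats each comment as a query against all code snippets in the set: rank the snippets by cosine similarity to the comment embedding, take the reciprocal rank of the ground-truth snippet, and average over all queries. A minimal sketch of that computation, with toy arrays in place of the actual dataset embeddings:

```python
import numpy as np

def mrr(query_embs, code_embs):
    """Mean Reciprocal Rank for paired comment/code embeddings.

    Row i of query_embs is the comment whose ground-truth snippet is
    row i of code_embs; every comment is scored against every snippet.
    """
    # Cosine similarity = dot product of L2-normalised rows.
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    c = code_embs / np.linalg.norm(code_embs, axis=1, keepdims=True)
    sims = q @ c.T                                   # (n_queries, n_codes)

    # 1-based rank of the correct snippet for each query.
    order = np.argsort(-sims, axis=1)
    ranks = np.argmax(order == np.arange(len(q))[:, None], axis=1) + 1
    return float(np.mean(1.0 / ranks))

# Toy check: 3 comment/code pairs in a 4-dimensional embedding space.
rng = np.random.default_rng(0)
codes = rng.normal(size=(3, 4))
queries = codes + 0.01 * rng.normal(size=(3, 4))    # each query near its code
print(mrr(queries, codes))
```

Since every query competes against the full pool of snippets, a near-random embedding space yields an MRR close to 1/n, which is one way to read the sub-1% rows in the table.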
Approach 2 achieves a higher MRR than approach 1 for UnixCoder and GraphCodeBERT. What surprises me is that GraphCodeBERT's result is much lower than UnixCoder's. Do these results make sense?
You can find the code-search algorithm, the code for obtaining the embeddings from the query and code, and the detailed code-search results in my forked project above.

Thank you.