microsoft / CodeBERT

CodeBERT
MIT License

Testing Performance of Code Search using NL-PL Embedding Models on CAT benchmark #186

Closed pdhung3012 closed 1 year ago

pdhung3012 commented 1 year ago

Hello Sir/Madam,

The CAT benchmark (https://arxiv.org/abs/2207.05579) is a newly curated benchmark that might be useful for other developers. I implemented code to test code search on the four datasets provided in this benchmark: CodeSearchNet, Funcom, PCSD, and TLCodeSum.

The main idea of the code search algorithm is: 1) get the query embedding; 2) get the embedding of each candidate; 3) compute the Euclidean distance between the query embedding and each candidate embedding; 4) report the top-K accuracy of the search based on the calculated distances.
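For readers who want to see the four steps concretely, here is a minimal sketch in plain Python. It is not the code from the pull request: it assumes the embeddings have already been produced by some NL-PL model and are available as lists of floats, and the names `euclidean`, `rank_candidates`, and `top_k_accuracy`, as well as the toy vectors, are hypothetical and chosen only for illustration.

```python
import math
from typing import List, Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_candidates(query_emb: Sequence[float],
                    candidate_embs: List[Sequence[float]]) -> List[int]:
    """Return candidate indices ordered from nearest to farthest (step 3)."""
    distances = [euclidean(query_emb, c) for c in candidate_embs]
    return sorted(range(len(candidate_embs)), key=lambda i: distances[i])


def top_k_accuracy(query_embs: List[Sequence[float]],
                   candidate_embs: List[Sequence[float]],
                   gold_indices: List[int],
                   k: int) -> float:
    """Fraction of queries whose correct candidate is among the k nearest (step 4)."""
    hits = 0
    for query_emb, gold in zip(query_embs, gold_indices):
        if gold in rank_candidates(query_emb, candidate_embs)[:k]:
            hits += 1
    return hits / len(query_embs)


# Toy data standing in for model outputs (assumption: real embeddings
# would come from an NL-PL model and have hundreds of dimensions).
candidates = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]   # step 2: candidate embeddings
queries = [[0.1, 0.0], [0.9, 0.1], [0.4, 0.0]]      # step 1: query embeddings
gold = [0, 1, 1]                                    # index of the correct candidate per query

top1 = top_k_accuracy(queries, candidates, gold, k=1)
top2 = top_k_accuracy(queries, candidates, gold, k=2)
print(f"top-1 accuracy: {top1:.3f}, top-2 accuracy: {top2:.3f}")
```

Ranking every candidate for every query costs time proportional to the number of queries times the number of candidates, which is fine for a benchmark-sized evaluation; for larger candidate pools a vectorized or approximate nearest-neighbour search would be the usual substitute.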

I think it might be useful for other developers, so I made this pull request. If you would like to discuss the results or the algorithm, or have suggestions for improving the code, you are welcome to do so. If the pull request needs to be removed, I apologize.

Sincerely

pdhung3012 commented 1 year ago

Yes

guoday commented 1 year ago

Sorry, this repo only includes code from the CodeBERT series of papers.