microsoft / CodeBERT


Difference in information extracted by embeddings fine-tuned on clone detection vs. code search #236

Closed lazyhope closed 1 year ago

lazyhope commented 1 year ago

I have conducted an experiment with two UniXcoder models, fine-tuned on clone detection and code search respectively (AdvTest for code search, and another Python clone dataset I found on the Internet for clone detection).

First, I encoded all functions from a repository into embeddings and aggregated them into a single mean embedding. Then I repeated this process for different repositories and compared them by their embeddings (a sketch of this pipeline is included below). The results showed that UniXcoder fine-tuned on the code search task produces better similarity scores than when fine-tuned on the clone detection task. This makes me wonder how their embeddings differ in terms of the information extracted from the source code.
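For reference, here is a minimal sketch of the pipeline I described, using the Hugging Face `transformers` API with mean pooling. The checkpoint name `microsoft/unixcoder-base` stands in for whichever fine-tuned model is being compared, and the 512-token truncation limit and cosine comparison are my own assumptions:

```python
import torch
from transformers import AutoTokenizer, AutoModel

device = "cuda" if torch.cuda.is_available() else "cpu"
# Placeholder checkpoint; swap in the clone-detection or code-search model.
tokenizer = AutoTokenizer.from_pretrained("microsoft/unixcoder-base")
model = AutoModel.from_pretrained("microsoft/unixcoder-base").to(device).eval()

@torch.no_grad()
def embed(code: str) -> torch.Tensor:
    """Encode one function into a single L2-normalized vector."""
    inputs = tokenizer(code, truncation=True, max_length=512,
                       return_tensors="pt").to(device)
    hidden = model(**inputs).last_hidden_state  # (1, seq_len, dim)
    # Mean-pool over non-padding tokens.
    mask = inputs["attention_mask"].unsqueeze(-1)
    vec = (hidden * mask).sum(1) / mask.sum(1)
    return torch.nn.functional.normalize(vec, dim=-1).squeeze(0)

def repo_embedding(functions: list[str]) -> torch.Tensor:
    """Aggregate per-function embeddings into one mean repository vector."""
    vecs = torch.stack([embed(f) for f in functions])
    return torch.nn.functional.normalize(vecs.mean(0), dim=-1)

# Cosine similarity between two repositories (vectors are unit-norm):
# sim = repo_embedding(funcs_a) @ repo_embedding(funcs_b)
```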

One hypothesis that I received from ChatGPT and found intriguing is that clone detection embeddings focus on capturing low-level syntactic, structural, and semantic details of code fragments, whereas code search embeddings capture high-level concepts and intent.

Unfortunately, I couldn't find any sources to support this hypothesis, so I would appreciate hearing your opinions on this question. I understand that my experiment may not reflect the true picture, but is there any existing research on this topic that could provide more insight into the differences between embeddings used for clone detection and for code search?

Thanks in advance!

guoday commented 1 year ago

Because code search provides both text and code information, an embedding model fine-tuned on code search will be better.
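For intuition, here is a schematic of the two training signals (illustrative only, not the actual fine-tuning scripts from this repository; `embed_text` and `embed_code` are hypothetical encoder calls, and clone detection is often framed as binary classification rather than the contrastive form shown here):

```python
import torch
import torch.nn.functional as F

def info_nce(anchor: torch.Tensor, positive: torch.Tensor,
             temperature: float = 0.05) -> torch.Tensor:
    """In-batch contrastive loss over L2-normalized embeddings of shape (B, dim)."""
    logits = anchor @ positive.T / temperature
    labels = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, labels)

# Code search: anchors are natural-language queries, positives are code,
# so the embedding must align code with intent expressed in text.
# loss = info_nce(embed_text(queries), embed_code(snippets))

# Clone detection: both sides are code, so the signal only ever rewards
# matching code to code, with no natural-language supervision at all.
# loss = info_nce(embed_code(snippets_a), embed_code(snippets_b))
```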

lazyhope commented 1 year ago

Ok, thank you!