I have collected a large amount of GitHub source code and extracted the functions from it into the CodeSearchNet data format. There are about 3 TB of data, covering C, C++, and other languages not available in CodeSearchNet.
1. How can I pre-train the CodeBERT or GraphCodeBERT model from scratch to do code search? Both the CodeBERT repository and the Siamese demo only provide fine-tuning commands. How is pre-training done? Would it be possible to share the pre-training code?
2. When fine-tuning for code search, is the training corpus the same one used for pre-training?
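To make question 1 more concrete, here is a rough sketch of the kind of pre-training I imagine, using the HuggingFace Trainer with a masked-language-modeling objective over my CodeSearchNet-style JSONL dumps. This is only my guess: it covers the MLM part, not the replaced-token-detection objective from the CodeBERT paper or the data-flow inputs of GraphCodeBERT, and the file path, field names, and hyperparameters are placeholders from my own data.

```python
# Sketch of MLM-only pre-training on CodeSearchNet-style JSONL
# (fields "docstring" and "code" assumed); not the full CodeBERT objective.
from datasets import load_dataset
from transformers import (
    RobertaTokenizerFast,
    RobertaForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = RobertaTokenizerFast.from_pretrained("microsoft/codebert-base")
# Start from the released checkpoint, or build from a fresh RobertaConfig
# for a true from-scratch run.
model = RobertaForMaskedLM.from_pretrained("microsoft/codebert-base")

raw = load_dataset("json", data_files={"train": "train.jsonl"})

def tokenize(batch):
    # Bimodal NL/PL input as a sentence pair: <s> docstring </s></s> code </s>
    return tokenizer(batch["docstring"], batch["code"],
                     truncation=True, max_length=512)

tokenized = raw["train"].map(
    tokenize, batched=True, remove_columns=raw["train"].column_names
)

collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

args = TrainingArguments(
    output_dir="mlm-pretrain",
    per_device_train_batch_size=16,
    num_train_epochs=1,
    fp16=True,
    save_steps=10000,
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
).train()
```

Is something along these lines what you did, or is the actual pre-training pipeline substantially different?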