I have collected a large amount of GitHub source code and extracted the functions from it into the CodeSearchNet data format. There are about 3 TB of data, covering C, C++, and other languages not available in CodeSearchNet.
1. How can I pre-train the CodeBERT or GraphCodeBERT model from scratch to do code search? Both the CodeBERT repository and the Siamese demo only provide fine-tuning commands. How is pre-training done? Would it be possible to share the pre-training code?
2. When fine-tuning for code search, is the training corpus the same one used for pre-training?
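To make question 1 more concrete, here is a rough sketch of the kind of pre-training I imagine, using the HuggingFace Trainer with a masked-language-modeling objective over my CodeSearchNet-style JSONL dumps. This is only my guess: it covers the MLM part, not the replaced-token-detection objective from the CodeBERT paper or the data-flow inputs of GraphCodeBERT, and the file path, field names, and hyperparameters are placeholders from my own data.

```python
# Sketch of MLM-only pre-training on CodeSearchNet-style JSONL
# (fields "docstring" and "code" assumed); not the full CodeBERT objective.
from datasets import load_dataset
from transformers import (
    RobertaTokenizerFast,
    RobertaForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = RobertaTokenizerFast.from_pretrained("microsoft/codebert-base")
# Start from the released checkpoint, or build from a fresh RobertaConfig
# for a true from-scratch run.
model = RobertaForMaskedLM.from_pretrained("microsoft/codebert-base")

raw = load_dataset("json", data_files={"train": "train.jsonl"})

def tokenize(batch):
    # Bimodal NL/PL input as a sentence pair: <s> docstring </s></s> code </s>
    return tokenizer(batch["docstring"], batch["code"],
                     truncation=True, max_length=512)

tokenized = raw["train"].map(
    tokenize, batched=True, remove_columns=raw["train"].column_names
)

collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

args = TrainingArguments(
    output_dir="mlm-pretrain",
    per_device_train_batch_size=16,
    num_train_epochs=1,
    fp16=True,
    save_steps=10000,
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
).train()
```

Is something along these lines what you did, or is the actual pre-training pipeline substantially different?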