Dear authors,

I had a question regarding the pre-training of CodeBERT: how exactly is the pre-training data structured?
In Section 3.2 of the paper, the pre-training input is described as [CLS] + [NL_tokens] + [SEP] + [PL_tokens] + [EOS]. In Section 3.3 and in the CodeSearchNet data, however, the natural language appears after the function definition. Which of these layouts was used to pre-train CodeBERT?
Would it be possible to share (some of) the pre-training samples with the exact pre-processing applied?

Thanks in advance,
You can use the script to extract the pre-training data. In the CodeSearchNet data, each sample has a field called "docstring" for the NL and a field called "function_tokens" for the code without comments.
data.zip
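For illustration, here is a minimal sketch (not the authors' exact pre-processing script) of how one could read those fields from a raw CodeSearchNet JSONL shard and assemble the Section 3.2 input layout. The shard name in the usage comment is hypothetical, a plain whitespace split stands in for the actual tokenizer, and the field names follow the raw CodeSearchNet release, which may differ in later versions (e.g. "code_tokens" instead of "function_tokens"):

```python
# Sketch only: assemble one pre-training input in the Section 3.2 layout.
# Assumptions: raw CodeSearchNet JSONL fields ("docstring", "function_tokens"),
# a hypothetical shard path, and whitespace splitting instead of the real tokenizer.
import gzip
import json

def iter_pairs(path):
    """Yield (nl_tokens, pl_tokens) pairs from a gzipped CodeSearchNet JSONL file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            nl_tokens = record["docstring"].split()  # NL: the docstring
            pl_tokens = record["function_tokens"]    # PL: code without comments
            yield nl_tokens, pl_tokens

def build_input(nl_tokens, pl_tokens):
    """[CLS] + NL tokens + [SEP] + PL tokens + [EOS], as described in Section 3.2."""
    return ["[CLS]"] + nl_tokens + ["[SEP]"] + pl_tokens + ["[EOS]"]

# Usage (hypothetical shard name):
# for nl, pl in iter_pairs("python_train_0.jsonl.gz"):
#     print(build_input(nl, pl)[:12])
#     break
```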