CodeBERT pre-training data #203

Closed: aalkaswan closed this issue 1 year ago

aalkaswan commented 1 year ago

Dear authors,

I have a question about CodeBERT's pre-training data: how exactly is it structured?

Section 3.2 of the paper states that each pre-training input is structured as [CLS] + [NL_tokens] + [SEP] + [PL_tokens] + [EOS]. However, in Section 3.3 and in the CodeSearchNet data itself, the natural language (the docstring) appears after the function definition, inside the code. Which of these two layouts was used to pre-train CodeBERT? (Both are sketched below.)
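For concreteness, here is a minimal sketch of the two layouts I mean (the tokenizer and special-token names are my assumptions, based on CodeBERT's RoBERTa-style vocabulary, not something stated in the paper):

```python
# Minimal sketch of the two candidate layouts (illustrative only).
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")

nl = "Return the sum of two numbers."
pl = "def add(a, b): return a + b"

# Layout from Section 3.2: [CLS] + NL tokens + [SEP] + PL tokens + [EOS]
paper_layout = (
    [tokenizer.cls_token]      # <s>
    + tokenizer.tokenize(nl)
    + [tokenizer.sep_token]    # </s>
    + tokenizer.tokenize(pl)
    + [tokenizer.eos_token]    # </s>
)

# Layout as it appears in raw CodeSearchNet functions: the NL is a
# docstring inside the function body, after the definition line.
codesearchnet_layout = (
    'def add(a, b):\n'
    '    """Return the sum of two numbers."""\n'
    '    return a + b'
)
```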

Would it be possible to share (some of) the pre-training samples with the exact pre-processing applied?

Thanks in advance,

guoday commented 1 year ago

You can use the attached script to extract the pre-training data. In the CodeSearchNet data, each sample has a field called "docstring" for the NL and "function_tokens" for the code with comments removed. data.zip
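For later readers, a minimal sketch of reading those two fields from a raw CodeSearchNet shard and assembling the Section 3.2 input (the shard file name and the whitespace split of the docstring are assumptions on my part; the attached data.zip script is the authoritative preprocessing):

```python
# Minimal sketch: read NL/PL pairs from a CodeSearchNet jsonl.gz shard
# and assemble the [CLS] + NL + [SEP] + PL + [EOS] input from Section 3.2.
import gzip
import json

def iter_pairs(path):
    """Yield (nl_tokens, pl_tokens) pairs from one CodeSearchNet shard."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            nl_tokens = record["docstring"].split()  # NL docstring (assumed whitespace split)
            pl_tokens = record["function_tokens"]    # code tokens, comments removed
            yield nl_tokens, pl_tokens

# Example usage with RoBERTa-style special tokens.
for nl_tokens, pl_tokens in iter_pairs("python_train_0.jsonl.gz"):
    sample = ["<s>"] + nl_tokens + ["</s>"] + pl_tokens + ["</s>"]
    print(sample[:20])
    break
```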