CodeBERT pre-training data #203

Closed: aalkaswan closed this issue 1 year ago

aalkaswan commented 1 year ago

Dear authors,

I have a question about CodeBERT's pre-training data: how exactly is it structured?

Section 3.2 of the paper states that each pre-training input is structured as [CLS] + [NL_tokens] + [SEP] + [PL_tokens] + [EOS]. However, in Section 3.3 and in the CodeSearchNet data itself, the natural language (the docstring) appears after the function definition, inside the code. Which of these two layouts was used to pre-train CodeBERT? (Both are sketched below.)
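For concreteness, here is a minimal sketch of the two layouts I mean (the tokenizer and special-token names are my assumptions, based on CodeBERT's RoBERTa-style vocabulary, not something stated in the paper):

```python
# Minimal sketch of the two candidate layouts (illustrative only).
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")

nl = "Return the sum of two numbers."
pl = "def add(a, b): return a + b"

# Layout from Section 3.2: [CLS] + NL tokens + [SEP] + PL tokens + [EOS]
paper_layout = (
    [tokenizer.cls_token]      # <s>
    + tokenizer.tokenize(nl)
    + [tokenizer.sep_token]    # </s>
    + tokenizer.tokenize(pl)
    + [tokenizer.eos_token]    # </s>
)

# Layout as it appears in raw CodeSearchNet functions: the NL is a
# docstring inside the function body, after the definition line.
codesearchnet_layout = (
    'def add(a, b):\n'
    '    """Return the sum of two numbers."""\n'
    '    return a + b'
)
```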

Would it be possible to share (some of) the pre-training samples with the exact pre-processing applied?

Thanks in advance,

guoday commented 1 year ago

You can use the attached script to extract the pre-training data. In the CodeSearchNet data, each sample has a field called "docstring" for the NL and "function_tokens" for the code with comments removed. data.zip
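For later readers, a minimal sketch of reading those two fields from a raw CodeSearchNet shard and assembling the Section 3.2 input (the shard file name and the whitespace split of the docstring are assumptions on my part; the attached data.zip script is the authoritative preprocessing):

```python
# Minimal sketch: read NL/PL pairs from a CodeSearchNet jsonl.gz shard
# and assemble the [CLS] + NL + [SEP] + PL + [EOS] input from Section 3.2.
import gzip
import json

def iter_pairs(path):
    """Yield (nl_tokens, pl_tokens) pairs from one CodeSearchNet shard."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            nl_tokens = record["docstring"].split()  # NL docstring (assumed whitespace split)
            pl_tokens = record["function_tokens"]    # code tokens, comments removed
            yield nl_tokens, pl_tokens

# Example usage with RoBERTa-style special tokens.
for nl_tokens, pl_tokens in iter_pairs("python_train_0.jsonl.gz"):
    sample = ["<s>"] + nl_tokens + ["</s>"] + pl_tokens + ["</s>"]
    print(sample[:20])
    break
```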