microsoft / CodeBERT

MIT License

What are the training, evaluation, and testing datasets of CodeBERT (the MLM pipeline particularly)? #271

Open Ahmedfir opened 1 year ago

Ahmedfir commented 1 year ago

Dear Madam or Sir,

Could you please provide a list of the projects used for training (including evaluation) CodeBERT, particularly for the CodeBERT-MLM task? From the paper, I understand that you used the dataset provided by the CodeSearchNet challenge, but I could not find which projects were used for training and which were excluded. For each task/pipeline described in the README.md, there is a dedicated folder with the corresponding training and evaluation datasets, except for CodeBERT-MLM. Could you please help me find this information? Any help or guidance is welcome.

Thank you in advance and best regards!

guoday commented 1 year ago

Do you mean the pre-training code for the CodeBERT-MLM model?

Ahmedfir commented 1 year ago

Yes, I mean the list of projects that were used to pre-train the CodeBERT-MLM model. For instance, when I load the model with RobertaForMaskedLM.from_pretrained("microsoft/codebert-base-mlm"), on which projects has this model been pre-trained? Knowing this would let us tell which code was seen and which was unseen during the pre-training of CodeBERT-MLM.
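
For readers following along, here is a minimal sketch of the loading step mentioned above, wrapped in a fill-mask query. The model name comes from the comment. The helper names mask_first and predict_mask are hypothetical, and the sketch assumes the transformers library is installed and the checkpoint can be downloaded.

```python
def mask_first(code: str, target: str, mask_token: str = "<mask>") -> str:
    """Replace the first occurrence of `target` in `code` with the mask token."""
    if target not in code:
        raise ValueError(f"{target!r} does not occur in the code snippet")
    return code.replace(target, mask_token, 1)


def predict_mask(masked_code: str, top_k: int = 5):
    """Return the model's top_k fill-mask predictions (downloads the checkpoint)."""
    # Imported lazily so mask_first works without transformers installed.
    from transformers import RobertaForMaskedLM, RobertaTokenizer, pipeline

    name = "microsoft/codebert-base-mlm"
    model = RobertaForMaskedLM.from_pretrained(name)
    tokenizer = RobertaTokenizer.from_pretrained(name)
    fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
    return fill_mask(masked_code, top_k=top_k)


# Build a masked snippet; pass it to predict_mask(masked) to query the model.
masked = mask_first("if (x is not None) and (x > 1)", "and")
print(masked)
```

Only mask_first runs at import time; predict_mask is left uncalled here because it needs network access to fetch the checkpoint.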

guoday commented 1 year ago

We have not released the pre-training code. We used only the train split of CodeSearchNet to pre-train CodeBERT (https://huggingface.co/datasets/code_search_net).
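
Based on the answer above, one way to approximate the list of "seen" projects is to collect the repository names in the train split of that dataset. The following is only a sketch under assumptions: it presumes each record carries a repository_name field, that the language configuration is named "python", and that the datasets library is installed (depending on its version, loading may need additional arguments). The helper names and the sample records are made up for illustration.

```python
from typing import Iterable, Mapping, Set


def repositories_in(records: Iterable[Mapping[str, str]],
                    field: str = "repository_name") -> Set[str]:
    """Collect the distinct repository names found in the given records."""
    return {record[field] for record in records}


def load_train_repositories(language: str = "python") -> Set[str]:
    """Download the CodeSearchNet train split and list its repositories."""
    # Lazy import: requires the `datasets` package and network access.
    from datasets import load_dataset

    train = load_dataset("code_search_net", language, split="train")
    # Assumption: the column holding the project name is "repository_name".
    return set(train.unique("repository_name"))


# Tiny in-memory stand-in for the train split (made-up records).
sample = [
    {"repository_name": "octo/alpha", "func_name": "f"},
    {"repository_name": "octo/alpha", "func_name": "g"},
    {"repository_name": "octo/beta", "func_name": "h"},
]
seen = repositories_in(sample)
print("octo/alpha" in seen, "octo/gamma" in seen)
```

A repository missing from the resulting set would then be treated as unseen with respect to that language's train split. Since the CodeBERT paper describes pre-training on six languages, the check would need to be repeated for each language to cover the whole pre-training corpus.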

Ahmedfir commented 1 year ago

Thank you very much.