
Details of the baseline methods for code-to-code translation tasks #12

Closed · wasiahmad closed this 4 years ago

wasiahmad commented 4 years ago

Can you provide some details of the PBSMT and RoBERTa (code) methods for the code-to-code translation task? Also, when do you plan to make your paper publicly available?

Imagist-Shuo commented 4 years ago

Hi, thanks for your attention.

For "PBSMT", we use the default settings of Mosesdecoder for phrase-based SMT. The training data is tokenized by the Roberta tokenizer. "Roberta (code)" and "CodeBERT" are pre-training based methods. We use the pre-trained models to initialize the Transformer encoder and fine-tune the translation model with parallel training data. "Roberta (code)" is a Roberta model pre-trained only on code with MLM, while "CodeBERT" (https://arxiv.org/pdf/2002.08155.pdf) is pre-trained on code-text pairs with MLM and replaced token detection learning objectives.

As for the paper, we plan to make it publicly available by this month.

wasiahmad commented 4 years ago

Thank you for your reply. Would you mind sharing a bit more information about GPT-2, CodeGPT, and CodeGPT-adapted models that you use for the text-to-code generation task (on the Concode dataset)?

celbree commented 4 years ago

@wasiahmad Hi, GPT-2, CodeGPT and CodeGPT-adapted are all GPT-style models. For GPT-2, we use the OpenAI GPT-2 model and fine-tune it on the Concode dataset. During training, we concatenate the NL description and the source code into a single example and train it as a language model, computing the loss on the source-code tokens only. CodeGPT and CodeGPT-adapted are models pre-trained on code; you could refer here for details. The fine-tuning process on Concode is the same as for GPT-2.
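
A minimal sketch of that loss masking, assuming Hugging Face transformers (the NL and code strings below are made up; labels set to -100 are ignored by the loss):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

nl = "concatenate two strings"   # NL description from Concode (illustrative)
code = "String f(String a, String b) { return a.concat(b); }"

nl_ids = tokenizer.encode(nl + " ")
code_ids = tokenizer.encode(code) + [tokenizer.eos_token_id]

input_ids = torch.tensor([nl_ids + code_ids])
# -100 masks the NL tokens, so the LM loss is computed on the code tokens only.
labels = torch.tensor([[-100] * len(nl_ids) + code_ids])

loss = model(input_ids=input_ids, labels=labels).loss
loss.backward()
```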

wasiahmad commented 4 years ago

@celbree you said, "CodeGPT and CodeGPT-adapted are pre-trained models on code". Which source code is used to pre-train CodeGPT and CodeGPT-adapted?

celbree commented 4 years ago

We pre-train CodeGPT and CodeGPT-adapted on the Python and Java corpora from the CodeSearchNet dataset, which include 1.1M Python functions and 1.6M Java methods.
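
For reference, the released checkpoints can be loaded with Hugging Face transformers. As I understand it, CodeGPT is trained from scratch on code with a newly learned vocabulary, while CodeGPT-adapted starts from the GPT-2 checkpoint and keeps GPT-2's vocabulary; the hub name below is how the Java adapted variant appears at the time of writing and is worth verifying before use:

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Hub name to verify; the Python variants use "-py" in place of "-java".
name = "microsoft/CodeGPT-small-java-adaptedGPT2"
tokenizer = GPT2Tokenizer.from_pretrained(name)
model = GPT2LMHeadModel.from_pretrained(name)
```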