Hi, thanks for your attention.
For "PBSMT", we use the default settings of Mosesdecoder for phrase-based SMT. The training data is tokenized by the Roberta tokenizer. "Roberta (code)" and "CodeBERT" are pre-training based methods. We use the pre-trained models to initialize the Transformer encoder and fine-tune the translation model with parallel training data. "Roberta (code)" is a Roberta model pre-trained only on code with MLM, while "CodeBERT" (https://arxiv.org/pdf/2002.08155.pdf) is pre-trained on code-text pairs with MLM and replaced token detection learning objectives.
As for the paper, we plan to make it publicly available by this month.
Thank you for your reply. Would you mind sharing a bit more information about the GPT-2, CodeGPT, and CodeGPT-adapted models that you use for the text-to-code generation task (on the Concode dataset)?
@wasiahmad Hi, GPT-2, CodeGPT, and CodeGPT-adapted are all GPT-style models. For GPT-2, we use the OpenAI GPT-2 model and fine-tune it on the Concode dataset. During training, we concatenate the NL description and the source code into one sequence and train it as a language model (computing the loss on the source-code tokens only). CodeGPT and CodeGPT-adapted are models pre-trained on code; you could refer here for details. The fine-tuning process on Concode is the same as for GPT-2.
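For illustration, a minimal sketch of this loss masking with the HuggingFace GPT-2 API (not our exact implementation; the checkpoint and the helper function are placeholders):

```python
# Illustrative sketch: concatenate the NL description and the code into one
# sequence and compute the language-modeling loss on the code tokens only,
# by masking the NL part of the labels with -100 (the ignore index of the
# cross-entropy loss used by GPT2LMHeadModel).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

def concode_lm_loss(nl_description: str, code: str) -> torch.Tensor:
    nl_ids = tokenizer.encode(nl_description)
    code_ids = tokenizer.encode(code) + [tokenizer.eos_token_id]
    input_ids = torch.tensor([nl_ids + code_ids])
    # -100 makes the loss ignore the NL prefix; only code tokens are scored.
    labels = torch.tensor([[-100] * len(nl_ids) + code_ids])
    return model(input_ids=input_ids, labels=labels).loss

loss = concode_lm_loss("concatenate two strings",
                       "String concat(String a, String b) { return a + b; }")
loss.backward()  # followed by an optimizer step during fine-tuning
```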
@celbree You said that CodeGPT and CodeGPT-adapted are models pre-trained on code. Which source code is used to pre-train them?
We pre-train CodeGPT and CodeGPT-adapted on the Python and Java corpora from the CodeSearchNet dataset, which include 1.1M Python functions and 1.6M Java methods.
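For reference, a small sketch of how one could iterate over the function bodies in the public CodeSearchNet jsonl.gz dumps (the path and helper are placeholders; this is not our pre-processing code):

```python
# Illustrative sketch: stream function bodies out of locally downloaded
# CodeSearchNet jsonl.gz files. The public schema stores the raw source of
# each function/method in the "code" field.
import glob
import gzip
import json

def iter_codesearchnet_functions(pattern: str = "CodeSearchNet/java/**/*.jsonl.gz"):
    for path in glob.glob(pattern, recursive=True):
        with gzip.open(path, "rt", encoding="utf-8") as f:
            for line in f:
                yield json.loads(line)["code"]  # raw source of one method

# e.g. feed these strings into the tokenizer / language-model pre-training loop
for i, function in enumerate(iter_codesearchnet_functions()):
    if i >= 3:
        break
    print(function[:80])
```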
Can you provide some details of the PBSMT and RoBERTa (code) methods for the code-to-code translation task? Also, when do you plan to make your paper publicly available?