salesforce / CodeT5

Home of CodeT5: Open Code LLMs for Code Understanding and Generation
https://arxiv.org/abs/2305.07922
BSD 3-Clause "New" or "Revised" License

Hi, I have a small question about the fine-tuning dataset for code summarization #19

Closed Anothernewcomer closed 2 years ago

Anothernewcomer commented 2 years ago

I realize that CodeT5 has already seen the code-comment pairs from CodeSearchNet as its input and output during pre-training, as mentioned in the paper: "Specifically, we regard the NL→PL generation and PL→NL generation as dual tasks and simultaneously optimize the model on them." The model then uses the code-comment pairs from CodeSearchNet again to fine-tune on the code summarization task. Won't that be a problem (since the model has already seen the data)? I'm new to DL, so please forgive me if this is a stupid question Orz

yuewang-cuhk commented 2 years ago

Hi, the training set of the code summarization task on CodeSearchNet is a subset of the pre-training data; this simply follows the setup of previous work (CodeBERT/GraphCodeBERT). The bimodal dual generation objective aims to explore whether explicitly modeling the bidirectional conversion between NL and PL benefits downstream tasks. We find that it indeed improves performance on code summarization, and its effect is similar to multi-task learning.
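
For intuition, below is a minimal sketch of the dual-generation idea using the Hugging Face `transformers` library: each bimodal (code, comment) pair yields two seq2seq examples, PL→NL and NL→PL, optimized with the same encoder-decoder weights. The `Salesforce/codet5-base` checkpoint is the public release; the toy pair, truncation lengths, and the absence of task prefixes are illustrative assumptions, not the actual pre-training pipeline.

```python
# Sketch of bimodal dual generation (not the authors' exact pre-training code):
# each (code, comment) pair is turned into two seq2seq examples,
# PL -> NL (summarization) and NL -> PL (generation).
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-base")
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-base")

# Illustrative bimodal pair; in practice these come from CodeSearchNet.
code = "def add(a, b):\n    return a + b"
comment = "Add two numbers and return the result."

# Build the two dual-direction examples from the same pair.
dual_examples = [
    (code, comment),   # PL -> NL: summarize code into a comment
    (comment, code),   # NL -> PL: generate code from the comment
]

for source, target in dual_examples:
    inputs = tokenizer(source, return_tensors="pt", truncation=True, max_length=256)
    labels = tokenizer(target, return_tensors="pt", truncation=True, max_length=256).input_ids
    # Both directions share the same model, so this loss jointly
    # optimizes NL -> PL and PL -> NL generation.
    loss = model(**inputs, labels=labels).loss
    print(f"{source[:30]!r} -> loss {loss.item():.3f}")
```

Fine-tuning on code summarization then only keeps the PL→NL direction, on the CodeSearchNet subset mentioned above.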