Anothernewcomer closed this issue 2 years ago
Hi, the training set of the code summarization task with CodeSearchNet is a subset of the pre-training data, which simply follows what previous work (CodeBERT/GraphCodeBERT) did. The bimodal dual generation aims to explore whether explicitly modeling the bidirectional conversion between NL and PL benefits different downstream tasks. We find it indeed improves performance on code summarization, and its effect is similar to multi-task learning.
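To make the dual objective concrete, here is a minimal sketch of what "simultaneously optimizing NL→PL and PL→NL on a bimodal pair" looks like as a combined loss. The `model_loss` function is a hypothetical stand-in (a dummy length-based proxy, not CodeT5's actual seq2seq cross-entropy), so only the structure of the objective is illustrated:

```python
def model_loss(source, target):
    # Placeholder for a real seq2seq loss (e.g., cross-entropy over target tokens).
    # Here: a dummy proxy based on length mismatch, just to keep the sketch runnable.
    return abs(len(source) - len(target)) / max(len(source), len(target))

def dual_generation_loss(code, comment):
    # Optimize both directions on the same code-comment pair:
    # NL -> PL (generate code from the comment) and PL -> NL (summarize the code),
    # then combine them into one training signal.
    loss_nl2pl = model_loss(comment, code)
    loss_pl2nl = model_loss(code, comment)
    return 0.5 * (loss_nl2pl + loss_pl2nl)

pair = ("def add(a, b): return a + b", "Add two numbers.")
print(round(dual_generation_loss(*pair), 3))
```

In the real model both directions share the same encoder-decoder parameters, which is why the effect resembles multi-task learning: each direction acts as an auxiliary task for the other.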
I realize that CodeT5 has already seen the code-comment pairs from CodeSearchNet as its input and output during pre-training, as mentioned in the paper: "Specifically, we regard the NL→PL generation and PL→NL generation as dual tasks and simultaneously optimize the model on them." The model then uses the code-comment pairs from CodeSearchNet again to fine-tune on the code summarization task. Won't that be a problem (since the model has already seen the data)? I'm new to DL, so please forgive me if this is a stupid question Orz