salesforce / CodeT5

Home of CodeT5: Open Code LLMs for Code Understanding and Generation
https://arxiv.org/abs/2305.07922
BSD 3-Clause "New" or "Revised" License

Hi, I have a small question about the fine-tuning dataset for code summarization #19

Closed Anothernewcomer closed 2 years ago

Anothernewcomer commented 2 years ago

I realize that CodeT5 has already seen the code-comment pairs from CodeSearchNet as its input and output during pre-training, as mentioned in the paper: "Specifically, we regard the NL→PL generation and PL→NL generation as dual tasks and simultaneously optimize the model on them." The model then uses the code-comment pairs from CodeSearchNet again to fine-tune on the code summarization task. Won't that be a problem (since the model has already seen the data)? I'm new to DL, so please forgive me if this is a stupid question Orz

yuewang-cuhk commented 2 years ago

Hi, the training set of the code summarization task on CodeSearchNet is a subset of the pre-training data; this simply follows the setup of previous work (CodeBERT/GraphCodeBERT). The bimodal dual generation objective aims to explore whether explicitly modeling the bidirectional conversion between NL and PL benefits downstream tasks. We find that it indeed improves performance on code summarization, and its effect is similar to multi-task learning.
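
For intuition, below is a minimal sketch of the dual-generation idea using the Hugging Face `transformers` library: each bimodal (code, comment) pair yields two seq2seq examples, PL→NL and NL→PL, optimized with the same encoder-decoder weights. The `Salesforce/codet5-base` checkpoint is the public release; the toy pair, truncation lengths, and the absence of task prefixes are illustrative assumptions, not the actual pre-training pipeline.

```python
# Sketch of bimodal dual generation (not the authors' exact pre-training code):
# each (code, comment) pair is turned into two seq2seq examples,
# PL -> NL (summarization) and NL -> PL (generation).
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-base")
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-base")

# Illustrative bimodal pair; in practice these come from CodeSearchNet.
code = "def add(a, b):\n    return a + b"
comment = "Add two numbers and return the result."

# Build the two dual-direction examples from the same pair.
dual_examples = [
    (code, comment),   # PL -> NL: summarize code into a comment
    (comment, code),   # NL -> PL: generate code from the comment
]

for source, target in dual_examples:
    inputs = tokenizer(source, return_tensors="pt", truncation=True, max_length=256)
    labels = tokenizer(target, return_tensors="pt", truncation=True, max_length=256).input_ids
    # Both directions share the same model, so this loss jointly
    # optimizes NL -> PL and PL -> NL generation.
    loss = model(**inputs, labels=labels).loss
    print(f"{source[:30]!r} -> loss {loss.item():.3f}")
```

Fine-tuning on code summarization then only keeps the PL→NL direction, on the CodeSearchNet subset mentioned above.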