salesforce / CodeT5

Home of CodeT5: Open Code LLMs for Code Understanding and Generation
https://arxiv.org/abs/2305.07922
BSD 3-Clause "New" or "Revised" License

Removing comments from training data #75

Closed Debdeep1998 closed 1 year ago

Debdeep1998 commented 1 year ago

Do we need to remove comments from the code snippets we provide in the training data? I fine-tuned CodeT5 large, and judging from the generated text, the model does not seem to distinguish between commented code and compilable code.

yuewang-cuhk commented 1 year ago

Hi, this might depend on your use case. We believe the common practice in code pretraining is to keep the comments in the training data. But if you want to train specifically on text-code bimodal tasks such as code search, you might need to separate the code comments from the code snippets.
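For readers who want to try the separation step, here is a minimal sketch for Python snippets only, using the standard-library `tokenize` module. It is not part of the CodeT5 codebase; the function name `split_comments` and the sample snippet are hypothetical, and other languages would need their own tokenizer or parser.

```python
import io
import tokenize


def split_comments(source):
    """Return (code_without_comments, comments) for a Python snippet.

    Hypothetical helper for illustration; not a CodeT5 API.
    """
    lines = source.splitlines()
    comments = []
    emptied = set()  # indices of lines that held nothing but a comment
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT:
            row, col = tok.start  # row is 1-based, col is 0-based
            comments.append(tok.string.lstrip("#").strip())
            # Cut the line at the comment start and drop trailing spaces.
            lines[row - 1] = lines[row - 1][:col].rstrip()
            if not lines[row - 1]:
                emptied.add(row - 1)
    # Drop lines that became empty; keep the original blank lines.
    code = "\n".join(l for i, l in enumerate(lines) if i not in emptied)
    return code, comments


snippet = "# add two numbers\ndef add(a, b):\n    return a + b  # sum\n"
code, comments = split_comments(snippet)
print(code)
print(comments)
```

Using the tokenizer rather than a regular expression means a `#` inside a string literal is not mistaken for a comment. Docstrings are string tokens, so this sketch leaves them in the code, and snippets that fail to tokenize (for example, an unterminated triple-quoted string) raise an error and would need to be skipped or handled separately.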