Closed Debdeep1998 closed 1 year ago
Hi, this depends on your use case. We believe the common practice in code pretraining is to keep comments in the training data. But if you want to train specifically on text-code bimodal tasks such as code search, you might need to separate the comments from the code snippets.
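For anyone who does want to separate comments from code, here is a minimal sketch for Python snippets using the standard-library `tokenize` module. The function name `split_comments` is made up for illustration and is not part of CodeT5 or its preprocessing. It assumes the snippet tokenizes as valid Python, and it treats only `#` comments as comments (docstrings are left in the code). Other languages would need their own lexer.

```python
import io
import tokenize


def split_comments(source: str) -> tuple[str, list[str]]:
    """Return (code_without_comments, comments) for a Python snippet.

    Hypothetical helper for illustration only.
    """
    # Split on "\n" so row numbers match the tokenizer's line numbering.
    lines = source.split("\n")
    comments = []
    comment_rows = set()

    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT:
            row, col = tok.start  # row is 1-based, col is 0-based
            comments.append(tok.string.lstrip("#").strip())
            # A comment always runs to the end of its line, so cut there.
            lines[row - 1] = lines[row - 1][:col].rstrip()
            comment_rows.add(row)

    # Drop lines that held nothing but a comment; keep other blank lines.
    kept = [
        line
        for row, line in enumerate(lines, start=1)
        if not (row in comment_rows and line == "")
    ]
    return "\n".join(kept), comments


snippet = 'x = 1  # set x\n# standalone\ns = "# not a comment"\n'
code, comments = split_comments(snippet)
print(code)
print(comments)
```

Because the split is driven by tokens rather than a regex, a `#` inside a string literal is left alone. The `(code, comments)` pair could then be used as the two sides of a text-code example, or the comments could simply be discarded.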
Do we need to remove comments from the code snippets that we provide as training data? I fine-tuned CodeT5-large, and judging from the generated text, it is not able to distinguish between commented-out code and compilable code.