salesforce / CodeT5

Home of CodeT5: Open Code LLMs for Code Understanding and Generation
https://arxiv.org/abs/2305.07922
BSD 3-Clause "New" or "Revised" License

Compilable code #74

Closed: Debdeep1998 closed this issue 1 year ago

Debdeep1998 commented 1 year ago

I've fine-tuned CodeT5-large on a small Python dataset (~1,700 data points). The results are more or less correct, but the generated code is not always compilable because of inconsistent spacing and newline characters. Any idea how to fix this? Also, how does CodeBLEU work if the code generated by the model isn't compilable? The model might generate non-compilable code during the initial phases of training, right?

yuewang-cuhk commented 1 year ago

Hi there, we cannot guarantee that the generated code is compilable or well formatted, because we use the code files for pretraining directly, without normalization or refactoring. You might consider adding a post-processing step that reformats the code generated by our models.

Debdeep1998 commented 1 year ago

Hi, thanks. Could you point us to the post-processing steps we would need to adopt?
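The thread does not spell out any specific steps, so the following is only a minimal sketch of what such a post-processing step might look like for generated Python code. It is not part of CodeT5. The helper names (`normalize_whitespace`, `is_compilable`, `postprocess`) are hypothetical, and the sketch assumes the main problems are mixed line endings, tabs, trailing spaces, and a uniform extra indent. It uses only the standard library: it normalizes whitespace, then uses `ast.parse` to check whether the result is syntactically valid.

```python
import ast
import textwrap
from typing import Tuple


def normalize_whitespace(code: str) -> str:
    """Apply conservative whitespace fixes to a generated Python snippet."""
    # Unify Windows/old-Mac line endings to "\n".
    code = code.replace("\r\n", "\n").replace("\r", "\n")
    # Assumption: a tab stands for indentation up to the next multiple of 4.
    code = code.expandtabs(4)
    # Drop trailing spaces on every line.
    code = "\n".join(line.rstrip() for line in code.split("\n"))
    # Remove indentation shared by all lines, plus leading/trailing blank lines.
    code = textwrap.dedent(code).strip("\n")
    return code + "\n"


def is_compilable(code: str) -> bool:
    """Return True if the snippet parses as Python (it is not executed)."""
    try:
        ast.parse(code)
    except (SyntaxError, ValueError):
        # ValueError covers inputs such as null bytes on some Python versions.
        return False
    return True


def postprocess(code: str) -> Tuple[str, bool]:
    """Normalize the snippet and report whether the result parses."""
    cleaned = normalize_whitespace(code)
    return cleaned, is_compilable(cleaned)


if __name__ == "__main__":
    raw = "  def add(a, b):\r\n\treturn a + b   \r\n\r\n"
    cleaned, ok = postprocess(raw)
    print(repr(cleaned), ok)
```

Snippets that still fail the parse check could be logged or filtered out, and snippets that do parse could additionally be passed through a formatter such as black or autopep8 for consistent style. Note that the whitespace fixes above also touch the contents of multi-line string literals, which may or may not be acceptable for a given dataset.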