salesforce / CodeT5

Home of CodeT5: Open Code LLMs for Code Understanding and Generation
https://arxiv.org/abs/2305.07922

Questions about data preprocessing #80

Open qkim2525 opened 1 year ago

qkim2525 commented 1 year ago

Hi.

My two colleagues and I are interested in replicating the results of CodeT5-base on the code generation task with our own dataset. However, we're running into a few hiccups with data preprocessing, and we hope you don't mind a few questions.

Mainly, we're wondering how you handled the Google BigQuery data alongside CodeSearchNet. To our knowledge, CodeSearchNet consists of nicely isolated, function-level code blocks, whereas the BigQuery data and the extra C/C# data from open-source GitHub repositories are, as far as we can tell, not presented in such a convenient format.

Our own data is in a similar position: none of it is isolated into function-level blocks; instead, each sample is mostly a complete file on its own. We were wondering whether you did any preprocessing of your own so that the extra C/C# data would match the format of CodeSearchNet, or whether you simply used it raw.
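
For concreteness, here is a rough sketch of the kind of function-level extraction we have in mind (this is just our guess, not your pipeline). It is illustrated with Python's stdlib `ast` module; we assume C/C# would need a real parser such as tree-sitter instead:

```python
# Sketch (our assumption, not the authors' preprocessing): split whole source
# files into function-level blocks so they resemble CodeSearchNet samples.
import ast

def extract_functions(source: str) -> list[str]:
    """Return the source text of every function defined in a Python file."""
    tree = ast.parse(source)
    lines = source.splitlines()
    funcs = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # lineno/end_lineno are 1-indexed; end_lineno needs Python 3.8+
            funcs.append("\n".join(lines[node.lineno - 1:node.end_lineno]))
    return funcs

if __name__ == "__main__":
    example = "import os\n\ndef add(a, b):\n    return a + b\n"
    print(extract_functions(example))  # ['def add(a, b):\n    return a + b']
```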

And if you did use it raw, did that affect performance compared to training the model only on CodeSearchNet data? Thank you in advance.

P.S. My colleagues are also wondering how you dealt with whitespace, since the paper wasn't entirely clear on that. One argues that you discarded whitespace altogether, while the other argues that you only collapsed runs of whitespace into a single instance, e.g. A. '\s\s\s' --> '' vs. B. '\s\s\s' --> '\s'.
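
To make the two readings concrete, here is a tiny sketch of what each interpretation would do to a sample line (again, these are our interpretations, not your confirmed preprocessing):

```python
# Two readings of the paper's whitespace handling that we are debating.
import re

code = "x  =   1"

# A. discard whitespace altogether
variant_a = re.sub(r"\s+", "", code)   # -> "x=1"

# B. collapse each run of whitespace into a single space
variant_b = re.sub(r"\s+", " ", code)  # -> "x = 1"

print(variant_a)
print(variant_b)
```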