Dataset for fine-tuning on Python for the code generation task

salesforce / CodeT5

Home of CodeT5: Open Code LLMs for Code Understanding and Generation

BSD 3-Clause "New" or "Revised" License

Dear Sir,

For the text-to-code generation task, the model is fine-tuned on the Concode Java dataset. However, I want to fine-tune the model on a Python dataset. While figuring out how to do this, I came across the following issue: https://github.com/salesforce/CodeT5/issues/36, where it is mentioned that we can fine-tune on the Python subset of CodeSearchNet.

However, the Python subset of CodeSearchNet contains many fields, such as repo, path, url, original string, etc., whereas the Concode dataset contains only two fields for each function: code and nl. Could you please guide me on how to create a similar two-field dataset for Python, so that I can fine-tune the model for text-to-code generation on Python?
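To make the question concrete, here is a rough sketch of the kind of conversion I have in mind. It is not taken from the CodeT5 repository. It assumes each CodeSearchNet record is one JSON object per line with `docstring_tokens` and `code_tokens` lists, and that the fine-tuning script only needs `code` and `nl` keys per line. The helper names `to_concode_record` and `convert_lines` are hypothetical.

```python
import json


def to_concode_record(record):
    """Map one CodeSearchNet-style record to a two-field {"code", "nl"} dict.

    Assumption: the record carries "docstring_tokens" and "code_tokens" lists.
    Returns None when either side is empty so the caller can skip the record.
    """
    nl = " ".join(record.get("docstring_tokens", []))
    code = " ".join(record.get("code_tokens", []))
    if not nl or not code:
        return None
    return {"code": code, "nl": nl}


def convert_lines(lines):
    """Convert an iterable of JSON lines into Concode-style JSON lines."""
    converted_lines = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        converted = to_concode_record(json.loads(line))
        if converted is not None:
            converted_lines.append(json.dumps(converted))
    return converted_lines
```

One caveat with this sketch: joining `code_tokens` with spaces discards Python's indentation and newlines, so the target code is not directly runnable. Using the raw code string instead would keep the layout, but as far as I can tell it also contains the docstring, which would then need to be stripped so the answer does not leak into the target.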

Dataset for fine-tuning on python for Code generation task #42