salesforce / CodeT5

Home of CodeT5: Open Code LLMs for Code Understanding and Generation
https://arxiv.org/abs/2305.07922
BSD 3-Clause "New" or "Revised" License

Dataset for fine-tuning on python for Code generation task #42

Closed hitesh-anand closed 2 years ago

hitesh-anand commented 2 years ago

Dear Sir,

For the text-to-code generation task, the model is fine-tuned on the Concode Java dataset, but I want to fine-tune it on a Python dataset. While figuring out how to do this, I came across the following issue: https://github.com/salesforce/CodeT5/issues/36, where it is mentioned that we can fine-tune on the Python subset of CodeSearchNet.

However, the Python subset of CodeSearchNet contains various fields such as repo, path, url, original string, etc., whereas the Concode dataset contains only two fields for each function: code and nl. Could you please guide me on how to create a similar dataset for Python, so that I can fine-tune the model for text-to-code generation on Python?

yuewang-cuhk commented 2 years ago

Hi, if you want to use the Python subset of CodeSearchNet to train a text-to-code generation model, you can get the nl and code information from it as well. Besides the fields you listed, the CodeSearchNet dataset also contains docstrings (nl) and code_tokens (code). You just need to filter out the examples with empty docstrings.
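
A minimal sketch of that conversion, not taken from the CodeT5 repository. It assumes each CodeSearchNet record is one JSON object per line with a `docstring` string and a `code_tokens` list, and that the target is a Concode-style JSON-lines file with only `nl` and `code` keys. The function names and file paths are hypothetical, so check them against your copy of the data.

```python
import json


def to_nl_code_pairs(records):
    """Yield Concode-style {"nl", "code"} dicts from CodeSearchNet-style records.

    Assumed input fields: "docstring" (str) and "code_tokens" (list of str).
    """
    for rec in records:
        # Collapse newlines and indentation in the docstring into single spaces.
        nl = " ".join((rec.get("docstring") or "").split())
        # Join the code tokens into one space-separated string.
        code = " ".join(rec.get("code_tokens") or [])
        # Skip examples with an empty docstring (or no code at all).
        if not nl or not code:
            continue
        yield {"nl": nl, "code": code}


def convert_file(src_path, dst_path):
    """Read a JSON-lines file, write the filtered pairs, return how many were kept."""
    kept = 0
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        records = (json.loads(line) for line in src if line.strip())
        for pair in to_nl_code_pairs(records):
            dst.write(json.dumps(pair) + "\n")
            kept += 1
    return kept
```

For example, `convert_file("python_train.jsonl", "train.json")` (both paths hypothetical) would write one `{"nl": ..., "code": ...}` object per line. If your copy of the dataset is gzip-compressed, opening the source with `gzip.open(path, "rt", encoding="utf-8")` instead of `open` should work the same way.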