salesforce / CodeT5

Home of CodeT5: Open Code LLMs for Code Understanding and Generation
https://arxiv.org/abs/2305.07922
BSD 3-Clause "New" or "Revised" License

Pre-training dataset #13

Closed by mciniselli 2 years ago

mciniselli commented 2 years ago

Hi, thank you for this amazing model! I was wondering whether you could share the dataset of 8.3M methods used for pre-training.

Thank you very much! Matteo

yuewang-cuhk commented 2 years ago

Hi, thanks for your interest! We are still in the process of resolving the potential risks of releasing the extra C/C# data collected from BigQuery. For the CodeSearchNet data, we use all examples outside the validation and test splits for pre-training CodeT5. You can access this data from its official repo.
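
For readers who want to reproduce the CodeSearchNet part of that selection, here is a minimal sketch of the filtering step. It is not the authors' script. It assumes the CodeSearchNet release stores one JSON object per line in gzipped `.jsonl.gz` files and that each record carries a `partition` field with values such as `train`, `valid`, or `test`; the helper names `keep_for_pretraining` and `read_jsonl_gz` are hypothetical.

```python
import gzip
import json
from pathlib import Path
from typing import Iterable, Iterator

# Assumption: CodeSearchNet marks held-out records with these partition values.
HELD_OUT = {"valid", "test"}


def keep_for_pretraining(records: Iterable[dict]) -> Iterator[dict]:
    """Yield every record that is not in the validation or test split.

    Records without a "partition" field are kept, since they are not
    marked as held out.
    """
    for record in records:
        if record.get("partition") not in HELD_OUT:
            yield record


def read_jsonl_gz(path: Path) -> Iterator[dict]:
    """Read one JSON object per non-empty line from a gzipped JSONL file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)
```

To use it, chain the two helpers over each downloaded file, for example `keep_for_pretraining(read_jsonl_gz(Path("some_file.jsonl.gz")))`, where the path is a placeholder for wherever the data was unpacked. This covers only the CodeSearchNet portion; the extra C/C# data from BigQuery mentioned above is not included.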