The idea is to collect a dataset similar to LIMA, but focused on the task of answering Python programming questions.
Is LIMA alone not enough? The honest answer is that we don't know (and nobody knows). But our thinking is that answering programming questions differs from answering most other questions, because code snippets need to be included, commented, explained, etc.
Outcomes:
Learning: Is such specific instruction tuning beneficial? If so, how much?
Dataset: We will end up with a fine-tuning dataset that we can share on HF if we want
Model: If the conclusion is positive, this is a necessary step towards the big goal of training a Python model
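On the dataset outcome: a minimal sketch of how the QA pairs could be serialized for sharing. The JSONL layout, field names, and example pairs below are assumptions for illustration; JSONL is a format that HF's `datasets` library can load directly via `load_dataset("json", data_files=...)`.

```python
import json
from pathlib import Path

# Hypothetical QA pairs; the "source" field tracks provenance
# (StackOverflow, GitHub discussions, handcrafted, ...).
qa_pairs = [
    {
        "instruction": "How do I reverse a list in Python?",
        "response": "Use slicing: `my_list[::-1]` returns a reversed copy.",
        "source": "handcrafted",
    },
    {
        "instruction": "Why does a mutable default argument like `def f(x=[])` misbehave?",
        "response": "Default arguments are evaluated once at definition time, "
                    "so the same list is shared across calls.",
        "source": "stackoverflow",
    },
]

path = Path("python_qa.jsonl")
with path.open("w", encoding="utf-8") as f:
    for pair in qa_pairs:
        f.write(json.dumps(pair, ensure_ascii=False) + "\n")

# Round-trip check: every line is a self-contained JSON record.
loaded = [json.loads(line) for line in path.open(encoding="utf-8")]
assert loaded == qa_pairs
```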
Data sources:
LIMA uses a variety of data sources, one of which is StackOverflow. However, LIMA contains only 200 StackExchange instruction pairs in total, only a fraction of which come from SO, and only a fraction of those concern Python.
Therefore, mining SO for high quality Python questions still makes sense.
Overall, I think we can use the following sources:
StackOverflow
QA sections of popular projects (FastAPI, pydantic etc.)
GitHub discussions of popular projects
Forum posts, e.g. from the pytorch forum
Handcrafted QA pairs
Progress:
StackOverflow
Original data is about 100GB, comprising about 117M posts (questions + answers)
Using the LIMA criteria, this was filtered down to about 860k QA pairs (~1.9GB)
We can now more easily apply additional filters to the smaller dataset
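A sketch of what one of these quality filters on the SO pairs could look like. The thresholds, field names, and the accepted-answer requirement below are assumptions for illustration, not the exact LIMA criteria:

```python
def passes_filters(question: dict, answer: dict,
                   min_q_score: int = 10, min_a_score: int = 5,
                   max_answer_chars: int = 4096) -> bool:
    """Keep only high-quality, Python-tagged QA pairs (hypothetical criteria)."""
    if "python" not in question.get("tags", []):
        return False
    if question.get("score", 0) < min_q_score:
        return False
    # Prefer accepted, well-scored answers of reasonable length.
    if not answer.get("accepted", False):
        return False
    if answer.get("score", 0) < min_a_score:
        return False
    if len(answer.get("body", "")) > max_answer_chars:
        return False
    return True

# Example usage on toy records:
q = {"tags": ["python", "list"], "score": 42}
a = {"accepted": True, "score": 12, "body": "Use slicing: my_list[::-1]"}
print(passes_filters(q, a))  # True
```

Keeping each criterion as a separate early return makes it easy to measure how much each filter shrinks the 860k pairs independently.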
QA sections of popular projects (FastAPI, pydantic etc.)
Scraped the QA sections of sklearn, seaborn, PyTorch, scrapy, lightgbm, pypy, and python (242 pairs)
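GitHub Discussions (where many projects host their QA sections) can be queried via GitHub's GraphQL API. Below is a sketch of turning such a response into instruction pairs; the field names follow GitHub's discussions schema, but the response content is made up and the actual HTTP call is omitted:

```python
# A trimmed example of what a GraphQL response for repository
# discussions might look like (fields follow GitHub's schema;
# the content itself is invented for illustration).
sample_response = {
    "data": {
        "repository": {
            "discussions": {
                "nodes": [
                    {
                        "title": "How to validate nested models?",
                        "body": "I have a model containing another model...",
                        "answer": {"body": "Declare the inner model as a field type..."},
                    },
                    {
                        "title": "Roadmap discussion",
                        "body": "What features are planned?",
                        "answer": None,  # unanswered -> skip
                    },
                ]
            }
        }
    }
}

def extract_qa_pairs(response: dict, source: str) -> list[dict]:
    """Turn answered discussions into instruction/response pairs."""
    nodes = response["data"]["repository"]["discussions"]["nodes"]
    pairs = []
    for node in nodes:
        if node.get("answer"):  # only keep discussions with a marked answer
            pairs.append({
                "instruction": f"{node['title']}\n\n{node['body']}",
                "response": node["answer"]["body"],
                "source": source,
            })
    return pairs

pairs = extract_qa_pairs(sample_response, source="github_discussions")
print(len(pairs))  # 1
```

Restricting to discussions with a marked answer mirrors the accepted-answer filter on SO: the project maintainers have already curated which response is correct.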