jina-ai / jerboa

LLM finetuning
Apache License 2.0

Python QA instruction tuning dataset #77

Closed JohannesMessner closed 1 year ago

JohannesMessner commented 1 year ago

The idea is to collect a dataset similar to LIMA, but focused on answering Python programming questions.

Is LIMA alone not enough? The honest answer is that we don't know (and nobody does). Our thinking is that answering programming questions differs from answering most other questions, because code snippets need to be included, commented, explained, etc.

Outcomes:

Data sources: LIMA draws on a variety of data sources, one of which is StackOverflow (SO). However, LIMA contains only 200 StackExchange instruction pairs in total; only a fraction of those come from SO, and only a fraction of those are about Python. Mining SO for high-quality Python questions therefore still makes sense.
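To make "mining SO for high-quality Python questions" a bit more concrete, here is a rough sketch of what a first filtering pass could look like. Everything in it is an assumption for illustration: the record fields (`tags`, `score`, `question`, `accepted_answer`), the function name `select_python_pairs`, and the score threshold are hypothetical and do not come from LIMA or from any existing code in this repo.

```python
from typing import Iterable


def select_python_pairs(records: Iterable[dict], min_score: int = 10) -> list[dict]:
    """Keep Python-tagged questions that score well and have an accepted answer.

    The record layout and the default threshold are hypothetical; a real
    SO export would need to be mapped onto these fields first.
    """
    pairs = []
    for rec in records:
        # Skip anything that is not tagged as a Python question.
        if "python" not in rec.get("tags", []):
            continue
        # Use the question score as a crude proxy for quality.
        if rec.get("score", 0) < min_score:
            continue
        # Without an accepted answer there is no response to learn from.
        answer = rec.get("accepted_answer")
        if not answer:
            continue
        pairs.append({"instruction": rec["question"], "response": answer})
    return pairs


# Tiny made-up sample, only to show the shape of the input and output.
sample = [
    {"tags": ["python", "list"], "score": 42,
     "question": "How do I reverse a list?",
     "accepted_answer": "Use `my_list[::-1]` or `my_list.reverse()`."},
    {"tags": ["java"], "score": 99,
     "question": "How do I reverse an array?",
     "accepted_answer": "Use `Collections.reverse`."},
    {"tags": ["python"], "score": 3,
     "question": "Why does my code not work?",
     "accepted_answer": "Please post the traceback."},
    {"tags": ["python"], "score": 50,
     "question": "What is a metaclass?",
     "accepted_answer": None},
]

selected = select_python_pairs(sample)
```

Score and accepted-answer status are only crude quality signals; a LIMA-style dataset would most likely still need manual review on top of a filter like this.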

Overall, I think we can use the following sources:

Progress: