instructlab / sdg

Python library for Synthetic Data Generation
Apache License 2.0
12 stars 28 forks source link

Add precomputed dataset to skills data generation #171

Open bbrowning opened 1 month ago

bbrowning commented 1 month ago

As a followup to #163, we need to figure out the right way to wire a precomputed dataset into the skills data generation. One example of such a dataset is https://github.com/instructlab/training/blob/9fdeb87820d5000f7be60a199c4e24aec725772e/sample-data/train_all_pruned_SDG.jsonl , however downstream uses of InstructLab, CI, or other reasons will warrant the ability to change this out in some way.

The initial implementation dropped out of scope from #163 had a placeholder in src/instructlab/sdg/configs/skills/data_recipe/default_recipe.yaml like below:

datasets:
  - path: <path_to_dataset>
    sampling_size: 1.0

Discussing this with the community, @shivchander suggested we may want to pull this dataset from somewhere like HuggingFace. So, there's work to be done to figure out where the dataset should live, how the user gets it (explicitly pulls, implicitly pulls as needed, caching, etc), and how downstream uses, CI, or other scenarios will overwrite this precomputed dataset with their own.

bbrowning commented 1 month ago

The current iteration of #163 does not read recipe files at all, so if recipe files are how the precomputed datasets will be read then we'll want to plumb some version of reading from them back in once that lands.

Also, we'll likely need the ability to specify the system prompt that was used / to use with each precomputed dataset. Previously this was read from recipe files, so that may need to be added back as well if configurable system prompts are needed to properly use precomputed datasets.