instructlab / sdg

Python library for Synthetic Data Generation
https://pypi.org/project/instructlab-sdg/
Apache License 2.0

Pull taxonomy precomputed dataset from Hugging Face #201

Open aakankshaduggal opened 4 months ago

aakankshaduggal commented 4 months ago

To improve skills training, we need a precomputed dataset that gets mixed with the newly generated synthetic dataset. This is mostly used during full training via instructlab/training, and is less important for the simpler legacy training in instructlab/instructlab.

The taxonomy precomputed dataset is hosted on Hugging Face: https://huggingface.co/datasets/instructlab/InstructLabCommunity

We need a way to incorporate this dataset from Hugging Face and mix it with the synthetically generated data during an ilab data generate run. One proposal is at https://github.com/aakankshaduggal/sdg/pull/20, but after some discussion at https://github.com/instructlab/sdg/pull/203#issuecomment-2250444499 we want to implement this slightly differently, so that we never hit Hugging Face silently or automatically and instead require an explicit step to download the precomputed dataset.

Users will have to download the dataset and place it in an appropriate cache directory. Potentially, there could be an ilab data download command to do this, which would be a nicer user experience than asking them to manually download it to the appropriate directory.
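A minimal sketch of what such an explicit download step could look like, assuming the huggingface_hub package is installed. The cache location (~/.cache/instructlab/datasets) and all three helper names are hypothetical and are not taken from ilab or the sdg library; only the dataset repo id comes from this issue.

```python
import os
from pathlib import Path

# Dataset repo named in this issue.
REPO_ID = "instructlab/InstructLabCommunity"


def precomputed_dataset_dir(repo_id=REPO_ID, cache_root=None):
    """Return the local directory for a dataset (hypothetical cache layout)."""
    if cache_root is None:
        xdg = os.environ.get("XDG_CACHE_HOME", str(Path.home() / ".cache"))
        cache_root = Path(xdg) / "instructlab" / "datasets"
    # Flatten "org/name" into a single directory name.
    return Path(cache_root) / repo_id.replace("/", "--")


def download_precomputed_dataset(repo_id=REPO_ID, cache_root=None):
    """Explicit download step; data generation itself never calls this."""
    # Imported lazily so the other helpers work without huggingface_hub.
    from huggingface_hub import snapshot_download

    target = precomputed_dataset_dir(repo_id, cache_root)
    target.mkdir(parents=True, exist_ok=True)
    snapshot_download(repo_id=repo_id, repo_type="dataset", local_dir=str(target))
    return target


def require_precomputed_dataset(repo_id=REPO_ID, cache_root=None):
    """Return the cached dataset directory, or fail loudly if it is missing."""
    target = precomputed_dataset_dir(repo_id, cache_root)
    if not target.is_dir() or not any(target.iterdir()):
        raise FileNotFoundError(
            f"Precomputed dataset {repo_id!r} not found in {target}; "
            "run the explicit download step first."
        )
    return target
```

With this split, generation would only ever call require_precomputed_dataset, so a missing dataset produces an error instead of a silent network request.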

The name and/or path of the precomputed dataset could be supplied with e.g. ilab data generate --skills-dataset=, and ilab would then construct a simple skills recipe (in memory) and pass it to the library. We could limit this to supplying a single precomputed skills dataset, or perhaps we want to allow the user to specify a list of precomputed skills and/or knowledge datasets on the command line?
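To make the in-memory recipe idea concrete, here is a rough sketch in plain Python. The recipe shape (a "datasets" list with "path" and "sampling_size" keys), the comma-separated flag value, and the function names are all assumptions for illustration; they are not confirmed to match what the sdg library actually accepts.

```python
import random


def parse_dataset_arg(value):
    """Split a hypothetical comma-separated flag value into dataset paths."""
    return [part.strip() for part in value.split(",") if part.strip()]


def build_skills_recipe(dataset_paths, sampling_size=1.0):
    """Build a simple in-memory recipe with one entry per precomputed dataset."""
    return {
        "datasets": [
            {"path": str(path), "sampling_size": sampling_size}
            for path in dataset_paths
        ]
    }


def mix_datasets(generated, recipe, load_fn, seed=42):
    """Combine generated samples with samples drawn from each recipe entry.

    load_fn maps a dataset path to a list of samples, so the caller decides
    how the files are read (for example from the local cache directory).
    """
    rng = random.Random(seed)
    mixed = list(generated)
    for entry in recipe["datasets"]:
        samples = load_fn(entry["path"])
        size = entry["sampling_size"]
        # A float is treated as a fraction of the dataset, an int as a count.
        count = int(len(samples) * size) if isinstance(size, float) else size
        count = min(count, len(samples))
        mixed.extend(rng.sample(samples, count))
    rng.shuffle(mixed)
    return mixed
```

Accepting a list of paths from the start keeps the single-dataset case trivial while leaving room for multiple skills and/or knowledge datasets later.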

We'll want to test this precomputed dataset in our e2e CI, which means ensuring we download and cache the dataset there so it's available at data generation time.

[Edited by @bbrowning to incorporate changes from https://github.com/instructlab/sdg/pull/203#issuecomment-2250444499 as well as in-person discussions with @aakankshaduggal.]

github-actions[bot] commented 1 week ago

This issue has been automatically marked as stale because it has not had activity within 90 days. It will be automatically closed if no further activity occurs within 30 days.

bbrowning commented 1 week ago

still relevant, as we aren't yet mixing in the community precomputed dataset