instructlab / sdg

Python library for Synthetic Data Generation
Apache License 2.0
12 stars 28 forks

Include precomputed dataset and datamixing recipes #234

Open aakankshaduggal opened 1 month ago

aakankshaduggal commented 1 month ago

Related to https://github.com/instructlab/sdg/issues/201

We decided not to move forward with the Hugging Face approach. However, to get better results with SDG + train, we need a precomputed dataset that the newly generated data will be mixed with.

A couple of approaches we could take --

  1. Support ilab data download - this would pull the data from InstructLab's Hugging Face org.
  2. Allow users to store their own precomputed dataset at a defined path.
  3. If neither of these is defined, either skip mixing entirely or fall back to a default download from Hugging Face.
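The fallback order in the three options above could be sketched roughly as follows. This is purely illustrative: the function name, the cache directory, and the dataset filename are all assumptions, not part of the actual ilab CLI.

```python
import os

# Assumed cache location for datasets fetched by a hypothetical
# `ilab data download`; the real path would be platform-dependent.
DEFAULT_DOWNLOAD_DIR = os.path.expanduser("~/.cache/instructlab/datasets")


def resolve_precomputed_dataset(user_path=None, allow_download=False):
    """Return the dataset path to mix with, or None for no mixing."""
    # Approach 2: a user-supplied path wins if it exists.
    if user_path and os.path.exists(user_path):
        return user_path
    # Approach 1: fall back to a previously downloaded default.
    default = os.path.join(DEFAULT_DOWNLOAD_DIR, "precomputed.jsonl")
    if os.path.exists(default):
        return default
    if allow_download:
        # Placeholder: here `ilab data download` would fetch the
        # dataset from InstructLab's Hugging Face org.
        raise NotImplementedError("download flow not sketched here")
    # Approach 3 fallback: nothing found, so no mixing happens.
    return None
```

Whatever shape this ends up taking, the key design point is the precedence: explicit user path, then downloaded default, then no mixing.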
bbrowning commented 1 month ago

The only way to specify a default dataset today is to supply a default recipe yaml file for knowledge and/or skills. These would reside at a path like /usr/share/instructlab/sdg/default_data_recipes/skills.yaml, ~/.local/share/instructlab/sdg/default_data_recipes/skills.yaml, etc. (the exact path is system-dependent, resolved via platformdirs.PlatformDirs). So a user could do this today by hand-writing a default recipe at the correct path. Or, something like ilab data download could download that dataset from Hugging Face, place it into an appropriate path, and then write out a default recipe that references it.
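To make the search order above concrete, here is a stdlib-only sketch of how those candidate paths could be probed. The real code uses platformdirs.PlatformDirs, so the exact directories vary by OS; the two paths below just mirror the Linux examples in this comment.

```python
import os

# Candidate locations for a default recipe, mirroring the system-wide
# and per-user paths mentioned above (Linux-flavored; platformdirs
# would produce OS-appropriate equivalents).
def default_recipe_candidates(recipe_name="skills.yaml"):
    suffix = os.path.join(
        "instructlab", "sdg", "default_data_recipes", recipe_name
    )
    return [
        os.path.join("/usr/share", suffix),                         # system-wide
        os.path.join(os.path.expanduser("~/.local/share"), suffix),  # per-user
    ]


def find_default_recipe(recipe_name="skills.yaml"):
    """Return the first default recipe file that exists, or None."""
    for path in default_recipe_candidates(recipe_name):
        if os.path.isfile(path):
            return path
    return None
```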

Once the default recipe file gets in the right place, the rest of the existing data generation code should automatically pick up and use that recipe for mixing.

bbrowning commented 1 month ago

Thinking more from a user's point-of-view, is downloading one or more precomputed datasets a different task from creating a recipe that uses those datasets? Would I want to ilab data download <some other HF dataset>, just like I can download different models? Where do those datasets get stored on disk when I do so? Once I've downloaded them, how do I generate a recipe to use them? How do I pass my custom dataset and/or recipe into ilab data generate?
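One possible answer to "how do I generate a recipe to use them?" is a helper that, after a download, writes out a minimal recipe pointing at the dataset. The recipe keys below (datasets, path, sampling_size) are an assumed shape for illustration only; the real schema lives in the sdg data-mixing code.

```python
import os

def write_recipe_for_dataset(dataset_path, recipe_path):
    """Write a minimal recipe yaml referencing dataset_path.

    The yaml structure here is hypothetical, not the confirmed
    sdg recipe schema.
    """
    recipe = (
        "datasets:\n"
        f"  - path: {dataset_path}\n"
        "    sampling_size: 1.0\n"  # assumed knob: mix in the whole dataset
    )
    os.makedirs(os.path.dirname(recipe_path), exist_ok=True)
    with open(recipe_path, "w", encoding="utf-8") as f:
        f.write(recipe)
    return recipe_path
```

If ilab data download wrote such a file into the default recipe location, the existing mixing code would pick it up with no extra flags on ilab data generate.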

And, all of this is only relevant for users with big hardware doing the full data generation pipeline and phased training, right? Does the precomputed dataset impact the output at all for any user doing legacy training, simple pipeline, or non-phased training?

markmc commented 1 month ago

xref #237