Open aakankshaduggal opened 4 months ago
This issue has been automatically marked as stale because it has not had activity within 90 days. It will be automatically closed if no further activity occurs within 30 days.
still relevant, as we aren't yet mixing in the community precomputed dataset
To improve skills training, we need a precomputed dataset that is mixed with the newly generated synthetic dataset. This matters mostly for full training via instructlab/training, and is less important for the simpler legacy training in instructlab/instructlab.
The taxonomy precomputed dataset is hosted on Hugging Face: https://huggingface.co/datasets/instructlab/InstructLabCommunity
We need a way to incorporate this dataset from Hugging Face and mix it with the synthetically generated data during an `ilab data generate` run. One proposal is at https://github.com/aakankshaduggal/sdg/pull/20, but after some discussion at https://github.com/instructlab/sdg/pull/203#issuecomment-2250444499 we want to implement this somewhat differently, so that we never hit Hugging Face silently or automatically and instead require an explicit step to download the precomputed dataset.

- Users will have to download the dataset and place it in an appropriate cache directory. Potentially, there could be an `ilab data download` command that offers a nicer user experience than asking them to manually download it to the appropriate directory.
- Allow the name and/or path of the precomputed dataset to be supplied with e.g. `ilab data generate --skills-dataset=`, and ilab would construct a simple skills recipe (in memory) and pass it to the library. We could limit this to a single precomputed skills dataset, or perhaps we want to allow the user to specify a list of precomputed skills and/or knowledge datasets on the command line?
- We'll want to test this precomputed dataset in our e2e CI, which means ensuring we download and cache the dataset there so it's available at data generation time.
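To make the proposal a bit more concrete, here is a rough sketch of the flow: resolve the precomputed dataset from a local cache only (no network access), build a simple in-memory recipe, and mix the datasets. Everything here is an assumption for illustration, not the actual instructlab/sdg API: the function names, the recipe layout (`datasets`, `sampling_size`), and the JSONL file format are all hypothetical.

```python
import json
import random
from pathlib import Path


def resolve_precomputed(name_or_path: str, cache_dir: Path) -> Path:
    """Find the dataset locally; never download it implicitly."""
    candidate = Path(name_or_path)
    if not candidate.is_file():
        # Fall back to looking the name up in the cache directory.
        candidate = cache_dir / name_or_path
    if not candidate.is_file():
        raise FileNotFoundError(
            f"{name_or_path!r} not found locally or in {cache_dir}; "
            "download it explicitly first"
        )
    return candidate


def build_skills_recipe(synthetic: Path, precomputed: Path) -> dict:
    """Build a simple in-memory recipe (hypothetical layout)."""
    return {
        "datasets": [
            {"path": str(synthetic), "sampling_size": 1.0},
            {"path": str(precomputed), "sampling_size": 1.0},
        ]
    }


def read_jsonl(path: str) -> list:
    """Read one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def mix(recipe: dict, seed: int = 0) -> list:
    """Sample each dataset by its sampling_size fraction, then shuffle."""
    rng = random.Random(seed)
    mixed = []
    for entry in recipe["datasets"]:
        rows = read_jsonl(entry["path"])
        k = min(len(rows), round(len(rows) * entry["sampling_size"]))
        mixed.extend(rng.sample(rows, k))
    rng.shuffle(mixed)
    return mixed
```

The point of `resolve_precomputed` raising an error instead of fetching is to keep the download an explicit user step, as discussed above. Supporting a list of precomputed datasets would just mean appending more entries to the recipe.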
[Edited by @bbrowning to incorporate changes from https://github.com/instructlab/sdg/pull/203#issuecomment-2250444499 as well as in-person discussions with @aakankshaduggal.]