Closed: bbrowning closed this issue 4 months ago
This is related to #95, but I felt it warranted its own issue: this one is mostly about taking the data mixing implementation done in another fork and getting it ready to merge back into this repo, while the other epic tracks the actual implementation of data mixing and is likely already done, for some value of done.
- Determine if https://github.com/aakankshaduggal/sdg/pull/6 is a hard prerequisite for data mixing or can be done separately; if separate, create an issue to track merging batching and parallel generation
- Determine if https://github.com/aakankshaduggal/sdg/pull/9 is a hard prerequisite for data mixing or can be done separately; if separate, create an issue to track merging caching
Yep, these can come after mixing.
Great, thanks @shivchander for that confirmation. I created separate issues to track batching/parallel (#167) and caching (#168), and updated the description above to link to those.
Added some additional items in the issue description where changes may be needed in instructlab/instructlab and/or instructlab/training to handle the new data-mixed filenames, or we may need to output filenames that are compatible with the existing prefix standard of `train_*`. I'm not sure which way to proceed there yet, but will track that down.
After discussion with others offline, I took the approach of outputting additional files in the legacy train/test JSONL formats expected by the legacy Linux training code in `ilab`. This gets the e2e CI job passing now. I've also tested manual generate/train workflows using the simple pipeline with legacy training, but have not yet verified that the full pipeline or the new training work here.
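
For illustration only, here is a minimal sketch of what emitting those legacy files could look like, assuming the legacy records use system/user/assistant keys and the `train_*`/`test_*` naming convention mentioned above; the function name, record shape, and split logic are hypothetical, not the actual SDG code:

```python
import json
import os

def write_legacy_files(samples, output_dir, suffix, test_fraction=0.1):
    """Split samples and write legacy-format train/test JSONL files."""
    n_test = max(1, int(len(samples) * test_fraction))
    splits = {
        f"test_{suffix}.jsonl": samples[:n_test],
        f"train_{suffix}.jsonl": samples[n_test:],
    }
    for filename, split in splits.items():
        with open(os.path.join(output_dir, filename), "w", encoding="utf-8") as f:
            for sample in split:
                # Assumed legacy record shape: system/user/assistant keys.
                legacy = {
                    "system": sample.get("system", ""),
                    "user": sample["user"],
                    "assistant": sample["assistant"],
                }
                f.write(json.dumps(legacy) + "\n")
```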
Overview
The research team that developed InstructLab's processes has determined that we need a way to mix generated datasets before training. This is necessary to get the best results we can when adding knowledge to a model.
This issue tracks the work across the SDG and other repos required to implement this change.
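
As a rough sketch of the concept (not the actual SDG implementation), mixing boils down to sampling from each generated dataset according to a recipe and writing out one combined JSONL file. The recipe structure, field names, and `mix_datasets` helper below are all hypothetical:

```python
import json
import random

def mix_datasets(recipe, output_path, seed=42):
    """Sample from each dataset in the recipe and write one mixed JSONL file."""
    rng = random.Random(seed)
    mixed = []
    for entry in recipe["datasets"]:
        with open(entry["path"], encoding="utf-8") as f:
            samples = [json.loads(line) for line in f if line.strip()]
        # Down-sample according to the recipe's ratio, keeping at least
        # one sample so tiny datasets are never dropped entirely.
        ratio = entry.get("sampling_ratio", 1.0)
        k = min(len(samples), max(1, int(len(samples) * ratio)))
        mixed.extend(rng.sample(samples, k))
    rng.shuffle(mixed)
    with open(output_path, "w", encoding="utf-8") as f:
        for sample in mixed:
            f.write(json.dumps(sample) + "\n")

recipe = {
    "datasets": [
        {"path": "generated/knowledge.jsonl", "sampling_ratio": 1.0},
        {"path": "generated/skills.jsonl", "sampling_ratio": 0.5},
    ]
}
mix_datasets(recipe, "mixed_train.jsonl")
```

Presumably a recipe file such as `default_recipe.yaml` carries similar per-dataset entries in YAML form.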
instructlab/sdg repository
0.2.0 milestone: https://github.com/instructlab/sdg/milestone/4
In-progress PR at #163
- Fix `src/instructlab/sdg/configs/skills/data_recipe/default_recipe.yaml` causing `<path_to_dataset>` to actually get used as a path when attempting skill data generation
- Fix `build_raft_dataset` in `parse_and_convert.py` to not infinitely loop when working with a small dataset, such as the 1-2 pieces of generated data that we'll encounter in the "simple" CI pipeline or with users testing locally with very small numbers of instructions (see the sketch after this list)
- Fix `generate_data.py` where, if a knowledge taxonomy leaf gets generated first, it treats all subsequent taxonomy leaves as knowledge even though they may be skills, which blows up
- Output `train_*.jsonl` and `test_*.jsonl` files compatible with legacy training
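
For the infinite-loop item above, the fix generally amounts to capping the number of sampled items at what the dataset can actually supply instead of retrying until enough distinct items appear. A minimal sketch of that kind of guard, with a hypothetical function name and shape (not the actual `parse_and_convert.py` code):

```python
import random

def sample_distractors(documents, current_doc, k=3, seed=42):
    """Pick up to k distractor documents, tolerating tiny datasets."""
    rng = random.Random(seed)
    candidates = [d for d in documents if d != current_doc]
    # Cap k at what is actually available instead of looping until k
    # distinct items turn up; with only 1-2 generated samples, such a
    # retry loop would never terminate.
    return rng.sample(candidates, min(k, len(candidates)))
```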