instructlab / sdg

Python library for Synthetic Data Generation
Apache License 2.0

[Epic] Support for mixing generated datasets before training #162

Closed bbrowning closed 1 month ago

bbrowning commented 1 month ago

Overview

The research team that developed InstructLab's processes has determined that we need a way to mix generated datasets before training. This is necessary to get the best results we can when adding knowledge to a model.

This issue tracks the work across the SDG and other repos required to implement this change.
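To make the goal concrete, dataset mixing here means combining several generated datasets into one training set, sampling each according to a weight. The sketch below is a hypothetical illustration of that idea, not the actual SDG implementation; the function name, weight semantics, and seed handling are all assumptions.

```python
import random


def mix_datasets(datasets, weights, seed=0):
    """Mix several generated datasets into one training set.

    `datasets` is a list of sample lists; `weights` gives, per dataset,
    the fraction of its samples to include. Sampled entries are combined
    and shuffled deterministically via `seed`.

    This is an illustrative sketch, not InstructLab's implementation.
    """
    rng = random.Random(seed)
    mixed = []
    for samples, weight in zip(datasets, weights):
        # Keep a weighted subset of each source dataset.
        k = int(len(samples) * weight)
        mixed.extend(rng.sample(samples, k))
    rng.shuffle(mixed)
    return mixed
```

For example, mixing a 10-sample skills dataset at weight 1.0 with a 10-sample knowledge dataset at weight 0.5 yields 15 samples drawn from both sources.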

instructlab/sdg repository

0.2.0 milestone: https://github.com/instructlab/sdg/milestone/4

In-progress PR at #163

bbrowning commented 1 month ago

This is related to #95, but I felt it warranted its own issue here: this one is mostly about taking the data mixing implementation done in another fork and getting it ready to merge back into this repo, while the other epic mostly tracks the actual implementation of data mixing and is likely already done, for some value of done.

shivchander commented 1 month ago

> determine if https://github.com/aakankshaduggal/sdg/pull/6 is a hard prerequisite for data mixing or can be done separately - if separate, create an issue to track merging batching and parallel generation
>
> determine if https://github.com/aakankshaduggal/sdg/pull/9 is a hard prerequisite for data mixing or can be done separately - if separate, create an issue to track merging caching

Yep, these can come after mixing

bbrowning commented 1 month ago

Great, thanks @shivchander for that confirmation. I created separate issues to track batching/parallel (#167) and caching (#168), and updated the description above to link to those.

bbrowning commented 1 month ago

Added some additional items to the issue description for changes that may be needed in instructlab/instructlab and/or instructlab/training to handle the new data-mixed filenames; alternatively, we may need to output filenames compatible with the existing train_* prefix convention. I'm not sure which way to proceed there yet, but will track that down.

bbrowning commented 1 month ago

After discussion with others offline, I took the approach of outputting additional files in the legacy train/test jsonl formats expected by the legacy Linux training code in ilab. This gets the e2e CI job passing now. I've also tested manual generate/train workflows using the simple pipeline with legacy training, but have not yet verified that the full pipeline or the new training path works here.