instructlab / sdg

Python library for Synthetic Data Generation
Apache License 2.0

[Epic] Support for mixing generated datasets before training #162

Closed bbrowning closed 1 month ago

bbrowning commented 1 month ago

Overview

The research team that developed InstructLab's processes has determined that we need a way to mix generated datasets before training. This is necessary to get the best results we can when adding knowledge to a model.

This issue tracks the work across the SDG and other repos required to implement this change.
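To make the goal concrete, dataset mixing here means combining several generated datasets into one training set, sampling each according to a weight. The sketch below is a hypothetical illustration of that idea, not the actual SDG implementation; the function name, weight semantics, and seed handling are all assumptions.

```python
import random


def mix_datasets(datasets, weights, seed=0):
    """Mix several generated datasets into one training set.

    `datasets` is a list of sample lists; `weights` gives, per dataset,
    the fraction of its samples to include. Sampled entries are combined
    and shuffled deterministically via `seed`.

    This is an illustrative sketch, not InstructLab's implementation.
    """
    rng = random.Random(seed)
    mixed = []
    for samples, weight in zip(datasets, weights):
        # Keep a weighted subset of each source dataset.
        k = int(len(samples) * weight)
        mixed.extend(rng.sample(samples, k))
    rng.shuffle(mixed)
    return mixed
```

For example, mixing a 10-sample skills dataset at weight 1.0 with a 10-sample knowledge dataset at weight 0.5 yields 15 samples drawn from both sources.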

instructlab/sdg repository

0.2.0 milestone: https://github.com/instructlab/sdg/milestone/4

In-progress PR at #163

bbrowning commented 1 month ago

This is related to #95, but I felt it warranted its own issue here: this one is mostly about taking the data mixing implementation done in another fork and getting it ready to merge back into this repo, while the other epic mostly tracks the actual implementation of data mixing and is likely already done, for some value of done.

shivchander commented 1 month ago

> determine if https://github.com/aakankshaduggal/sdg/pull/6 is a hard prerequisite for data mixing or can be done separately - if separate, create an issue to track merging batching and parallel generation
>
> determine if https://github.com/aakankshaduggal/sdg/pull/9 is a hard prerequisite for data mixing or can be done separately - if separate, create an issue to track merging caching

Yep, these can come after mixing

bbrowning commented 1 month ago

Great, thanks @shivchander for that confirmation. I created separate issues to track batching/parallel (#167) and caching (#168), and updated the description above to link to those.

bbrowning commented 1 month ago

Added some additional items to the issue description for changes that may be needed in instructlab/instructlab and/or instructlab/training to handle the new data-mixed filenames; alternatively, we may need to output filenames compatible with the existing train_* prefix convention. I'm not sure which way to proceed there yet, but will track that down.

bbrowning commented 1 month ago

After discussion with others offline, I took the approach of outputting additional files in the legacy train/test jsonl formats expected by the legacy Linux training code in ilab. This gets the e2e CI job passing now. I've also tested manual generate/train workflows using the simple pipeline with legacy training, but have not yet verified that the full pipeline or the new training path works here.