instructlab / sdg

Python library for Synthetic Data Generation
Apache License 2.0
12 stars 28 forks

Checkpoint files make iterating on a taxonomy awkward #245

Open bbrowning opened 1 month ago

bbrowning commented 1 month ago

When we create a checkpoint file, it has no notion of the version of the qna.yaml that it came from. We just assume a qna.yaml is unchanging and write a checkpoint file that gets used for that leaf node even if later on the qna.yaml is updated or changed entirely.

When writing a new knowledge or skill, a common workflow for me is to make a first attempt at the qna.yaml, run data generation, and see how the generated data looks. Then I may tweak my qna.yaml (change contexts, questions and answers, or adjust the actual knowledge docs themselves) and repeat this several more times until I'm getting good data generation results. With the addition of checkpoint files, I now have to remember to manually remove all the checkpoints for this taxonomy leaf node every time before I re-run the data generation step. Otherwise, it just picks up the old checkpoint file, even though I've since changed the qna.yaml in a way that makes that old checkpoint no longer valid.

It would be great if the checkpoints were somehow tied to the qna.yaml itself, so that if I change the qna.yaml in any way it knows to regenerate data there instead of reusing the checkpoint. Perhaps something as simple as calculating a hash of the qna.yaml file and embedding that in the name, directory, or content of the checkpoints would suffice?
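A minimal sketch of the hash-in-the-path idea: digest the qna.yaml contents and fold the digest into the checkpoint directory, so an edited qna.yaml naturally resolves to a fresh (empty) checkpoint location. The function names (`qna_fingerprint`, `checkpoint_dir`) and layout are hypothetical, not part of the SDG library.

```python
import hashlib
from pathlib import Path


def qna_fingerprint(qna_path: str) -> str:
    """Return a short hex digest of the qna.yaml file contents."""
    data = Path(qna_path).read_bytes()
    return hashlib.sha256(data).hexdigest()[:12]


def checkpoint_dir(base_dir: str, leaf_node: str, qna_path: str) -> Path:
    """Derive a checkpoint directory that changes whenever qna.yaml changes.

    Any edit to the qna.yaml yields a new fingerprint, so old checkpoints
    are simply never looked up again (they can be garbage-collected later).
    """
    return Path(base_dir) / leaf_node / qna_fingerprint(qna_path)
```

Embedding the hash in the directory (rather than inside each checkpoint file) keeps the lookup cheap: the loader only has to stat one path instead of opening and validating every checkpoint.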

derekhiggins commented 1 month ago

It would be great if the checkpoints were somehow tied to the qna.yaml itself, so that if I change the qna.yaml in any way it knows to regenerate data there instead of reusing the checkpoint. Perhaps something as simple as calculating a hash of the qna.yaml file and embedding that in the name, directory, or content of the checkpoints would suffice?

In addition to the qna.yaml file, it's probably also worth including the model being used and the contents of the pipeline in the hash, as the user may also be switching models or iterating on the development of a new custom pipeline.
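Folding all three inputs into one digest could look like the sketch below. The function name and the length-prefixing scheme are assumptions for illustration, not SDG code; the point is that any change to the qna.yaml, the model identifier, or the pipeline config produces a different fingerprint.

```python
import hashlib
from pathlib import Path


def generation_fingerprint(qna_path: str, model: str, pipeline_path: str) -> str:
    """Digest covering everything that affects the generated data:
    qna.yaml contents, model identifier, and pipeline config contents."""
    h = hashlib.sha256()
    parts = (
        Path(qna_path).read_bytes(),
        model.encode("utf-8"),
        Path(pipeline_path).read_bytes(),
    )
    for part in parts:
        # Length-prefix each part so concatenation is unambiguous
        # (e.g. "ab" + "c" never hashes the same as "a" + "bc").
        h.update(len(part).to_bytes(8, "big"))
        h.update(part)
    return h.hexdigest()[:12]
```

With this, switching models or editing a custom pipeline invalidates the old checkpoints automatically, the same way an edited qna.yaml does.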

bbrowning commented 1 month ago

@derekhiggins Nice foresight there - some testers just hit a case like you described, where they changed the pipeline used from one run to the next and it picked up old data from the previous pipeline, causing issues and requiring them to blow away the generated datasets to get going again.