Open ktam3 opened 1 month ago
A few high-level items that need some thought and/or discussion:

- Do we need to track multiple `ilab data generate` runs (sessions? not sure what to call these) that we might want to be able to resume? For example, if I run data generation on taxonomy A, taxonomy B, a different branch/commit of taxonomy B, taxonomy C with the simple pipeline, or taxonomy C with the full pipeline, each of these would have different saved state that we'd need to keep track of. Or is the assumption that the user can only ever resume the most recent `ilab data generate` run, and must use the exact same set of args, taxonomy, taxonomy branch/commit, etc. as when they started it? Is this even a concern of SDG, or do we task the CLI with this type of bookkeeping?
- Some of the above is complex. If that's too complex to get started, what's our happy-path case that we want to ensure works? Is it good enough to assume a user runs a single `ilab data generate ...`, it gets interrupted, and then they run it again and we continue from where they left off automatically? And if they change any params, docs, config, etc., do they have to manually blow away the entire state we've saved to start from a clean slate? We've already seen that requiring users to manually clean SDG checkpoints between runs can be confusing and lead to errors if they forget. Are we ok continuing down that path, or is this the right time to make this more friendly?
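One way to sidestep most of the bookkeeping questions above is to key saved state by a hash of the run's parameters: an identical rerun finds its previous state, while a changed taxonomy, branch/commit, or pipeline lands in a fresh directory automatically, so the user never has to clean checkpoints by hand. A minimal sketch of the idea — the `checkpoint_dir` helper and the parameter names are illustrative assumptions, not part of any existing InstructLab API:

```python
import hashlib
import json
from pathlib import Path


def checkpoint_dir(base: Path, params: dict) -> Path:
    """Derive a per-run checkpoint directory from the run's parameters.

    Serializing the params deterministically (sort_keys=True) and hashing
    them means any change to taxonomy, branch/commit, pipeline, etc.
    produces a different directory, so runs never clobber each other's
    state and stale state is simply never found.
    """
    key = hashlib.sha256(
        json.dumps(params, sort_keys=True).encode("utf-8")
    ).hexdigest()[:16]
    return base / key


# An identical rerun maps to the same directory and can resume;
# changing any parameter maps to a new, empty one.
base = Path("/tmp/sdg_state")
run_a = checkpoint_dir(base, {"taxonomy": "A", "branch": "main", "pipeline": "full"})
rerun_a = checkpoint_dir(base, {"taxonomy": "A", "branch": "main", "pipeline": "full"})
run_b = checkpoint_dir(base, {"taxonomy": "A", "branch": "main", "pipeline": "simple"})
```

This keeps resume-vs-reset entirely implicit, at the cost of silently abandoning old state rather than warning the user about it.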
Per discussion with @aakankshaduggal, a higher-priority feature has been introduced: #271
Though this is still in scope for 1.2, Product Management is aware that the team's focus is to complete #271, which may cause this feature to slip. @bbrowning may take point on this instead, with Aakanksha helping if she has additional bandwidth.
**Feature Overview (aka. Goal Summary)**

When running the SDG cycle (`ilab generate`), execution continues until all the provided documents are processed and the number of requested samples is generated.
This Feature card is to enhance InstructLab so that an interrupted SDG run can be resumed from its saved state rather than restarted from scratch.
**Goals (aka. expected user outcomes)**

The observable functionality that the user now has due to receiving this feature. Include the anticipated primary user type/persona and which existing features will be expanded. Complete during New status.
**Requirements (aka. Acceptance Criteria)**

A list of specific needs or objectives the feature must deliver to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, and usability. Initial completion during Refinement status.