Closed russellb closed 2 days ago
I'm going to try to help this along
Goal: to have ilab data generate use the new SDG API (Flows, Pipelines, Blocks, etc.) and support "full SDG"
(Re - "full SDG" - I'm a bit unclear how we define that precisely)
Constraint: avoid causing any regression in "simple SDG" (aka "small model support")
Step #1 - re-implement instructlab.sdg.generate_data.generate_data()
using the new API - i.e. model the existing "simple SDG" using the new API, and then re-write generate_data() to use it
Step #2 - replace the call to instructlab.sdg.generate_data.generate_data()
in instructlab/data/generate.py
with calls to the new API - i.e. move the relatively simple code in generate_data()
back into instructlab
Step #3 - add "full SDG" support in ilab data generate - e.g. by using the more advanced flows available via the new API
This PR is step #1
We can merge this PR once we're confident it is in good enough shape that it doesn't regress the existing ilab data generate (unless there's a particular regression we decide we're ok with)
I've been doing some hacking to get up to speed: https://github.com/russellb/sdg/compare/new-cli-integration...markmc:sdg:new-cli-integration?expand=1
Not being deeply familiar with the existing "simple SDG" flow though, I'm going to be slow in getting the new implementation up to parity. Happy to have help!
Next steps AIUI:
e2e test failure is:
Generate synthetic data - Wed Jun 26 18:32:00 UTC 2024
...
Generating synthetic data using 'models/merlinite-7b-lab-Q4_K_M.gguf' model, taxonomy:'taxonomy' against http://127.0.0.1:8000/v1 server
INFO 2024-06-26 18:32:13,652 generate_data.py:222: generate_data Synthesizing new instructions. If you aren't satisfied with the generated instructions, interrupt training (Ctrl-C) and try adjusting your YAML files. Adding more examples may help.
INFO 2024-06-26 18:32:13,670 pipeline.py:48: generate Running block: gen_knowledge
INFO 2024-06-26 18:32:13,670 pipeline.py:49: generate Dataset({
features: ['task_description', 'document', 'domain', 'question_1', 'response_1', 'question_2', 'response_2', 'question_3', 'response_3'],
num_rows: 6
})
INFO 2024-06-26 18:32:56,378 generate_data.py:276: generate_data Generated 5 samples
INFO 2024-06-26 18:32:56,379 generate_data.py:282: generate_data Generation took 49.68s
...
Train the model - Wed Jun 26 18:32:57 UTC 2024
...
generated does not contain training or test files, did you run `ilab generate`?
Error: Process completed with exit code 1.
This pull request has merge conflicts that must be resolved before it can be merged. @russellb please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
Status notes before I log off for tonight.
The Good News
Status
The e2e test job runs 3 rounds of data generation: grounded skill (a skill with context), freeform skill (a skill with no context), and knowledge. The two skills steps are exiting cleanly having generated zero data. The training step that follows is consuming the data we've generated for the knowledge addition and passing.
This is a great milestone, but we have some more to go.
Next Top Priority
src/instructlab/sdg/configs/knowledge/simple_generate_qa.yaml
but tweaked a bit. there's no document. The grounded skill one will be similar in that we'll be giving a context blob similar to how it gives a document for knowledge. The freeform skill one will not have context or document.default_flows.py
, in addition to the SimpleKnowledgeFlow
that's there, we need one or two skills flows that are the simple one-stage thingsgenerate_data.py
needs updates to make use of these new flowsOther TODO items
There are still other todo items scattered around generate_data.py
num-instructions
parameter. @oindrillac was looking at that today.Post-merge TODOs for the simple workflow
These are things I think can come after we merge this PR (IMO, open for discussion)
Full workflow via CLI
Most of the code we've done here applies to the full workflow, we just need to instantiate a different pipeline.
instructlab/dev-docs
just to clarify the UX. --pipeline simple
and --pipeline full
might be enough, but there was a LOT of debate over a very similar set of options proposed for training that never got full support. That may be different though because training has so many variables.saving off the "how to test" instructions here before I redo the PR description now that it's almost ready for merge
How to Test
(assumes you have the gh
CLI installed and configured) -- https://cli.github.com/
# Setup a venv to work in
mkdir workspace
cd workspace
python -m venv venv
. venv/bin/activate
# Install the `ilab` CLI from main
git clone https://github.com/instructlab/instructlab
cd instructlab
pip install -e .
cd ..
# Install this PR version of instructlab-sdg
git clone https://github.com/instructlab/sdg
cd sdg
gh co 46
pip install -e .
cd ..
# Configure ilab and put a test taxonomy addition in place
# allow it to check out taxonomy for you
ilab config init
cd taxonomy
git remote add russellb https://github.com/russellb/taxonomy.git
git fetch russellb
git checkout russellb/softball
cd ..
# Download merlinite
ilab download
# Optional - serve the model in another terminal. helpful if you get server side errors and want all the logs
cd workspace
. venv/bin/activate
ilab model serve
# and now you can test sdg
ilab data generate
# Find the results in the generated/ directory
ls generated/
cat generated/generated_*
Most of the CI jobs aren't running right now. GitHub is having problems: https://www.githubstatus.com/
FWIW, +1 from me if we're not aware of it causing significant regressions
FWIW, +1 from me if we're not aware of it causing significant regressions
Thanks! As long as e2e
still passes, I'm good with this going in and then we continue to iterate from here. It won't actually get picked up by anyone else until we tag a library release, anyway.
I'm going to merge this, but won't cut a release just yet. I want to look at knowledge document chunking in particular. I don't think I fully restored the previous behavior on that.
General Approach
The approach taken here is to try to make the new SDG API the only interface used as opposed to keeping the old implementation off to the side. Reasons for this:
To prepare for this, we pulled all SDG code out of the CLI to start this library. The interface point between the CLI and
instructlab.sdg
is thegenerate_data()
method ininstructlab.sdg.generate_data
. That is where you'll see most of the changes centered in this PR.The PR changes the implementation underneath the existing
generate_data()
so we can make things work without further CLI changes.Status Summary
This has been tested with Merlinite both locally and via the
e2e
CI job. The intent here is to replace the existing functionality. Future changes will work on adding more extensive pipeline support for larger systems.Changes
1e928ec Split up the utils module 39c943f Import utils.read_taxonomy() from instructlab d886b12 Add get_model_family() from instructlab.utils 67dd057 Add model_prompt for merlinite/granite d53a57a Add API to enable disabling batching 14d646f llmblock: fix batched=False 0207156 Add simple knowledge pipeline for use with default merlinite 9ee7f70 Use new API in generate_data 54b065a added number of iterations to generate 31ecfda Add skills variants for the full pipeline
commit 1e928ec0eb2ac8d158315d7b7679a5f1f22cf429 Author: Mark McLoughlin markmc@redhat.com Date: Thu Jun 27 07:27:11 2024 -0400
commit 39c943f4c058b503188ec4541d0347696c00b8da Author: Mark McLoughlin markmc@redhat.com Date: Thu Jun 27 06:27:21 2024 -0400
commit d886b12cbbc3cf3168ea2c125ad239b4030af76e Author: Russell Bryant rbryant@redhat.com Date: Tue Jun 25 17:57:24 2024 -0400
commit 67dd0575ec6cd38ae60f7794c5be6e9fb39c37f2 Author: Russell Bryant rbryant@redhat.com Date: Tue Jun 25 20:58:29 2024 -0400
commit d53a57ad0d28d36f02103b2efd37fe3523abbb5a Author: Mark McLoughlin markmc@redhat.com Date: Thu Jun 27 05:55:32 2024 -0400
commit 14d646f7eff4b5eca9545a0433dc5ea11c8a6a3f Author: Russell Bryant rbryant@redhat.com Date: Wed Jun 26 09:23:40 2024 -0400
commit 0207156c87a7893c8466cd0b1832cd76b1d033a8 Author: Russell Bryant rbryant@redhat.com Date: Wed Jun 26 11:17:09 2024 -0400
commit 9ee7f7073c2c70b83a32bd0eb575b2053c23da79 Author: Russell Bryant rbryant@redhat.com Date: Tue Jun 25 16:31:19 2024 -0400
commit 54b065a4cfd8e63d863a059b5f3bb2f9dd726843 Author: Oindrilla Chatterjee oc@bu.edu Date: Fri Jun 28 11:52:12 2024 -0400
commit 31ecfda5dc6cbe95e8248df1c645727ae98c0cee Author: Russell Bryant rbryant@redhat.com Date: Fri Jun 28 16:30:24 2024 -0400