instructlab / sdg

Python library for Synthetic Data Generation
Apache License 2.0

Initial CLI integration with new SDG interfaces #46

Closed · russellb closed this 2 days ago

russellb commented 5 days ago

General Approach

The approach taken here is to make the new SDG API the only interface used, as opposed to keeping the old implementation off to the side. Reasons for this:

To prepare for this, we pulled all SDG code out of the CLI to start this library. The interface point between the CLI and instructlab.sdg is the generate_data() method in instructlab.sdg.generate_data. That is where most of the changes in this PR are centered.

The PR changes the implementation underneath the existing generate_data() so we can make things work without further CLI changes.
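
For orientation, here is a rough sketch of that interface point. Only the import path and function name come from this PR; every keyword argument shown is an assumption based on the ilab CLI's options, so treat this as illustrative rather than the actual signature:

import logging

# Hypothetical call into the library from the CLI side. The import path and
# function name are from this PR; the keyword arguments below are assumptions.
from instructlab.sdg.generate_data import generate_data

generate_data(
    logger=logging.getLogger(__name__),
    model_name="models/merlinite-7b-lab-Q4_K_M.gguf",  # the CLI's default quantized merlinite
    taxonomy="taxonomy",          # path to the taxonomy checkout
    taxonomy_base="origin/main",  # base ref used to find new/changed leaf nodes
    output_dir="generated",       # where the generated files are written
)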

Status Summary

This has been tested with Merlinite both locally and via the e2e CI job. The intent here is to replace the existing functionality. Future changes will work on adding more extensive pipeline support for larger systems.

Changes

1e928ec Split up the utils module
39c943f Import utils.read_taxonomy() from instructlab
d886b12 Add get_model_family() from instructlab.utils
67dd057 Add model_prompt for merlinite/granite
d53a57a Add API to enable disabling batching
14d646f llmblock: fix batched=False
0207156 Add simple knowledge pipeline for use with default merlinite
9ee7f70 Use new API in generate_data
54b065a added number of iterations to generate
31ecfda Add skills variants for the full pipeline

commit 1e928ec0eb2ac8d158315d7b7679a5f1f22cf429
Author: Mark McLoughlin <markmc@redhat.com>
Date:   Thu Jun 27 07:27:11 2024 -0400

Split up the utils module

To stop utils.py becoming a dumping ground, split the current code
into json, chunking, and openai sub-modules.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
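
As a hedged illustration of the resulting layout (the package path and the per-module descriptions are assumptions based on the commit message, not taken from the actual tree):

# Assumed layout after the split; only the sub-module names come from the commit.
#
#   src/instructlab/sdg/utils/
#       json.py       # JSON read/write helpers
#       chunking.py   # document chunking helpers
#       openai.py     # OpenAI client helpers
#
# Callers would then import just the piece they need, e.g.:
from instructlab.sdg.utils import chunking, json, openai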

commit 39c943f4c058b503188ec4541d0347696c00b8da
Author: Mark McLoughlin <markmc@redhat.com>
Date:   Thu Jun 27 06:27:21 2024 -0400

Import utils.read_taxonomy() from instructlab

Part of #11

sdg appears to be the main user of this, along with `ilab taxonomy diff`.

We want to adapt the output of read_taxonomy() to be better suited to
what sdg needs.

This is the majority of src/instructlab/utils.py from commit 4737feb,
with read_taxonomy() and TaxonomyReadingException as the public API.

Temporarily disable logging-fstring-interpolation to get lint passing.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
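
A hedged sketch of that public surface; this commit only names read_taxonomy() and TaxonomyReadingException, so the module path and argument list below are assumptions for illustration:

import logging

# Assumed import location; only the two names are stated as public API above.
from instructlab.sdg.utils.taxonomy import TaxonomyReadingException, read_taxonomy

logger = logging.getLogger(__name__)
try:
    # Assumed arguments: read the taxonomy leaf nodes changed relative to a base ref.
    leaf_nodes = read_taxonomy("taxonomy", taxonomy_base="origin/main")
except TaxonomyReadingException as exc:
    logger.error("taxonomy failed validation: %s", exc)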

commit d886b12cbbc3cf3168ea2c125ad239b4030af76e
Author: Russell Bryant <rbryant@redhat.com>
Date:   Tue Jun 25 17:57:24 2024 -0400

Add get_model_family() from instructlab.utils

Signed-off-by: Russell Bryant <rbryant@redhat.com>

commit 67dd0575ec6cd38ae60f7794c5be6e9fb39c37f2
Author: Russell Bryant <rbryant@redhat.com>
Date:   Tue Jun 25 20:58:29 2024 -0400

Add model_prompt for merlinite/granite

The model_family param is used to "force" a particular family,
overriding what might be guessed from the model filename.

Since utils.models is copied and pasted from instructlab,
let's isolate the use of utils.models to the generate_data()
function so that if we move the generate_data() code to instructlab
we can get rid of the copy here.

In its place add MODEL_FAMILY_MIXTRAL/MERLINITE constants to
the API.

Signed-off-by: Russell Bryant <rbryant@redhat.com>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
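
A hedged sketch of the guess-vs-force behaviour described in these two commits. The constant names come from the commit message; the import locations and the exact get_model_family() signature are assumptions:

# Constant names are from the commit message; import locations are assumed.
from instructlab.sdg import MODEL_FAMILY_MERLINITE, MODEL_FAMILY_MIXTRAL
from instructlab.sdg.utils import models  # assumed home of the copied utils.models

# With nothing forced, the family is guessed from the model filename.
guessed = models.get_model_family(None, "models/merlinite-7b-lab-Q4_K_M.gguf")

# Forcing a family overrides the filename-based guess, e.g. to pick the
# mixtral prompt template for an unusually named model file.
forced = models.get_model_family(MODEL_FAMILY_MIXTRAL, "models/some-custom-model.gguf")

Note that the commit deliberately confines utils.models usage to generate_data(), so direct use like this is purely illustrative.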

commit d53a57ad0d28d36f02103b2efd37fe3523abbb5a
Author: Mark McLoughlin <markmc@redhat.com>
Date:   Thu Jun 27 05:55:32 2024 -0400

Add API to enable disabling batching

Add a `batched` constructor param.

llama-cpp is still the CLI default and doesn't support batching. We
don't know in this code what backend is being used, so for now just
turn off batching. We need to come back around to disabling it only
when we know the default llama-cpp backend is in use.

Signed-off-by: Russell Bryant <rbryant@redhat.com>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
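
A generic sketch of the pattern being added, not the library's actual class definition; the point is simply that the caller can turn batching off when it might be talking to llama-cpp:

class _ExampleBlock:
    """Toy stand-in for a block with the new `batched` knob (not the real API)."""

    def __init__(self, client, batched=True):
        self.client = client
        # llama-cpp, the CLI's default backend, can't handle batched requests.
        # Since this layer can't yet tell which backend is serving the model,
        # the CLI path just constructs everything with batched=False for now.
        self.batched = batched

# What the CLI path effectively does today:
block = _ExampleBlock(client=None, batched=False)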

commit 14d646f7eff4b5eca9545a0433dc5ea11c8a6a3f
Author: Russell Bryant <rbryant@redhat.com>
Date:   Wed Jun 26 09:23:40 2024 -0400

llmblock: fix batched=False

_generate() returns a list of samples. The previous code would create
a list of lists when batching was turned off. This restores that case
back to a list of samples.

Signed-off-by: Russell Bryant <rbryant@redhat.com>
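
A toy illustration of the shape problem this fixes (pure stand-in code, not the actual _generate() implementation):

def fake_complete(batch):
    # Stand-in for a model call: one generated sample per prompt in the batch.
    return [f"sample for {p}" for p in batch]

prompts = ["p1", "p2", "p3"]

# Before the fix, the non-batched path collected one list per prompt,
# yielding a list of lists instead of a list of samples.
broken = [fake_complete([p]) for p in prompts]

# The fix flattens the per-prompt results back into a single list of samples,
# matching what the batched path returns.
fixed = [sample for p in prompts for sample in fake_complete([p])]
assert fixed == fake_complete(prompts)
assert broken != fixed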

commit 0207156c87a7893c8466cd0b1832cd76b1d033a8
Author: Russell Bryant <rbryant@redhat.com>
Date:   Wed Jun 26 11:17:09 2024 -0400

Add simple knowledge pipeline for use with default merlinite

The CLI's default model is quantized merlinite, and it does not seem
good enough to follow the instructions in the full pipeline included
in the new library.

It's not doing any validation on the output, so the output is not
going to be great.  Then again, the output has never been great when doing
SDG with merlinite and the old sdg implementation. This at least keeps
the ability to run a basic workflow test and demo on a smaller system.

Signed-off-by: Russell Bryant <rbryant@redhat.com>

commit 9ee7f7073c2c70b83a32bd0eb575b2053c23da79
Author: Russell Bryant <rbryant@redhat.com>
Date:   Tue Jun 25 16:31:19 2024 -0400

Use new API in generate_data

This makes use of the new SDG API under the generate_data()
method used by the CLI. It uses new simple workflows for knowledge and
skills that are intended for basic use with a small model for testing
and demo purposes. The full pipelines provided in the library will
only work in larger environments capable of running Mixtral-8x7b.

There are still various TODOs in the code, but this is enough to
start with. I'm sure we will make enhancements to these basic
workflows that still work for the small environments.

Signed-off-by: Russell Bryant <rbryant@redhat.com>
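
For a sense of what "the new SDG API" looks like in this flow, here is a hedged sketch built from the e2e log further down in this thread. The feature names and the idea of a pipeline running a gen_knowledge block over a datasets.Dataset come from that log; the commented-out constructor is an assumption:

from datasets import Dataset

# Toy seed row using the feature names visible in the e2e log below.
seed = Dataset.from_list([{
    "task_description": "Teach the model about softball",
    "document": "A regulation softball game lasts seven innings...",
    "domain": "sports",
    "question_1": "How many innings are in a regulation softball game?",
    "response_1": "Seven.",
    "question_2": "How many players are on the field per team?",
    "response_2": "Ten, in slowpitch softball.",
    "question_3": "What is the distance between bases?",
    "response_3": "Sixty feet.",
}])

# The simple knowledge pipeline then runs its blocks over this dataset,
# roughly (constructor details assumed, not shown in this PR):
#
#   pipeline = Pipeline(knowledge_block_configs)
#   generated = pipeline.generate(seed)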

commit 54b065a4cfd8e63d863a059b5f3bb2f9dd726843
Author: Oindrilla Chatterjee <oc@bu.edu>
Date:   Fri Jun 28 11:52:12 2024 -0400

added number of iterations to generate

Signed-off-by: Oindrilla Chatterjee <oc@bu.edu>

commit 31ecfda5dc6cbe95e8248df1c645727ae98c0cee
Author: Russell Bryant <rbryant@redhat.com>
Date:   Fri Jun 28 16:30:24 2024 -0400

Add skills variants for the full pipeline

Add the last little bit needed to choose the right "full" pipeline for
skills.

I also renamed "profile" to "pipeline" to better reflect what is being
selected here. The term "profile" is a bit overloaded from lots of
past CLI UX discussion, so it's better not to use that here.

Signed-off-by: Russell Bryant <rbryant@redhat.com>
markmc commented 4 days ago

I'm going to try to help this along

Goal: to have ilab data generate use the new SDG API (Flows, Pipelines, Blocks, etc.) and support "full SDG"

(Re - "full SDG" - I'm a bit unclear how we define that precisely)

Constraint: avoid causing any regression in "simple SDG" (aka "small model support")

Step #1 - re-implement instructlab.sdg.generate_data.generate_data() using the new API - i.e. model the existing "simple SDG" using the new API, and then re-write generate_data() to use it

Step #2 - replace the call to instructlab.sdg.generate_data.generate_data() in instructlab/data/generate.py with calls to the new API - i.e. move the relatively simple code in generate_data() back into instructlab

Step #3 - add "full SDG" support in ilab data generate - e.g. by using the more advanced flows available via the new API


This PR is step #1

We can merge this PR once we're confident it is in good enough shape that it doesn't regress the existing ilab data generate (unless there's a particular regression we decide we're ok with)


I've been doing some hacking to get up to speed: https://github.com/russellb/sdg/compare/new-cli-integration...markmc:sdg:new-cli-integration?expand=1

Not being deeply familiar with the existing "simple SDG" flow though, I'm going to be slow in getting the new implementation up to parity. Happy to have help!

markmc commented 4 days ago

Next steps AIUI:

  1. verify generate_data() is producing output in the same format as before
  2. create and incorporate workflows for freeform and grounded skills
  3. audit the CLI options to help identify any functional regressions
  4. get e2e passing
markmc commented 4 days ago

e2e test failure is:

Generate synthetic data - Wed Jun 26 18:32:00 UTC 2024
...
Generating synthetic data using 'models/merlinite-7b-lab-Q4_K_M.gguf' model, taxonomy:'taxonomy' against http://127.0.0.1:8000/v1 server
INFO 2024-06-26 18:32:13,652 generate_data.py:222: generate_data Synthesizing new instructions. If you aren't satisfied with the generated instructions, interrupt training (Ctrl-C) and try adjusting your YAML files. Adding more examples may help.
INFO 2024-06-26 18:32:13,670 pipeline.py:48: generate Running block: gen_knowledge
INFO 2024-06-26 18:32:13,670 pipeline.py:49: generate Dataset({
    features: ['task_description', 'document', 'domain', 'question_1', 'response_1', 'question_2', 'response_2', 'question_3', 'response_3'],
    num_rows: 6
})
INFO 2024-06-26 18:32:56,378 generate_data.py:276: generate_data Generated 5 samples
INFO 2024-06-26 18:32:56,379 generate_data.py:282: generate_data Generation took 49.68s
...
Train the model - Wed Jun 26 18:32:57 UTC 2024
...
generated does not contain training or test files, did you run `ilab generate`?
Error: Process completed with exit code 1.
mergify[bot] commented 3 days ago

This pull request has merge conflicts that must be resolved before it can be merged. @russellb please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

russellb commented 3 days ago

Status notes before I log off for tonight.

The Good News

Status

The e2e test job runs 3 rounds of data generation: grounded skill (a skill with context), freeform skill (a skill with no context), and knowledge. The two skills steps exit cleanly but generate zero data. The training step that follows consumes the data we've generated for the knowledge addition and passes.

This is a great milestone, but we have some more to go.

Next Top Priority

Other TODO items

There are still other todo items scattered around generate_data.py

Post-merge TODOs for the simple workflow

These are things I think can come after we merge this PR (IMO, open for discussion)

Full workflow via CLI

Most of the code we've done here applies to the full workflow; we just need to instantiate a different pipeline, as in the sketch below.
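
As a hedged sketch of what that could look like (the "simple"/"full" split comes from this thread and the "profile" to "pipeline" rename above; the selection helper itself is hypothetical):

def choose_pipeline(can_serve_mixtral: bool) -> str:
    # Hypothetical helper: the full pipelines need a Mixtral-8x7B-class server,
    # while the simple pipelines keep the quantized-merlinite workflow working.
    return "full" if can_serve_mixtral else "simple"

# A small local setup would end up on the simple pipeline; a larger deployment
# would pass the resulting name through to the pipeline selection in generate_data().
print(choose_pipeline(can_serve_mixtral=False))  # -> "simple"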

russellb commented 2 days ago

saving off the "how to test" instructions here before I redo the PR description now that it's almost ready for merge

How to Test

(assumes you have the gh CLI installed and configured) -- https://cli.github.com/


# Setup a venv to work in
mkdir workspace
cd workspace
python -m venv venv
. venv/bin/activate

# Install the `ilab` CLI from main
git clone https://github.com/instructlab/instructlab
cd instructlab
pip install -e .
cd ..

# Install this PR version of instructlab-sdg
git clone https://github.com/instructlab/sdg
cd sdg
gh co 46
pip install -e .
cd ..

# Configure ilab and put a test taxonomy addition in place
# allow it to check out taxonomy for you
ilab config init
cd taxonomy
git remote add russellb https://github.com/russellb/taxonomy.git
git fetch russellb
git checkout russellb/softball
cd ..

# Download merlinite
ilab download

# Optional - serve the model in another terminal.  helpful if you get server side errors and want all the logs
cd workspace
. venv/bin/activate
ilab model serve

# and now you can test sdg
ilab data generate

# Find the results in the generated/ directory
ls generated/
cat generated/generated_*
russellb commented 2 days ago

Most of the CI jobs aren't running right now. GitHub is having problems: https://www.githubstatus.com/

markmc commented 2 days ago

FWIW, +1 from me if we're not aware of it causing significant regressions

russellb commented 2 days ago

> FWIW, +1 from me if we're not aware of it causing significant regressions

Thanks! As long as e2e still passes, I'm good with this going in and then we continue to iterate from here. It won't actually get picked up by anyone else until we tag a library release, anyway.

russellb commented 2 days ago

I'm going to merge this, but won't cut a release just yet. I want to look at knowledge document chunking in particular. I don't think I fully restored the previous behavior on that.