instructlab / sdg

Python library for Synthetic Data Generation
Apache License 2.0
14 stars 30 forks source link

Update E2E to expect different file formats for qlora train and full train #107

Closed oindrillac closed 2 months ago

oindrillac commented 2 months ago

We are generating these 3 files

1. generated_Mixtral-8x7B-Instruct-v0_2024-07-08T18_23_10.json
2. train_Mixtral-8x7B-Instruct-v0_2024-07-08T18_23_10.jsonl
3. test_Mixtral-8x7B-Instruct-v0_2024-07-08T18_23_10.jsonl

See https://github.com/instructlab/sdg/pull/94

The e2e fails since it expects these keys "system", "user", "assistant" in the train data for the full train ie. dataset no. 2 whereas it should be looking for {"messages": [{"content": "", "role": "user"}, {"content": "", "role": "assistant"}], "metadata": ""}

The e2e needs to be updated to validate different formats for both 1 and 2

The SDG currently generates

  1. Training dataset in a format needed for qlora train generated_Mixtral-8x7B-Instruct-v0_2024-07-08T18_23_10.json

    "system":, "user":, "assistant": assistant
  2. Generated dataset in the messages format needed for full train train_Mixtral-8x7B-Instruct-v0_2024-07-08T18_23_10.jsonl {"messages": [{"content": "", "role": "user"}, {"content": "", "role": "assistant"}], "metadata": ""}

  3. Testing dataset used for testing old/new model behavior with the original questions test_Mixtral-8x7B-Instruct-v0_2024-07-08T18_23_10.jsonl

    "system":, "user":, "assistant": ,
russellb commented 2 months ago

The error isn't from e2e exactly. It's the behavior of ilab model train --legacy. That's the only training we support in CI (new training lib is still a work-in-progress). Either way, we have to keep all of the training methods functional.

The challenging thing here is depending on which training implementation is in use, it expects a different format, but we can't fix one and break the other.

oindrillac commented 2 months ago

But the full train will need a different format right? Is there a way to remove this check or add a check for both the formats based on the filename?

russellb commented 2 months ago

I think we've resolved this as of #94 and #115