Update E2E to expect different file formats for qlora train and full train

oindrillac commented 2 months ago

We are generating these 3 files

1. generated_Mixtral-8x7B-Instruct-v0_2024-07-08T18_23_10.json
2. train_Mixtral-8x7B-Instruct-v0_2024-07-08T18_23_10.jsonl
3. test_Mixtral-8x7B-Instruct-v0_2024-07-08T18_23_10.jsonl

See https://github.com/instructlab/sdg/pull/94

The e2e fails since it expects these keys "system", "user", "assistant" in the train data for the full train ie. dataset no. 2 whereas it should be looking for {"messages": [{"content": "", "role": "user"}, {"content": "", "role": "assistant"}], "metadata": ""}

The e2e needs to be updated to validate different formats for both 1 and 2

The SDG currently generates

Training dataset in a format needed for qlora train generated_Mixtral-8x7B-Instruct-v0_2024-07-08T18_23_10.json
```
"system":, "user":, "assistant": assistant
```
Generated dataset in the messages format needed for full train train_Mixtral-8x7B-Instruct-v0_2024-07-08T18_23_10.jsonl {"messages": [{"content": "", "role": "user"}, {"content": "", "role": "assistant"}], "metadata": ""}
Testing dataset used for testing old/new model behavior with the original questions test_Mixtral-8x7B-Instruct-v0_2024-07-08T18_23_10.jsonl
```
"system":, "user":, "assistant": ,
```

russellb commented 2 months ago

The error isn't from e2e exactly. It's the behavior of ilab model train --legacy. That's the only training we support in CI (new training lib is still a work-in-progress). Either way, we have to keep all of the training methods functional.

The challenging thing here is depending on which training implementation is in use, it expects a different format, but we can't fix one and break the other.

oindrillac commented 2 months ago

But the full train will need a different format right? Is there a way to remove this check or add a check for both the formats based on the filename?

russellb commented 2 months ago

I think we've resolved this as of #94 and #115

instructlab / sdg

Update E2E to expect different file formats for qlora train and full train #107