Closed oindrillac closed 2 months ago
The error isn't from e2e
exactly. It's the behavior of ilab model train --legacy
. That's the only training we support in CI (new training lib is still a work-in-progress). Either way, we have to keep all of the training methods functional.
The challenging thing here is depending on which training implementation is in use, it expects a different format, but we can't fix one and break the other.
But the full train will need a different format right? Is there a way to remove this check or add a check for both the formats based on the filename?
I think we've resolved this as of #94 and #115
We are generating these 3 files
See https://github.com/instructlab/sdg/pull/94
The e2e fails since it expects these keys
"system", "user", "assistant"
in the train data for the full train ie. dataset no. 2 whereas it should be looking for{"messages": [{"content": "", "role": "user"}, {"content": "", "role": "assistant"}], "metadata": ""}
The e2e needs to be updated to validate different formats for both 1 and 2
The SDG currently generates
Training dataset in a format needed for qlora train
generated_Mixtral-8x7B-Instruct-v0_2024-07-08T18_23_10.json
Generated dataset in the messages format needed for full train
train_Mixtral-8x7B-Instruct-v0_2024-07-08T18_23_10.jsonl
{"messages": [{"content": "", "role": "user"}, {"content": "", "role": "assistant"}], "metadata": ""}
Testing dataset used for testing old/new model behavior with the original questions
test_Mixtral-8x7B-Instruct-v0_2024-07-08T18_23_10.jsonl