Closed shivchander closed 4 months ago
Eval is expecting this as well for MMLU! Tracker issue on our side: https://github.com/instructlab/eval/issues/35
the current output is consistent with what the CLI has always done.
we can change it, but we have to coordinate it across all components, including code still in the CLI, that depends on it
yes I see that it converts to messages - but we still need to handle all other columns so that it doesn't break when we concatenate with other datasets during training.
Added some more details about the schema we need in my original comment
The training library currently expects messages, but I believe legacy Linux training and MacOS training doesn't.
Where do we see this conversion happening? Would this be in _gen_train_data in generate_data.py? Will we be maintaining 2 formats, the existing one that is for qlora and the additional new format as proposed above for the full train?
Yes you are spot on @oindrillac https://github.com/instructlab/sdg/blob/45ecc73ada3d8a06b246f21ebe87b5a07b206654/src/instructlab/sdg/generate_data.py#L80-L98
We would want to have two separate outputs, one in the format the CLI expects for legacy training, and one for the new version that would expect the messages format.
cool, and based on whether pipeline == "full"
or pipeline == "simple"
we can enforce the final output to be in a certain required format?
Yes! @RobotSail mentioned that we just need to pass the data_file_path so for full pipeline mode we can mention the messages file instead.
cool, and based on whether
pipeline == "full"
orpipeline == "simple"
we can enforce the final output to be in a certain required format?
or we could just always produce both?
Yes! @RobotSail mentioned that we just need to pass the data_file_path so for full pipeline mode we can mention the messages file instead.
For what's it worth, on the eval
side we just need a path to a directory where the messages data is going to be dumped to - similar to the generate
directory the CLI dumps data to currently.
It would be in the same directory and we can produce both formats as per @russellb suggestion
Currently the generated synthetic data has the q/a/context in its own columns, the new training api assumes the datasets are formatted in messages format
Will need a simple util function to run post generation to handle the conversion
Columns to have: