instructlab / sdg

Python library for Synthetic Data Generation
Apache License 2.0
7 stars 18 forks source link

Convert generated dataset to messages format #60

Open shivchander opened 4 days ago

shivchander commented 4 days ago

Currently the generated synthetic data has the q/a/context in its own columns, the new training api assumes the datasets are formatted in messages format

Will need a simple util function to run post generation to handle the conversion

Columns to have:


messages: list[dict] - with roles for system, user, assistant
metadata: json str - json string version of a dictionary which acts like a catch-all for all other columns in the dataset
nathan-weinberg commented 4 days ago

Eval is expecting this as well for MMLU! Tracker issue on our side: https://github.com/instructlab/eval/issues/35

russellb commented 4 days ago

the current output is consistent with what the CLI has always done.

we can change it, but we have to coordinate it across all components, including code still in the CLI, that depends on it

shivchander commented 4 days ago

yes I see that it converts to messages - but we still need to handle all other columns so that it doesn't break when we concatenate with other datasets during training.

Added some more details about the schema we need in my original comment

RobotSail commented 4 days ago

The training library currently expects messages, but I believe legacy Linux training and MacOS training doesn't.

oindrillac commented 2 days ago

Where do we see this conversion happening? Would this be in _gen_train_data in generate_data.py? Will we be maintaining 2 formats, the existing one that is for qlora and the additional new format as proposed above for the full train?

aakankshaduggal commented 2 days ago

Yes you are spot on @oindrillac https://github.com/instructlab/sdg/blob/45ecc73ada3d8a06b246f21ebe87b5a07b206654/src/instructlab/sdg/generate_data.py#L80-L98

We would want to have two separate outputs, one in the format the CLI expects for legacy training, and one for the new version that would expect the messages format.

oindrillac commented 2 days ago

cool, and based on whether pipeline == "full" or pipeline == "simple" we can enforce the final output to be in a certain required format?

aakankshaduggal commented 2 days ago

Yes! @RobotSail mentioned that we just need to pass the data_file_path so for full pipeline mode we can mention the messages file instead.

russellb commented 2 days ago

cool, and based on whether pipeline == "full" or pipeline == "simple" we can enforce the final output to be in a certain required format?

or we could just always produce both?

nathan-weinberg commented 15 hours ago

Yes! @RobotSail mentioned that we just need to pass the data_file_path so for full pipeline mode we can mention the messages file instead.

For what's it worth, on the eval side we just need a path to a directory where the messages data is going to be dumped to - similar to the generate directory the CLI dumps data to currently.

oindrillac commented 14 hours ago

It would be in the same directory and we can produce both formats as per @russellb suggestion