instructlab / sdg

Python library for Synthetic Data Generation
https://pypi.org/project/instructlab-sdg/
Apache License 2.0
21 stars 34 forks source link

Convert generated dataset to messages format #60

Closed shivchander closed 4 months ago

shivchander commented 4 months ago

Currently the generated synthetic data has the q/a/context in its own columns, the new training api assumes the datasets are formatted in messages format

Will need a simple util function to run post generation to handle the conversion

Columns to have:


messages: list[dict] - with roles for system, user, assistant
metadata: json str - json string version of a dictionary which acts like a catch-all for all other columns in the dataset
nathan-weinberg commented 4 months ago

Eval is expecting this as well for MMLU! Tracker issue on our side: https://github.com/instructlab/eval/issues/35

russellb commented 4 months ago

the current output is consistent with what the CLI has always done.

we can change it, but we have to coordinate it across all components, including code still in the CLI, that depends on it

shivchander commented 4 months ago

yes I see that it converts to messages - but we still need to handle all other columns so that it doesn't break when we concatenate with other datasets during training.

Added some more details about the schema we need in my original comment

RobotSail commented 4 months ago

The training library currently expects messages, but I believe legacy Linux training and MacOS training doesn't.

oindrillac commented 4 months ago

Where do we see this conversion happening? Would this be in _gen_train_data in generate_data.py? Will we be maintaining 2 formats, the existing one that is for qlora and the additional new format as proposed above for the full train?

aakankshaduggal commented 4 months ago

Yes you are spot on @oindrillac https://github.com/instructlab/sdg/blob/45ecc73ada3d8a06b246f21ebe87b5a07b206654/src/instructlab/sdg/generate_data.py#L80-L98

We would want to have two separate outputs, one in the format the CLI expects for legacy training, and one for the new version that would expect the messages format.

oindrillac commented 4 months ago

cool, and based on whether pipeline == "full" or pipeline == "simple" we can enforce the final output to be in a certain required format?

aakankshaduggal commented 4 months ago

Yes! @RobotSail mentioned that we just need to pass the data_file_path so for full pipeline mode we can mention the messages file instead.

russellb commented 4 months ago

cool, and based on whether pipeline == "full" or pipeline == "simple" we can enforce the final output to be in a certain required format?

or we could just always produce both?

nathan-weinberg commented 4 months ago

Yes! @RobotSail mentioned that we just need to pass the data_file_path so for full pipeline mode we can mention the messages file instead.

For what's it worth, on the eval side we just need a path to a directory where the messages data is going to be dumped to - similar to the generate directory the CLI dumps data to currently.

oindrillac commented 4 months ago

It would be in the same directory and we can produce both formats as per @russellb suggestion