huggingface / trl

Train transformer language models with reinforcement learning.
http://hf.co/docs/trl
Apache License 2.0
9.32k stars 1.17k forks

Add support for ShareGPT-formatted datasets #2083

Open lewtun opened 1 day ago

lewtun commented 1 day ago

Feature request

Many TRL trainers support the OpenAI spec for conversational datasets, where we have roles like system, user, and assistant in a list of messages as follows:

messages = [
    {"role": "system", "content": "You are AGI"},
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "What is my purpose?"},
]

However, many Hub datasets use the ShareGPT format, where the list of messages is stored in a conversations field and uses the roles system, human, and gpt instead of the OpenAI ones.

Here's an example:

conversations = [
    {"from": "system", "value": "You are AGI"},
    {"from": "human", "value": "Hello"},
    {"from": "gpt", "value": "What is my purpose?"},
]

Would it make sense to include a mapping within TRL that detects the ShareGPT format and maps it to the OpenAI spec?
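Detection could be as simple as checking the shape of the examples. Here is a minimal sketch of such a heuristic; the function name and the exact check are illustrative assumptions, not an existing TRL API:

```python
def is_sharegpt(example: dict) -> bool:
    """Heuristically detect the ShareGPT format: a 'conversations' list of
    {'from': ..., 'value': ...} dicts (names assumed for illustration)."""
    conversations = example.get("conversations")
    return (
        isinstance(conversations, list)
        and len(conversations) > 0
        and all(
            isinstance(msg, dict) and msg.keys() >= {"from", "value"}
            for msg in conversations
        )
    )
```

A check like this could run once on the first example of a dataset to decide whether the role mapping needs to be applied.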

Motivation

Currently, TRL users need to manually format datasets like this into the OpenAI spec, using logic like the following:

from typing import Optional

from datasets import load_dataset

sharegpt_role_mapping = {"system": "system", "human": "user", "gpt": "assistant"}

def create_messages(
    x,
    system_column: Optional[str] = None,
    prompt_column: Optional[str] = None,
    completion_column: Optional[str] = None,
    share_gpt_column: Optional[str] = None,
):
    """Create messages in H4 format"""
    if prompt_column is not None and completion_column is not None:
        x["messages"] = []
        if system_column is not None:
            x["messages"].append({"role": "system", "content": x[system_column]})
        x["messages"].extend(
            [{"role": "user", "content": x[prompt_column]}, {"role": "assistant", "content": x[completion_column]}]
        )
    elif share_gpt_column is not None:
        x["messages"] = []
        for msg in x[share_gpt_column]:
            x["messages"].append({"role": sharegpt_role_mapping[msg["from"]], "content": msg["value"]})
    # No need to format messages if they are already in the right format
    elif "messages" in x:
        return x
    else:
        raise ValueError("Dataset does not have the expected columns.")
    return x

ds = load_dataset(script_args.dataset_name)

ds = ds.map(
    create_messages,
    fn_kwargs={
        "system_column": script_args.system_column,
        "prompt_column": script_args.prompt_column,
        "completion_column": script_args.completion_column,
        "share_gpt_column": script_args.sharegpt_column,
    },
    num_proc=script_args.num_proc,
)

Although not a big deal, it is a bit annoying and limits the ability to mix and match datasets via the CLI. It would be nice if this worked by default.

Your contribution

Happy to open a PR, but want to first gauge if we think this is sufficiently useful vs people just rolling their own scripts.

qgallouedec commented 1 day ago

Would it make sense to include a mapping within TRL that detects the ShareGPT format

I think so.

For context, trainers are expected to support conversational datasets; see #2071.

We can support ShareGPT at several levels:

  1. Trainers: trainers would accept both the OpenAI spec and ShareGPT.
  2. Scripts: the example scripts would convert the dataset into the OpenAI spec format if needed. TRL would provide a util function to convert to the OAI format before passing the dataset to the trainer.

I'm more aligned with option 2.

We could add the following line to our scripts:

dataset = dataset.map(maybe_convert_to_sharegpt, remove_columns=dataset.column_names)
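A minimal sketch of what such a util could look like, reusing the name proposed above; the body here is an assumption, not TRL's actual implementation. It converts ShareGPT examples to the OpenAI spec and passes through examples that are already converted, so it is safe to apply unconditionally in a dataset.map call:

```python
# Assumed role mapping, mirroring the one from the issue description.
SHAREGPT_TO_OPENAI_ROLE = {"system": "system", "human": "user", "gpt": "assistant"}

def maybe_convert_to_sharegpt(example: dict) -> dict:
    """If the example is in ShareGPT format, convert it to the OpenAI spec;
    otherwise return it unchanged (sketch, not TRL's implementation)."""
    if "conversations" not in example:
        # Already in OpenAI spec (or another format): no-op.
        return example
    return {
        "messages": [
            {"role": SHAREGPT_TO_OPENAI_ROLE[msg["from"]], "content": msg["value"]}
            for msg in example["conversations"]
        ]
    }
```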

lewtun commented 1 day ago

I like option 2 as well - it simplifies maintenance of the core trainer logic. I'll implement something for the example scripts.