Open lewtun opened 1 day ago
Would it make sense to include a mapping within TRL that detects the ShareGPT format and maps it to the OpenAI spec?
I think so.
For context, trainers are expected to support conversational datasets, see #2071.
We can support ShareGPT at several levels:
I'm more aligned with option 2.
We could add the following line to our scripts:
```python
dataset = dataset.map(maybe_convert_to_sharegpt, remove_columns=dataset.column_names)
```
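A minimal sketch of what such a helper could look like, assuming ShareGPT records use the common `from`/`value` turn layout. The function name comes from the line above; the detection and mapping logic here is an assumption, not TRL's final implementation:

```python
# Hypothetical sketch: detect a ShareGPT-style record and convert it to
# the OpenAI messages format, passing other records through untouched.
SHAREGPT_TO_OPENAI_ROLE = {"system": "system", "human": "user", "gpt": "assistant"}

def maybe_convert_to_sharegpt(example):
    # Only records with a "conversations" field are treated as ShareGPT.
    if "conversations" not in example:
        return example
    messages = [
        {"role": SHAREGPT_TO_OPENAI_ROLE[turn["from"]], "content": turn["value"]}
        for turn in example["conversations"]
    ]
    return {"messages": messages}
```

Because the function is a no-op on non-ShareGPT records, it can be applied unconditionally in `dataset.map` without special-casing each dataset.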
I like option 2 as well - it simplifies maintenance of the core trainer logic. I'll implement something for the example scripts.
Feature request
Many TRL trainers support the OpenAI spec for conversational datasets, where we have roles like `system`, `user`, and `assistant` in a list of messages as follows:

However, many Hub datasets use the ShareGPT format, where the list of messages is stored in a `conversations` field and includes the following roles:

- `system`
- `human` (same as `user` in the OpenAI spec)
- `gpt` (same as `assistant` in the OpenAI spec)

Here's an example:
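A representative pair of records, one in each format (illustrative values, not taken from a specific Hub dataset):

```python
# OpenAI-spec record: a "messages" list with role/content keys
# and system/user/assistant roles.
openai_example = {
    "messages": [
        {"role": "system", "content": "You are helpful."},
        {"role": "user", "content": "What color is the sky?"},
        {"role": "assistant", "content": "It is blue."},
    ]
}

# ShareGPT record: a "conversations" list with from/value keys
# and system/human/gpt roles.
sharegpt_example = {
    "conversations": [
        {"from": "system", "value": "You are helpful."},
        {"from": "human", "value": "What color is the sky?"},
        {"from": "gpt", "value": "It is blue."},
    ]
}
```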
Would it make sense to include a mapping within TRL that detects the ShareGPT format and maps it to the OpenAI spec?
Motivation
Currently, TRL users need to manually format datasets like this into the OpenAI spec, using logic like the following:
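A sketch of the kind of per-example reformatting this typically involves, assuming the `from`/`value` ShareGPT turn layout (the function name and role mapping here are illustrative):

```python
# Hypothetical user-side conversion from ShareGPT to the OpenAI spec,
# typically applied via dataset.map(format_sharegpt, remove_columns="conversations").
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}

def format_sharegpt(example):
    # Rename the field and keys, and translate the role names.
    return {
        "messages": [
            {"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
            for turn in example["conversations"]
        ]
    }
```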
Although not a big deal, it is a bit annoying and limits the ability to mix and match datasets via the CLI. It would be nice if this could work by default.
Your contribution
Happy to open a PR, but want to first gauge if we think this is sufficiently useful vs people just rolling their own scripts.