Open RdoubleA opened 1 week ago
Naming looks weird actually
Yeah I'm not sure that this will make sense in all scenarios.
For example, we have OpenAIToMessages
, which includes follows the OpenAI chat completion spec exactly - this includes text+image, image, and tool calling. It wouldn't make sense to have separate ToMessage
converters for this. Same goes for ShareGPT
.
However, we might want to limit our provided ToMessage
to these kinds of formats and do away with our reliance on InputOutputToMessages
altogether - I find it very confusing to reason about.
This is somewhat tangential but the proposal would be:
ChatToMessage
that can map columnsToMessage
converter that can be easily loaded into a dataset builder via configA general ChatToMessage that can map columns
This is what InputOutputToMessages
and ChosenRejectedToMessages
are intended to be (for instruct and preference data). Isn't this basically what we currently have?
I don't think it makes sense to have special ToMessage options for different modalities, this would get out of hand very quickly. Our message format can already support basically any modality, so our included ToMessage function should support any modality we directly support with all modalities being optional. I agree with @joecummings on this one.
So is the proposal to continue as is, adding logic for each modality for all of the generic converters?
Maybe we could separate out modality specific logic as utilities.
So is the proposal to continue as is, adding logic for each modality for all of the generic converters?
Maybe we could separate out modality specific logic as utilities.
Sounds reasonable, both "split" and "save current logic" have similar problems, we need to control some simplicity at this point
Discussion for context: https://github.com/pytorch/torchtune/pull/1820#discussion_r1804488044
Both
ShareGPTToMessages
andInputOutputToMessages
support image data if specified. However, this is leading to increasingly complex logic in the class and could potentially become further bloated once we add support for more modalities. For example, if we added support for audio, we would need to add an audio column to column map, add logic for loading the audio files into the Message, and this compounds the more modalities we support. It's also not immediately clear that these support image data unless a user looks through docstrings carefully.An alternative approach would be to have separate transforms for text-only, text+image, text+other modality, etc:
The drawback of this is that more modalities means more permutations we may have to create message transforms for. Open to other options.
cc @SalmanMohammadi @joecummings @krammnic