Separate `ToMessages` transforms for text+image

pytorch / torchtune

PyTorch native finetuning library

https://pytorch.org/torchtune/main/

BSD 3-Clause "New" or "Revised" License

4.21k stars 411 forks source link

Separate `ToMessages` transforms for text+image #1862

Open RdoubleA opened 1 week ago

RdoubleA commented 1 week ago

Discussion for context: https://github.com/pytorch/torchtune/pull/1820#discussion_r1804488044

Both ShareGPTToMessages and InputOutputToMessages support image data if specified. However, this is leading to increasingly complex logic in the class and could potentially become further bloated once we add support for more modalities. For example, if we added support for audio, we would need to add an audio column to column map, add logic for loading the audio files into the Message, and this compounds the more modalities we support. It's also not immediately clear that these support image data unless a user looks through docstrings carefully.

An alternative approach would be to have separate transforms for text-only, text+image, text+other modality, etc:

ShareGPTToMessages, ShareGPTImageToMessages
InputOutputToMessages, InputOutputImageToMessages

The drawback of this is that more modalities means more permutations we may have to create message transforms for. Open to other options.

cc @SalmanMohammadi @joecummings @krammnic

krammnic commented 1 week ago

Naming looks weird actually

joecummings commented 1 week ago

Yeah I'm not sure that this will make sense in all scenarios.

For example, we have OpenAIToMessages, which includes follows the OpenAI chat completion spec exactly - this includes text+image, image, and tool calling. It wouldn't make sense to have separate ToMessage converters for this. Same goes for ShareGPT.

However, we might want to limit our provided ToMessage to these kinds of formats and do away with our reliance on InputOutputToMessages altogether - I find it very confusing to reason about.

This is somewhat tangential but the proposal would be:

Converters from common formats (openai, sharegpt, etc.)
A general ChatToMessage that can map columns
Clear guides on how to implement your own ToMessage converter that can be easily loaded into a dataset builder via config

RdoubleA commented 1 week ago

A general ChatToMessage that can map columns

This is what InputOutputToMessages and ChosenRejectedToMessages are intended to be (for instruct and preference data). Isn't this basically what we currently have?

pbontrager commented 1 week ago

I don't think it makes sense to have special ToMessage options for different modalities, this would get out of hand very quickly. Our message format can already support basically any modality, so our included ToMessage function should support any modality we directly support with all modalities being optional. I agree with @joecummings on this one.

RdoubleA commented 1 week ago

So is the proposal to continue as is, adding logic for each modality for all of the generic converters?

Maybe we could separate out modality specific logic as utilities.

krammnic commented 6 days ago

So is the proposal to continue as is, adding logic for each modality for all of the generic converters?

Maybe we could separate out modality specific logic as utilities.

Sounds reasonable, both "split" and "save current logic" have similar problems, we need to control some simplicity at this point