pytorch / torchtune

PyTorch native finetuning library
https://pytorch.org/torchtune/main/
BSD 3-Clause "New" or "Revised" License
4.24k stars 417 forks source link

Create multimodal instruction dataset builder #1704

Open RdoubleA opened 1 month ago

RdoubleA commented 1 month ago

We current have multimodal_chat_dataset which is great for conversations on an image, but many VQA datasets are structured more like instructions where there is a question column, answer column, and image column (see VQA datasets on HF, sort by most downloads). We should add a multimodal_instruct_dataset builder to support these types of datasets from the configs.

krammnic commented 2 weeks ago

So, it looks like that we can check if the data is multimodal in InputOutToMessages and then choose column mapping according to result of this check. Then simply build messages like done in basic case adding to this something like sample[self._column_map["image"]]. Finally, we return messages and in multimodal_instruct_dataset just build SFTDataset with this transform. Probably will open PR on it.