Open RdoubleA opened 1 month ago
So, it looks like we can check whether the data is multimodal in `InputOutputToMessages` and then choose the column mapping according to the result of that check. Then we build messages as in the basic case, adding something like `sample[self._column_map["image"]]`. Finally, we return the messages, and in `multimodal_instruct_dataset` we just build an `SFTDataset` with this transform. I will probably open a PR for it.
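The approach described above could be sketched roughly as follows. This is a hypothetical, self-contained illustration, not the actual torchtune implementation: the class name `InputOutputToMessagesSketch`, the default column names, and the message dict layout are all assumptions made for the example.

```python
# Hypothetical sketch of an instruct-style message transform that also
# handles an optional image column. Not the real torchtune code.

class InputOutputToMessagesSketch:
    """Map {"input": ..., "output": ..., "image"?: ...} samples to messages."""

    def __init__(self, column_map=None):
        # Default column names; "image" is only used if present in the sample.
        self._column_map = {"input": "input", "output": "output", "image": "image"}
        if column_map:
            self._column_map.update(column_map)

    def __call__(self, sample):
        # Basic case: text-only user content.
        user_content = [
            {"type": "text", "content": sample[self._column_map["input"]]}
        ]
        # Multimodal check: if the mapped image column exists in this sample,
        # prepend the image to the user content.
        image_col = self._column_map["image"]
        if image_col in sample:
            user_content.insert(0, {"type": "image", "content": sample[image_col]})
        return [
            {"role": "user", "content": user_content},
            {
                "role": "assistant",
                "content": [
                    {"type": "text", "content": sample[self._column_map["output"]]}
                ],
            },
        ]


# Usage: a VQA-style sample with question/answer/image columns.
transform = InputOutputToMessagesSketch(
    column_map={"input": "question", "output": "answer"}
)
messages = transform(
    {"question": "What is shown?", "answer": "A cat", "image": "<pil image>"}
)
```

The dataset builder would then only need to pass an instance of this transform as the `message_transform` when constructing the `SFTDataset`.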
We currently have `multimodal_chat_dataset`, which is great for conversations about an image, but many VQA datasets are structured more like instructions, with a question column, an answer column, and an image column (see the VQA datasets on HF, sorted by most downloads). We should add a `multimodal_instruct_dataset` builder to support these types of datasets from the configs, along with image support in `InputOutputToMessages`.
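For illustration, a config entry for such a builder might look like the following. This is a hypothetical sketch modeled on torchtune's existing dataset config style; the builder path, `column_map` handling, and the example dataset name are assumptions, not an existing API.

```yaml
# Hypothetical config for the proposed builder (names are illustrative).
dataset:
  _component_: torchtune.datasets.multimodal.multimodal_instruct_dataset
  source: some_hf_org/some_vqa_dataset
  column_map:
    input: question
    output: answer
    image: image
```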