Closed by tsdocode 7 months ago
Hey! In theory, if you have a dataset, it should be straightforward: just follow the "Training" section of the README. You can reuse the vision_clip modality, so the only thing you'd need to implement is a dataset with QA over multiple images, for example:
{
  "id": "arbitrary-id-123",
  "images": ["/path/to/image.png", "/path/to/image.png"],
  "messages": [{"role": "user", "content": "What is the difference between <image> and <image>"}, {"role": "assistant", "content": "They have different colors."}]
}
See https://github.com/sshh12/multi_token/blob/main/scripts/llava_build_finetune_dataset.py for a script that gets a dataset into the correct format to match the example above.
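For anyone building these records by hand, here is a minimal sketch of how one multi-image example could be assembled in Python. The helper name `make_example` and the file paths are hypothetical, and the check that the number of `<image>` placeholders equals the number of image paths is an assumption based on the example above, not something confirmed by the repo.

```python
import json

# Placeholder string as it appears in the example record above.
IMAGE_TOKEN = "<image>"


def make_example(example_id, image_paths, question, answer):
    """Build one record shaped like the example above (hypothetical helper)."""
    # Assumption: each <image> placeholder in the question maps to one image path.
    n_placeholders = question.count(IMAGE_TOKEN)
    if n_placeholders != len(image_paths):
        raise ValueError(
            f"{n_placeholders} placeholders but {len(image_paths)} images"
        )
    return {
        "id": example_id,
        "images": list(image_paths),
        "messages": [
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ],
    }


# Hypothetical paths, used only for illustration.
example = make_example(
    "arbitrary-id-123",
    ["/path/to/image_a.png", "/path/to/image_b.png"],
    "What is the difference between <image> and <image>",
    "They have different colors.",
)
print(json.dumps(example, indent=2))
```

Serializing with `json.dumps` also avoids hand-written mistakes such as trailing commas, which strict JSON parsers reject.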
Thank you, I will try it this way 🚀
Hello @sshh12,
I wanted to express my gratitude for your incredible work! I've been searching around and couldn't find any model that supports multiple image inference, but I came across your idea in the suggestion section. If I want to begin working on it based on your work, where should I start? Do you have any suggestions?
Thank you!