sshh12 / multi_token

Embed arbitrary modalities (images, audio, documents, etc) into large language models.

Multiple Image QA Model #4

Closed: tsdocode closed this 7 months ago

tsdocode commented 7 months ago

Hello @sshh12,

I wanted to express my gratitude for your incredible work! I've been searching around and couldn't find any model that supports multiple image inference, but I came across your idea in the suggestion section. If I want to begin working on it based on your work, where should I start? Do you have any suggestions?

Thank you!

sshh12 commented 7 months ago

Hey! In theory, if you have a dataset, this should be straightforward: just follow the "Training" section of the README. You can re-use the vision_clip modality, so the only thing you'd need to implement is a dataset with QA over multiple images. Each example would look like this:

```json
{
    "id": "arbitrary-id-123",
    "images": ["/path/to/image1.png", "/path/to/image2.png"],
    "messages": [
        {"role": "user", "content": "What is the difference between <image> and <image>?"},
        {"role": "assistant", "content": "They have different colors."}
    ]
}
```
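
Here is a minimal sketch of generating such a dataset in Python. It assumes the training scripts accept a HuggingFace `datasets` dataset saved with `save_to_disk`; that storage format, and all paths below, are assumptions to verify against the README's "Training" section:

```python
# Sketch: build a multi-image QA dataset matching the record format above.
# ASSUMPTION: the training pipeline loads a HuggingFace dataset saved via
# save_to_disk(); check the repo's README before relying on this.
from datasets import Dataset

examples = [
    {
        "id": "example-0",
        "images": ["/data/img_a.png", "/data/img_b.png"],  # hypothetical paths
        "messages": [
            {"role": "user",
             "content": "What is the difference between <image> and <image>?"},
            {"role": "assistant",
             "content": "They have different colors."},
        ],
    },
]

Dataset.from_list(examples).save_to_disk("/data/multi_image_qa")
```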
sshh12 commented 7 months ago

You can see https://github.com/sshh12/multi_token/blob/main/scripts/llava_build_finetune_dataset.py for an example of getting a dataset into the format shown above.
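
One detail worth checking when writing your own converter: the format above implies one `<image>` placeholder per entry in `images`, so a quick pass over the generated records can catch mismatches early. A small illustrative helper (the function name and the one-placeholder-per-image assumption are mine, not from the repo):

```python
# Illustrative sanity check: each record should contain exactly as many
# <image> placeholders across its messages as it has image paths.
def check_record(record: dict) -> None:
    n_tokens = sum(msg["content"].count("<image>") for msg in record["messages"])
    n_images = len(record["images"])
    assert n_tokens == n_images, (
        f"{record['id']}: {n_tokens} <image> tokens vs {n_images} images"
    )
```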

tsdocode commented 7 months ago

Thank you, I will try it this way 🚀