[COLLABORATIVE] List of vision-enabled models

Let's brainstorm models we could integrate into our benchmark, starting with models that accept image input (with or without accompanying text). I'll go first, but you can suggest any model by commenting below:

Gemini Vision (available for free at 60 requests per minute)
llava-13b (Replicate)
minigpt-4 (Replicate)
GPT-4V (costs a few dollars, or can be queried manually through playgrounds like LMSYS)
mplug-owl
qwen-vl-chat
idefics (HF reproduction of Flamingo)

