Let's brainstorm some models we could integrate into our benchmark, starting with models that can handle image input (with or without text). I'll go first, but you can suggest any model in the comments below:
Gemini Vision (available for free at 60 requests per minute)
llava-13b (Replicate)
minigpt-4 (Replicate)
GPT-4V (a few $$, or manually through playgrounds like LMSYS)