Open siavashvj opened 6 months ago
same!
Gemini Vision model is limited as a single-message model. Gemini Flash supports conversational multi-message scenario though.
Recently, the scenario you mention is also fixed, using the last image message. So effectively, when using Gemini Vision, it can't account for every image it has seen and come up with a high-level answer on all images provided. However, it's able to keep on responding right now.
Other than Gemini limitation, I just tested Anthropic side with Haiku and it's working fine. Anthropic model also (as with Gemini Flash) can account for all the images it's seen.
My understanding is that Gemini Vision is just a transitional model, while Gemini 1.5 Pro and Gemini 1.5 Flash already support multi-turn conversations with images.
I have made some temporary modifications to enable the use of the new models and support multi-turn image conversations.
ps: This modification is just a quick fix to achieve the goal. I hope the project author can optimize it to be simpler.
This is already merged. It's working in main.
@NuerSir That was what I was trying to say. I've modified the message handling for Gemini Vision and upgraded the library to comply with this change. Main branch current state is working just fine.
Please check with current, up-to-date main branch.
When uploading an image to a vision model (Opus/Vision/4o) in a convo the image is added to the context just fine. But if subsequently you upload another image to the same convo no response is given.
Steps to reprodouce: