shersoni610 opened 9 months ago
Hello
sorry for the naive question:
If BLIP is doing the captioning part, what is Mistral doing?
Hey, we are using BLIP as a tool here; you could say it is just one part of our arsenal of weapons, over which we put an inference layer (i.e. Mistral). By inference layer, I mean that Mistral's (or any other LLM's) job is to understand the user's query and respond in the most relevant way. For example, say we have an audio transcription tool, an image captioning tool, and a RAG tool: whatever question the user asks, it is the job of our LLM agent to understand it and call the relevant tool. What I have is a very basic use case, but the goal was to show how we can build complex custom tools and use a decent LLM to build an agent (and also run it on CPU).
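To make the pattern concrete, here is a minimal sketch of that tool-dispatch idea. The tool functions and the `route` helper are hypothetical stand-ins: in the real setup Mistral decides which tool to call and BLIP does the captioning, but here stub tools and a trivial keyword router keep the example self-contained and runnable.

```python
# Sketch of an LLM-agent tool dispatcher. The stubs below are
# placeholders: swap in real BLIP / Whisper / retriever calls, and
# replace route() with an actual LLM tool-selection prompt.

def image_captioning_tool(path):
    # placeholder for a BLIP captioning call
    return f"caption for {path}"

def audio_transcription_tool(path):
    # placeholder for an audio transcription call
    return f"transcript of {path}"

def rag_tool(query):
    # placeholder for a retrieval-augmented answer
    return f"retrieved answer for: {query}"

TOOLS = {
    "caption": image_captioning_tool,
    "transcribe": audio_transcription_tool,
    "search": rag_tool,
}

def route(query):
    """Stand-in for the LLM: map the user query to a tool name."""
    q = query.lower()
    if "image" in q or "picture" in q:
        return "caption"
    if "audio" in q or "said" in q:
        return "transcribe"
    return "search"

def agent(query, payload):
    """Pick the relevant tool for the query and run it on the payload."""
    tool_name = route(query)
    return TOOLS[tool_name](payload)

print(agent("What is in this image?", "photo.jpg"))
# -> caption for photo.jpg
```

The point is only the shape of the agent loop: the LLM sits on top as the router, and each tool stays a plain function it can invoke.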
Hope this answers your query, let me know if there's any other doubt!
Thanks
Thanks. If I understand correctly, you want Mistral to answer questions based on the BLIP output. But BLIP itself has shortcomings, and Mistral may exaggerate BLIP's weaknesses. I would prefer llava2 over BLIP.
You understood this correctly! I used BLIP just because I have worked with it in the past and wanted to show a demo of an image captioning tool, but it's totally up to you if you want to use a larger or better model!