shivanshkaushikk / mistral-image-captioning-agent

Image Captioning Agent using Mistral 7B

Naive Question #1

Open shersoni610 opened 9 months ago

shersoni610 commented 9 months ago

Hello,

Sorry for the naive question: if BLIP is doing the captioning part, what is Mistral doing?

shivanshkaushikk commented 9 months ago

> Hello,
>
> Sorry for the naive question: if BLIP is doing the captioning part, what is Mistral doing?

Hey, we are using BLIP as a tool here: it is just one part of the toolkit, over which we put an inference layer (i.e. Mistral). By "inference layer" I mean that Mistral's (or any other LLM's) job is to understand the user query and respond in the most relevant way. For example, say we have an audio transcription tool, an image captioning tool, and a RAG tool; whatever question the user asks, it is the job of our LLM agent to understand it and call the relevant tool. What I have here is a very basic use case, but the goal was to show how we can build complex custom tools and use a decent LLM to build an agent (and also run it on CPU).
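To make the division of labour concrete, here is a minimal, hypothetical sketch of that routing idea (not the repo's actual code); the tool bodies and the `llm` callable are placeholders:

```python
# Hypothetical sketch of the tool-routing idea described above.
# `llm` is any callable that sends a prompt to an LLM (e.g. Mistral 7B)
# and returns its text response; the tool functions are stand-ins.

def caption_image(path: str) -> str:
    # Stand-in for a BLIP captioning call.
    return f"a caption for {path}"

def transcribe_audio(path: str) -> str:
    # Stand-in for an audio transcription tool.
    return f"a transcript of {path}"

TOOLS = {
    "image_captioning": caption_image,
    "audio_transcription": transcribe_audio,
}

def route(query: str, path: str, llm) -> str:
    # The LLM never captions anything itself; it only decides which tool fits.
    prompt = (
        f"Pick exactly one tool name from {list(TOOLS)} "
        f"for this request: {query!r}. Answer with the tool name only."
    )
    tool_name = llm(prompt).strip()
    tool = TOOLS.get(tool_name, caption_image)  # fall back to a default tool
    return tool(path)
```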

Hope this answers your query, let me know if there's any other doubt!

Thanks

shersoni610 commented 9 months ago

Thanks. If I understand correctly, you want Mistral to answer questions based on the BLIP output. But BLIP itself has shortcomings, and Mistral may exaggerate them. I would prefer llava2 over BLIP.

shivanshkaushikk commented 9 months ago

> Thanks. If I understand correctly, you want Mistral to answer questions based on the BLIP output. But BLIP itself has shortcomings, and Mistral may exaggerate them. I would prefer llava2 over BLIP.

You understood this correctly! I used BLIP just because I have worked with it in the past and wanted to show a demo of an image captioning tool, but it's totally up to you if you want to use a larger or better model!
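For anyone who does want to swap the captioning model, one way to keep that flexibility (a sketch assuming the tool wraps a Hugging Face `transformers` pipeline, which may differ from the repo's actual code) is to make the checkpoint a parameter:

```python
# Minimal sketch: wrap the captioning model behind a factory so the
# checkpoint is easy to swap. Assumes the `transformers` library; the
# BLIP model id is just one example choice.
from transformers import pipeline

def make_caption_tool(model_id: str = "Salesforce/blip-image-captioning-base"):
    # Any image-to-text checkpoint can be passed in instead of BLIP.
    captioner = pipeline("image-to-text", model=model_id)

    def caption(image_path: str) -> str:
        # The pipeline returns a list of dicts with a "generated_text" field.
        return captioner(image_path)[0]["generated_text"]

    return caption
```

The agent code then only ever calls the returned `caption` function, so replacing the underlying model does not change the tool interface the LLM routes to.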