vikhyat / moondream

tiny vision language model
https://moondream.ai
Apache License 2.0

Memory/continuity? #13

Open Yogevsho opened 5 months ago

Yogevsho commented 5 months ago

Sorry if this is a silly question, but is it possible for the model to somehow keep a memory of the previous images it got? To put it simply, can I give it different frames from a single video and have it answer questions with the understanding that it is a single video?

vikhyat commented 5 months ago

Hello! The current model was only trained on a single image at a time, but I'm definitely interested in training a version that can operate on multiple frames of a video. Can I ask what you're planning to use it for, so I can make sure to incorporate it into the training data?

Forest-Person commented 5 months ago

> Hello! The current model was only trained on a single image at a time, but I'm definitely interested in training a version that can operate on multiple frames of a video. Can I ask what you're planning to use it for, so I can make sure to incorporate it into the training data?

I want to use it for small RPi robots, if that helps. Being able to feed it a short video and get information about that input would be revolutionary for small home-scale robotics.

sujitvasanth commented 5 months ago

I wonder if, for now, you could use OpenCV to stitch the two images together and ask moondream to find differences between, say, the left and right images. Programmatically you could then use it to compare objects or faces. I'm going to try this, although it would be much better for the AI to do this within the NN itself, and it would also be really helpful to be able to fine-tune for it. For the dataset it would be really helpful to be able to compare two images, i.e. "is the same person in both images?" I tried with two 7 Up cans, one cherry and one regular, but it couldn't tell the subtle differences and thought they were the same.
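For reference, here is a rough sketch of that stitching idea. It assumes the moondream2 Hugging Face interface (`encode_image` / `answer_question`) from the repo README; the file paths and the prompt are placeholders.

```python
# Sketch: stitch two images side by side with OpenCV and ask moondream to compare them.
import cv2
import numpy as np
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "vikhyatk/moondream2"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

def stitch_horizontally(path_a, path_b):
    """Resize both images to a common height, then place them side by side."""
    a, b = cv2.imread(path_a), cv2.imread(path_b)
    h = min(a.shape[0], b.shape[0])
    a = cv2.resize(a, (int(a.shape[1] * h / a.shape[0]), h))
    b = cv2.resize(b, (int(b.shape[1] * h / b.shape[0]), h))
    stitched = np.hstack([a, b])  # same height, widths concatenate
    return Image.fromarray(cv2.cvtColor(stitched, cv2.COLOR_BGR2RGB))

image = stitch_horizontally("left.jpg", "right.jpg")  # placeholder paths
enc = model.encode_image(image)
print(model.answer_question(
    enc,
    "What differences do you see between the left and right halves of this image?",
    tokenizer,
))
```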

shortcipher3 commented 2 months ago

Perhaps related, but I would love to use this in a RAG fashion, where I can prompt the AI to answer some question about a collection of images, and it finds the most relevant images for the query and incorporates those into its context to answer the question. It seems like you could simply store SigLIP embeddings of the images in a database, and when a query is provided, find the matching results and add them to the language model's context (see the sketch after the examples below).

Imagine, for example, having a bunch of frames from a movie - you can now ask questions about the movie and hopefully get intelligent responses - e.g. when did X first appear in the movie, describe the scenes with actor Y, when did Z make his cameo, etc.

Here's another application - imagine having hours of sports footage and being able to ask when the touchdowns happened in a football game, when injuries happened, when assists were made, or to get all the plays made by player 12, etc.

Another application would be asking about your photo album collection - it seems like Google/Amazon Photos already have some search capability, but you could answer questions like "when did I visit Seattle?", "what are some of the names of the places I visited in Seattle?", or "what was the name of that park I visited in Seattle?" For some of those it would be nice to incorporate the image metadata, but even without it I think you could answer some interesting questions.
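A rough sketch of that retrieval step, assuming the transformers SigLIP integration (`SiglipModel` with `get_image_features` / `get_text_features`); the model name, frame paths, and query are placeholders, and any vector database could replace the in-memory index:

```python
# Sketch: embed frames with SigLIP, embed the text query, retrieve the closest frames.
import numpy as np
import torch
from PIL import Image
from transformers import AutoProcessor, SiglipModel

siglip_id = "google/siglip-base-patch16-224"  # assumed checkpoint name
model = SiglipModel.from_pretrained(siglip_id)
processor = AutoProcessor.from_pretrained(siglip_id)

def embed_images(paths):
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1).numpy()

def embed_query(text):
    inputs = processor(text=[text], padding="max_length", return_tensors="pt")
    with torch.no_grad():
        feats = model.get_text_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1).numpy()[0]

# "Index" the frames once, then retrieve the top-k matches for each question.
frame_paths = [f"frames/{i:05d}.jpg" for i in range(0, 5000, 50)]  # placeholder frames
index = embed_images(frame_paths)            # (N, D), rows are unit vectors
query_vec = embed_query("a touchdown being scored")
scores = index @ query_vec                   # cosine similarity
top_k = np.argsort(-scores)[:5]              # most relevant frames for the query
print([frame_paths[i] for i in top_k])
```

The retrieved frames could then be captioned or queried with moondream and the results fed into the language model's context.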

Tsardoz commented 4 weeks ago

On sequential images: I have found that getting Moondream to caption each image separately works. Then you send the captions to an LLM and ask it a question about them.

EDIT: The SLM in Moondream is poor at answering questions but good at image analysis, so you split the tasks: pair it with an LLM that is good at answering questions about text.
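As a concrete version of that workflow, here is a minimal sketch, again assuming the moondream2 Hugging Face interface (`encode_image` / `answer_question`); the frame paths are placeholders and the final call is left to whatever text LLM you prefer:

```python
# Sketch: caption each frame with moondream, then build a single text prompt for an LLM.
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "vikhyatk/moondream2"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

frame_paths = ["frame_000.jpg", "frame_030.jpg", "frame_060.jpg"]  # sampled video frames

captions = []
for path in frame_paths:
    enc = model.encode_image(Image.open(path))
    captions.append(model.answer_question(enc, "Describe this image.", tokenizer))

# Combine the per-frame captions into one prompt for a text LLM.
context = "\n".join(f"Frame {i}: {c}" for i, c in enumerate(captions))
question = "These frames come from one video. What is happening over time?"
prompt = f"{context}\n\n{question}"

# Send `prompt` to whichever text LLM you prefer; here we just print it.
print(prompt)
```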