vikhyat / moondream

tiny vision language model
https://moondream.ai
Apache License 2.0

Multiple images #56

Open fakerybakery opened 4 months ago

fakerybakery commented 4 months ago

Moondream2 is incredible! Is it possible to support multiple images? Thanks!

ProGamerGov commented 4 months ago

You are referring to using a batch size greater than 1, so that each prompt is applied to several images, correct?

Vigilence commented 4 months ago

TagGUI supports Moondream 1 and 2 as well as batch processing.

fakerybakery commented 4 months ago

No, I mean asking one question about multiple images (e.g., if I have 3 images of animals and I ask "What do these animals all have in common?").

Vigilence commented 4 months ago

I see, using several images in a single question/prompt. That's actually a great idea!

fakerybakery commented 4 months ago

I guess technically one could preprocess multiple images into one (e.g., a grid of images), but the model isn't trained on that, and the resolution of each image would be quite low.
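To put rough numbers on the resolution concern, here is a minimal sketch of the arithmetic for tiling several images onto one square canvas. The function name `grid_layout` and the 378-pixel default are assumptions for illustration only (378 is assumed to be the side length of the model's input; check the actual preprocessing before relying on it).

```python
import math


def grid_layout(num_images: int, canvas_px: int = 378) -> dict:
    """Near-square grid for num_images tiles on a canvas_px x canvas_px canvas.

    canvas_px=378 is an assumed model input size, not a confirmed value.
    """
    if num_images < 1:
        raise ValueError("num_images must be at least 1")
    # Smallest number of columns whose square grid can hold every image.
    cols = math.ceil(math.sqrt(num_images))
    rows = math.ceil(num_images / cols)
    # Square cells on a square canvas: the larger grid dimension sets the cell size.
    cell_px = canvas_px // max(cols, rows)
    return {"cols": cols, "rows": rows, "cell_px": cell_px}


if __name__ == "__main__":
    for n in (1, 3, 9):
        print(n, grid_layout(n))
```

Under that assumption, three images land in a 2x2 grid with 189-pixel cells, and nine images get 126-pixel cells, so each image keeps only a fraction of the single-image resolution.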

vikhyat commented 4 months ago

Would help to get examples of more real world use-cases for this. I'm definitely open to adding support for it, just need to understand what types of training data to generate.

fakerybakery commented 4 months ago

I was mainly thinking about images from multiple perspectives, image choices, etc. For example, what if the security camera example on your homepage could support multiple images? It would also be great for healthcare use cases.

zhaohm14 commented 4 months ago

Moondream is truly impressive, and your suggestion is fantastic! There have been times when I wanted to have multi-turn dialogues that require uploading several images, all while having Moondream remember the context and images from prior turns. For example, it would be incredibly convenient to fluidly discuss different movie elements, like scenes and characters, in an ongoing dialogue. Hoping to see this feature become a reality!

ghost commented 3 months ago

> Would help to get examples of more real world use-cases for this. I'm definitely open to adding support for it, just need to understand what types of training data to generate.

It could be useful for problems that involve reasoning over multiple figures. For example, finding the differences between two images, or describing small details of a scene when you're given multiple angles, which may help the model's reasoning.

shortcipher3 commented 2 months ago

I agree. I would like to pass in multiple images and ask whether it is the same person in both, or pass in multiple images of a scene and ask what has changed. Other examples: passing in multiple images from a property, or from an ADAS system, and asking what is happening.

saket424 commented 1 month ago

I was hoping something like amblegpt (https://github.com/mhaowork/amblegpt) can be supported with a self-hosted moondream backend (as a replacement for a gpt4o backend).

Thoughts?

# Video frame sampling settings
GAP_SECS = 3

# GPT config
DEFAULT_PROMPT = """
You're a helpful assistant helping to label a video for machine learning training
You are reviewing some continuous frames of a video footage as of {EVENT_START_TIME}. Frames are {GAP_SECS} second(s) apart from each other in the chronological order.
{CAMERA_PROMPT}
Please describe what happened in the video in JSON format. Do not print any markdown syntax!
Answer like the following:
{{
    "num_persons" : 2,
    "persons" : [
    {{
        "height_in_meters": 1.75,
        "duration_of_stay_in_seconds": 15,
        "gender": "female",
        "age": 50
    }},
    {{
        "height_in_meters": 1.60,
        "duration_of_stay_in_seconds": 15,
        "gender": "unknown",
        "age": 36
    }}
    ],
    "summary": "SUMMARY",
    "title": "TITLE"
}}

You can guess their height and gender. It is 100 percent fine to be inaccurate.

You can measure their duration of stay given the time gap between frames.

You should take the time of event into account.
For example, if someone is trying to open the door in the middle of the night, it would be suspicious. Be sure to mention it in the SUMMARY.

Most importantly, be sure to mention anything unusual, considering all the context.

Some example SUMMARIES are
    1. One person walked by towards right corner with her dog without paying attention towards the camera's direction.
    2. One Amazon delivery person (in blue vest) dropped off a package.
    3. A female is waiting, facing the door.
    4. Suspicious: A person is wandering without obvious purpose in the middle of the night, which seems suspicious.
    5. Suspicious: A person walked into the frame from outside, picked up a package, and left.
       The person didn't wear any uniform so this doesn't look like a routine package pickup. Be aware of potential package theft!

TITLE is a one sentence summary of the event. Use no more than 10 words.

Write your answer in {RESULT_LANGUAGE} language.
"""