Open fakerybakery opened 8 months ago
You are referring to a batch size greater than 1 to apply each prompt to, correct?
Taggui supports moondream 1 and 2 as well as batch processing.
No, I mean asking one question about multiple images (i.e., if I have 3 images of animals, asking "what do these animals all have in common?")
I see, using several images in a single question/prompt. That's actually a great idea!
I guess technically one could preprocess multiple images into one (i.e., a grid of images), but the model isn't trained on that, and the resolution of each image would be quite low.
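For reference, the grid workaround mentioned above is straightforward to sketch. This is a minimal illustration using Pillow, assuming a fixed tile size; function and parameter names are mine, not part of moondream, and the resolution loss per tile is exactly the drawback noted above.

```python
from PIL import Image

def make_grid(paths, cols=2, tile=(336, 336)):
    """Tile several images into one grid image so a single-image
    model can 'see' them together. Each source image is resized
    to `tile`, so per-image resolution drops as you add images."""
    rows = -(-len(paths) // cols)  # ceiling division
    grid = Image.new("RGB", (cols * tile[0], rows * tile[1]))
    for i, path in enumerate(paths):
        img = Image.open(path).convert("RGB").resize(tile)
        # place tile i at column (i % cols), row (i // cols)
        grid.paste(img, ((i % cols) * tile[0], (i // cols) * tile[1]))
    return grid
```

Three images with `cols=2` produce a 2x2 grid with one empty cell, which the model may or may not handle gracefully, another reason native multi-image support would be preferable.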
Would help to get examples of more real world use-cases for this. I'm definitely open to adding support for it, just need to understand what types of training data to generate.
I was mainly thinking about images from multiple perspectives, image choices, etc. For example, what if the security camera example on your homepage could support multiple images? Also would be great for healthcare use-cases
Moondream is truly impressive, and your suggestion is fantastic! There have been times when I wanted to have multi-turn dialogues that require uploading several images, all while having moondream remember the context and images from prior turns. For example, it would be incredibly convenient to fluidly discuss different movie elements, like scenes and characters, in an ongoing dialogue. Hoping to see this feature become a reality!
> Would help to get examples of more real world use-cases for this. I'm definitely open to adding support for it, just need to understand what types of training data to generate.
It could be useful for problems involving multi-figure reasoning. For example, finding the differences between two images, or describing small details of a scene when given multiple angles, which may help reasoning.
I agree. I would like to pass in multiple images and ask if it is the same person in both, pass in multiple images of a scene and ask what has changed, pass in multiple images from a property and ask what is happening, or pass in multiple images from an ADAS system and ask the same.
I was hoping something like amblegpt (https://github.com/mhaowork/amblegpt) could be supported with a self-hosted moondream backend (as a replacement for a gpt4o backend).
Thoughts?
```python
# Video frame sampling settings
GAP_SECS = 3

# GPT config
DEFAULT_PROMPT = """
You're a helpful assistant helping to label a video for machine learning training.
You are reviewing some continuous frames of video footage as of {EVENT_START_TIME}. Frames are {GAP_SECS} second(s) apart from each other in chronological order.
{CAMERA_PROMPT}
Please describe what happened in the video in JSON format. Do not print any markdown syntax!
Answer like the following:
{{
  "num_persons": 2,
  "persons": [
    {{
      "height_in_meters": 1.75,
      "duration_of_stay_in_seconds": 15,
      "gender": "female",
      "age": 50
    }},
    {{
      "height_in_meters": 1.60,
      "duration_of_stay_in_seconds": 15,
      "gender": "unknown",
      "age": 36
    }}
  ],
  "summary": "SUMMARY",
  "title": "TITLE"
}}
You can guess their height and gender. It is 100 percent fine to be inaccurate.
You can estimate their duration of stay given the time gap between frames.
You should take the time of the event into account.
For example, if someone is trying to open the door in the middle of the night, it would be suspicious. Be sure to mention it in the SUMMARY.
Most importantly, be sure to mention anything unusual considering all the context.
Some example SUMMARIES are:
1. One person walked by towards the right corner with her dog without paying attention towards the camera's direction.
2. One Amazon delivery person (in blue vest) dropped off a package.
3. A female is waiting, facing the door.
4. Suspicious: A person is wandering without obvious purpose in the middle of the night, which seems suspicious.
5. Suspicious: A person walked into the frame from outside, picked up a package, and left. The person didn't wear any uniform so this doesn't look like a routine package pickup. Be aware of potential package theft!
TITLE is a one-sentence summary of the event. Use no more than 10 words.
Write your answer in {RESULT_LANGUAGE} language.
"""
```
> Would help to get examples of more real world use-cases for this. I'm definitely open to adding support for it, just need to understand what types of training data to generate.
Hello @vikhyat, a real-world use case is multimodal RAG, in which multiple 'pages' / image chunks from different documents are returned. One approach is to stitch them together, but it's hacky.
Moondream2 is incredible! Is it possible to support multiple images? Thanks!