vikhyat / moondream

tiny vision language model
https://moondream.ai
Apache License 2.0

Suggestions for future model #53

Open Vigilence opened 6 months ago

Vigilence commented 6 months ago

Thank you for your model!

If you want inspiration for future improvements to moondream, check out qwen-vl-max, which I think is the best multimodal model at the moment.

If you want to squeeze more detail out of your model, maybe you could have it automatically slice the image into several pieces, caption each slice, caption the whole image, and then combine all the captions into a single caption.

I do this manually (cropping the image several times in Photoshop) for problem images, and I am able to get the model to see details it would normally miss or ignore.
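
For anyone who wants to try this without Photoshop, here is a rough sketch of the slice-then-caption idea. It is not moondream's API: `caption_fn` is a hypothetical callable standing in for whatever captioning call you use, the 2x2 grid is an arbitrary default, and joining the captions with plain string concatenation is only my assumption about what "combine" could mean. The `image` argument just needs a `.size` attribute and a `.crop(box)` method, which is the convention Pillow images follow.

```python
from typing import Callable, List, Tuple

# (left, upper, right, lower), the box convention used by Pillow's Image.crop
Box = Tuple[int, int, int, int]


def tile_boxes(width: int, height: int, rows: int = 2, cols: int = 2) -> List[Box]:
    """Split a width x height image into a rows x cols grid of crop boxes."""
    boxes = []
    for r in range(rows):
        for c in range(cols):
            left = c * width // cols
            upper = r * height // rows
            right = (c + 1) * width // cols
            lower = (r + 1) * height // rows
            boxes.append((left, upper, right, lower))
    return boxes


def caption_with_slices(
    image,
    caption_fn: Callable[[object], str],  # hypothetical: image in, caption out
    rows: int = 2,
    cols: int = 2,
) -> str:
    """Caption every slice, caption the whole image, then join the results."""
    width, height = image.size
    slice_captions = [
        caption_fn(image.crop(box)) for box in tile_boxes(width, height, rows, cols)
    ]
    overall = caption_fn(image)
    # Naive combination (an assumption): overall caption first, then slice details.
    return overall + " Details: " + " ".join(slice_captions)
```

A more careful version would probably overlap the tiles a little so objects on a slice boundary are not cut in half, and would pass the collected captions through a language model to merge them instead of concatenating.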

vikhyat commented 5 months ago

Thanks!

The model currently only sees a 378x378 version of any image fed in, so it won't be able to see fine-grained details. I don't want to force multi-cropping the way LLaVA-1.6 did because it increases FLOPs per image too much. I will definitely look into making it an option users can opt into.
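
To make the trade-off concrete, here is a back-of-the-envelope sketch. The function names and the `multicrop` flag are hypothetical, not part of moondream. It assumes each crop is resized to the same 378x378 input as the full image and that every view costs one pass through the vision encoder, so the pass count is only a rough proxy for FLOPs per image.

```python
BASE_SIZE = 378  # input resolution mentioned in the comment above


def encoder_passes(multicrop: bool = False, rows: int = 2, cols: int = 2) -> int:
    """Vision-encoder passes per image: one global view, plus one per crop if opted in."""
    if not multicrop:
        return 1
    return rows * cols + 1


def effective_resolution(multicrop: bool = False, rows: int = 2, cols: int = 2):
    """Largest image size (width, height) the model sees without downscaling."""
    if not multicrop:
        return (BASE_SIZE, BASE_SIZE)
    return (cols * BASE_SIZE, rows * BASE_SIZE)


if __name__ == "__main__":
    for rows, cols in [(2, 2), (3, 3)]:
        print(
            f"{rows}x{cols} grid:",
            encoder_passes(True, rows, cols), "passes,",
            effective_resolution(True, rows, cols), "effective pixels",
        )
```

Under these assumptions a 2x2 grid doubles the resolution along each axis but needs five encoder passes instead of one, which is why keeping it off by default and letting users opt in seems reasonable.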