vikhyat / moondream

tiny vision language model
https://moondream.ai
Apache License 2.0

Moondream for Object Localization #101

Open E5GEN2 opened 1 month ago

E5GEN2 commented 1 month ago

I am wondering if Moondream can be used for grounding tasks such as object localization, something similar to what CogAgent does with GUIs, but trained on my custom dataset. If I fine-tune moondream on my custom dataset of images, bounding boxes, and text, is there a chance it would work?

vikhyat commented 1 month ago

Yes - the current version of moondream can detect one object per image. If you query with Bounding box: {object} it will return an array of 4 floating point numbers that indicate the relative (x1, y1) and (x2, y2) positions for the top-left and bottom-right corners. The next release will add support for multiple objects and may also change the output format. I'll post an update here when it's out.
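Since the reply above says the coordinates are relative (fractions of the image size), a small helper is needed to map them back to pixels before drawing or cropping. This is a minimal sketch; the function name and the assumption that values lie in 0..1 are mine, not part of the moondream API:

```python
def to_pixel_box(rel_box, width, height):
    """Convert a relative [x1, y1, x2, y2] box (values assumed in 0..1)
    to integer pixel coordinates for an image of the given size."""
    x1, y1, x2, y2 = rel_box
    return (round(x1 * width), round(y1 * height),
            round(x2 * width), round(y2 * height))

# e.g. a relative box applied to a 640x480 image:
print(to_pixel_box([0.25, 0.5, 0.75, 1.0], 640, 480))  # (160, 240, 480, 480)
```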

E5GEN2 commented 1 month ago

> Yes - the current version of moondream can detect one object per image. If you query with Bounding box: {object} it will return an array of 4 floating point numbers that indicate the relative (x1, y1) and (x2, y2) positions for the top-left and bottom-right corners. The next release will add support for multiple objects and may also change the output format. I'll post an update here when it's out.

What if I have a dataset of images + actions, i.e. {"x1": 420, "x2": 378, "y1": 1042, "y2": 245, "action": "swipe", "duration": 200}?

Would it be able to predict such actions if I train it on my dataset?

Is it possible to predict the next action for a sequence of images + actions? If not, what if I create a collage image of the previous images + actions? Would it be able to learn such a task?
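One way to try this is to serialize each action record into a text target, since the model already emits coordinates as text. This is purely a sketch of one possible encoding: the field names match the record above, but the prompt string, target layout, and the choice to normalize by a hypothetical 1080x1920 screen are my assumptions, not anything moondream prescribes:

```python
def action_to_example(record, width, height):
    """Turn one action record into a (prompt, target) text pair.
    Coordinates are normalized to 0..1 so the format matches the
    relative boxes moondream already emits. The prompt wording and
    target layout here are made up for illustration."""
    target = "{action} {x1:.3f} {y1:.3f} {x2:.3f} {y2:.3f} {duration}".format(
        action=record["action"],
        x1=record["x1"] / width, y1=record["y1"] / height,
        x2=record["x2"] / width, y2=record["y2"] / height,
        duration=record["duration"],
    )
    return ("Next action:", target)

rec = {"x1": 420, "x2": 378, "y1": 1042, "y2": 245,
       "action": "swipe", "duration": 200}
print(action_to_example(rec, width=1080, height=1920))
# ('Next action:', 'swipe 0.389 0.543 0.350 0.128 200')
```

For the sequence question, the same idea extends to concatenating the previous actions' serialized strings into the prompt alongside the collage image.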

Shalom-P commented 1 month ago

While making the dataset for fine-tuning, what is the format in which we have to give the coordinates? And are you using a separate regression loss, or is it entirely the text decoder producing the coordinates as a string?
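If the coordinates do come back purely as text from the decoder (an assumption; the thread does not confirm whether a regression head is involved), the caller needs to parse them out of the generated string. A minimal, defensive sketch:

```python
import re

def parse_box_string(text):
    """Extract four floats from a decoder output such as
    "[0.12, 0.34, 0.56, 0.78]". Returns None if four numbers
    are not found. Assumes text-only coordinate output, which
    is an assumption about moondream, not a documented fact."""
    nums = re.findall(r"-?\d+(?:\.\d+)?", text)
    if len(nums) < 4:
        return None
    return [float(n) for n in nums[:4]]

print(parse_box_string("[0.12, 0.34, 0.56, 0.78]"))  # [0.12, 0.34, 0.56, 0.78]
print(parse_box_string("no box found"))              # None
```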