microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI
MIT License

Minimum code snippets to evaluate KOSMOS2 #1214

Open hewenbin opened 1 year ago

hewenbin commented 1 year ago

Hi,

Thank you so much for developing this impactful and impressive work! This work really bridges the gap in multimodal grounding capability to the visual world.

I would like to kindly ask if you can provide the simplest code snippets for phrase grounding tasks. Hopefully, this code snippet could enable us to experience the amazing phrase grounding capability of KOSMOS2 based on a single image and several noun phrases. I sincerely appreciate your time and help! Looking forward to hearing back from you.

BIGBALLON commented 1 year ago

Hi, @hewenbin

hewenbin commented 1 year ago

Thank you for the quick turnaround. I'm wondering if there is any code example that doesn't require a GUI. For example, I'd like a bash script that I can run on a batch of images.

yolandalalala commented 1 year ago

I would like to echo the same thing.

It would be very helpful to have a simple notebook tutorial with a few lines of code showing how to evaluate KOSMOS-2 given a single image, text, and bounding boxes as input. Without such a tutorial, it is time-consuming to figure out how to adapt KOSMOS-2 to other research projects (as opposed to using the interactive app). I sincerely hope this impactful work is recognized and used by more and more people, so that it can significantly benefit the visual grounding research community. At the moment, however, the learning curve of adapting this model to other research projects is an obstacle.

Thank you so much for your great efforts in developing this amazing work!

BIGBALLON commented 1 year ago

> Thank you for the quick turnaround. I'm wondering if there is any code example that doesn't require a GUI.

@yolandalalala @hewenbin Kosmos-2 is now supported by the Hugging Face team in the Transformers library.
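
A minimal, GUI-free sketch of single-image inference via that Transformers integration might look like the following. It assumes the `microsoft/kosmos-2-patch14-224` checkpoint and uses `example.jpg` as a placeholder for your own image path; details may vary across Transformers versions, so please check the model card for the authoritative usage.

```python
# Minimal sketch: grounded captioning with Kosmos-2 via Hugging Face Transformers.
# Assumes `transformers`, `torch`, and `Pillow` are installed and that
# "example.jpg" is a placeholder for your own image path.
from PIL import Image
from transformers import AutoProcessor, Kosmos2ForConditionalGeneration

checkpoint = "microsoft/kosmos-2-patch14-224"
model = Kosmos2ForConditionalGeneration.from_pretrained(checkpoint)
processor = AutoProcessor.from_pretrained(checkpoint)

image = Image.open("example.jpg")
prompt = "<grounding>An image of"  # the <grounding> token asks the model to emit bounding boxes

inputs = processor(text=prompt, images=image, return_tensors="pt")
generated_ids = model.generate(
    pixel_values=inputs["pixel_values"],
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    image_embeds=None,
    image_embeds_position_mask=inputs["image_embeds_position_mask"],
    max_new_tokens=128,
)

generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
# post_process_generation returns the cleaned caption plus a list of
# (phrase, (start, end), [normalized bounding boxes]) tuples.
caption, entities = processor.post_process_generation(generated_text)
print(caption)
print(entities)
```

For phrase grounding with your own noun phrases, the model card suggests prompts of the form `<grounding><phrase> a snowman</phrase>`, and the snippet above can be wrapped in a loop over a directory of images from a plain script, so no Gradio app is needed.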