microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI
MIT License

Is it possible to share the Kosmos-2.5 script that converts inference outputs to output image #1547

Closed ydshieh closed 1 month ago

ydshieh commented 1 month ago

Hi! First, thank you for open sourcing the Kosmos-2.5 model.

I have run it and it ran smoothly.

The inference outputs look like

{'model': 'kosmos 2.5', 'task': 'ocr', 'width': 772, 'height': 1000, 'results': [{'text': '9', 'bounding box': {'x0': 702, 'y0': 35, 'x1': 708, 'y1': 48}}, {'text': '0', 'bounding box': {'x0': 103, 'y0': 250, 'x1': 107, 'y1': 262}}, {'text': 'π/2', 'bounding box': {'x0': 162, 'y0': 250, 'x1': 176, 'y1': 262}}, 

but I couldn't find the script that converts this into an output image like the ones shared in the repository.

Is it possible for the team to share that conversion script? Thank you in advance!

Dod-o commented 1 month ago

Hi @ydshieh, I have uploaded the script we used for drawing bounding boxes; please find it here :)
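For readers who only have the output dictionary quoted in the question, here is a minimal sketch of how such boxes could be drawn with Pillow. It is not the team's uploaded script. The function names `extract_boxes` and `draw_boxes` are hypothetical, and the rescaling step assumes that the coordinates refer to an image of size `width` x `height` as reported in the output, which may not match how the model actually reports them.

```python
def extract_boxes(output):
    """Return (text, (x0, y0, x1, y1)) pairs from an OCR output dict."""
    boxes = []
    for item in output.get("results", []):
        # The key really contains a space in the output shown in the question.
        bb = item["bounding box"]
        boxes.append((item["text"], (bb["x0"], bb["y0"], bb["x1"], bb["y1"])))
    return boxes


def draw_boxes(image_path, output, save_path, color="red"):
    """Draw every bounding box from `output` onto the image and save it."""
    # Pillow is imported lazily so extract_boxes works without it installed.
    from PIL import Image, ImageDraw

    image = Image.open(image_path).convert("RGB")
    # Assumption: box coordinates are relative to output["width"] x
    # output["height"]; rescale in case the file on disk has another size.
    sx = image.width / output["width"]
    sy = image.height / output["height"]
    draw = ImageDraw.Draw(image)
    for _text, (x0, y0, x1, y1) in extract_boxes(output):
        draw.rectangle([x0 * sx, y0 * sy, x1 * sx, y1 * sy], outline=color, width=1)
    image.save(save_path)
```

Calling `draw_boxes("input.png", output, "output.png")` with the parsed inference output would write an annotated copy of the image; both file names are placeholders.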

ydshieh commented 1 month ago

Thank you a lot ^_^ !

atlury commented 1 month ago

What hardware are you running it on, @ydshieh? Can you share some details? I am planning to buy old P40s, but it seems Kosmos-2.5 depends on FlashAttention-2, which only works on RTX 30x0 and RTX 40x0 cards and some high-end cards like the A100 and H100.

ydshieh commented 1 month ago

I tried it once on an A10 machine.
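On the hardware question, a rough way to check a card before buying or running is to compare its CUDA compute capability against the FlashAttention-2 requirement. The sketch below assumes that requirement is Ampere or newer, that is compute capability 8.0 or higher, which matches the cards listed above; the helper names are hypothetical, and `current_gpu_capability` additionally assumes PyTorch is installed.

```python
# Assumption: FlashAttention-2 needs an Ampere-or-newer NVIDIA GPU,
# i.e. CUDA compute capability >= 8.0.
MIN_CAPABILITY = (8, 0)


def meets_flash_attn2_requirement(capability):
    """Return True if a (major, minor) compute capability is high enough."""
    return tuple(capability) >= MIN_CAPABILITY


def current_gpu_capability(device=0):
    """Return the (major, minor) capability of a local GPU, or None."""
    # PyTorch is imported lazily so the pure check above works without it.
    import torch

    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(device)


# Published compute capabilities: Tesla P40 is 6.1, A10 is 8.6.
print("P40:", meets_flash_attn2_requirement((6, 1)))
print("A10:", meets_flash_attn2_requirement((8, 6)))
```

Under this assumption a P40 would not qualify, while the A10 mentioned above would.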