microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI
MIT License

How to input bounding boxes to Kosmos-2? #1468

Open joshmyersdean opened 7 months ago

joshmyersdean commented 7 months ago

Describe the bug
Model I am using: Kosmos-2

There does not appear to be documentation on how to provide bounding boxes to Kosmos-2, e.g., how to reproduce Figure 2(4).

Thank you!

yutojubako commented 5 months ago

I am also interested in this issue, which seems to be unresolved: how can I include a bounding box in the prompt and have the model generate a response? I look forward to an answer.

pengzhiliang commented 5 months ago

@yutojubako @joshmyersdean Thank you for your attention.

To include a bounding box in the prompt, you need to quantize the coordinates of the input box in advance to get its location tokens, and then use them as input (note that they must follow the "link" format). Personally, I would first input a detailed caption as a prompt to get the location token corresponding to an instance, and then use the obtained location token as input for the subsequent VQA.
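In case it helps, here is a minimal sketch of that quantization step, assuming the 32x32 grid of location tokens and the `<patch_index_XXXX>` token spelling used by the released checkpoint; the exact grid size, the rounding rule for the bottom-right corner, and the token names should be checked against your tokenizer.

```python
def box_to_location_tokens(box, num_bins=32):
    """Quantize a normalized box (x0, y0, x1, y1) in [0, 1] into the two
    Kosmos-2 location tokens for its top-left and bottom-right corners.

    Assumes a num_bins x num_bins grid and <patch_index_XXXX> naming;
    the official code may round the bottom-right corner differently.
    """
    x0, y0, x1, y1 = box

    def to_index(x, y):
        col = min(int(x * num_bins), num_bins - 1)
        row = min(int(y * num_bins), num_bins - 1)
        return row * num_bins + col

    top_left = to_index(x0, y0)
    bottom_right = to_index(x1, y1)
    return f"<patch_index_{top_left:04d}>", f"<patch_index_{bottom_right:04d}>"


# "Link" format: the phrase is wrapped in <phrase>...</phrase> and linked to
# its location tokens inside <object>...</object>, e.g. for a box covering
# roughly the left half of the image:
tl, br = box_to_location_tokens((0.05, 0.10, 0.50, 0.90))
prompt = f"<grounding><phrase>a snowman</phrase><object>{tl}{br}</object>"
print(prompt)
```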

yutojubako commented 4 months ago

@pengzhiliang Thank you for your answer. I now have the quantized bounding box in the prompt and can convert it to location tokens. However, when I try to generate a detailed caption for the bounding box represented as location tokens, the model returns the same string as the input. I suspect the tokens inside the prompt are not being recognized correctly. Is there anything else I should be paying attention to? (Or if you could share the code to reproduce the Figure 2(4) example that @joshmyersdean mentioned, that would be great...)
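For reference, this is roughly what I am trying, as a sketch against the HuggingFace `microsoft/kosmos-2-patch14-224` checkpoint; the prompt wording and the patch-index values are only placeholders, not a documented recipe:

```python
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

processor = AutoProcessor.from_pretrained("microsoft/kosmos-2-patch14-224")
model = AutoModelForVision2Seq.from_pretrained("microsoft/kosmos-2-patch14-224")

image = Image.open("example.jpg")  # placeholder local image

# Grounded prompt: the region of interest is given as quantized location
# tokens; the phrasing "<phrase>It</phrase>... is" is only a guess at how
# to ask for a description of that region.
prompt = (
    "<grounding><phrase>It</phrase>"
    "<object><patch_index_0032><patch_index_0990></object> is"
)

inputs = processor(text=prompt, images=image, return_tensors="pt")
generated_ids = model.generate(
    pixel_values=inputs["pixel_values"],
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    image_embeds_position_mask=inputs["image_embeds_position_mask"],
    max_new_tokens=64,
)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
caption, entities = processor.post_process_generation(generated_text)
print(caption)
```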

jujeongho0 commented 4 months ago

@pengzhiliang Could I get information on the "link" format, or code to convert coordinates into location tokens?