microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI
MIT License

[Kosmos-2] Inputs/prompts for reproducing paper examples #1234

Open bryanoliveira opened 1 year ago

bryanoliveira commented 1 year ago

Hi! First of all thank you for the amazing work on building Kosmos-2!

I have some questions regarding the prompt format used for generating the paper examples.

I'm trying to use the model directly with a modified version of the demo code, which I made available here. My idea is to first reproduce the examples of the paper and then extend the model for my research, which would involve fine-tuning it with RL.

I understand that for Figure 1 I can feed this into the model:

<s> <image>Embedding</image> <grounding> <phrase>It</phrase><object><patch_index_0078><patch_index_0796></object> seats next to

And the model outputs something like "the campfire" (even though I couldn't get it to generate the bounding box for "campfire" correctly).

However, I couldn't get a prompt that would reproduce any of the chat examples in Figures 10 or 11. I tried variations of this:

<s> <image>Embedding 1</image><grounding>This is a downy woodpecker. <image>Embedding 2</image><grounding><phrase>a downy woodpecker</phrase>

But the model outputs <object><patch_index_0032><patch_index_1007></object>, which corresponds to the wrong bird.
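
To check which region a pair of location tokens refers to, here is a minimal sketch that decodes them into a normalized box. It assumes the 32 x 32 patch grid described in the Kosmos-2 paper (indices 0 to 1023, row-major). It also assumes the box runs from the top-left corner of the first patch to the bottom-right corner of the second. The official demo code may place the corners differently within a patch, so the numbers are approximate. The name `decode_object_span` is hypothetical.

```python
import re

GRID = 32  # assumed patches per side (32 x 32 grid, per the Kosmos-2 paper)


def decode_object_span(text, grid=GRID):
    """Return (x1, y1, x2, y2) in [0, 1] for the first <object> span, or None."""
    match = re.search(
        r"<object><patch_index_(\d+)><patch_index_(\d+)></object>", text
    )
    if match is None:
        return None
    top_left, bottom_right = int(match.group(1)), int(match.group(2))
    cell = 1.0 / grid
    # Row-major layout (assumption): index = row * grid + column.
    x1 = (top_left % grid) * cell
    y1 = (top_left // grid) * cell
    # Extend to the far edge of the bottom-right patch.
    x2 = (bottom_right % grid + 1) * cell
    y2 = (bottom_right // grid + 1) * cell
    return (x1, y1, x2, y2)


box = decode_object_span("<object><patch_index_0032><patch_index_1007></object>")
print(box)
```

Multiplying the result by the image width and height gives pixel coordinates that can be drawn over the image to see which bird was selected.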

Could you show what a prompt for chat and for multiple images should look like?

pengzhiliang commented 1 year ago

Hi, @bryanoliveira. Thanks for your attention! Sorry for the late response.

As you may know, Kosmos-2 is a generative model, so we cannot force it to produce a bounding box for every noun in the response (perhaps this can be addressed later with higher-quality instruction tuning data).

However, there is a simple method to obtain bounding boxes, for example:

<phrase>It</phrase><object><patch_index_0078><patch_index_0796></object> sits next to <phrase>
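
The trick above can be sketched as plain string construction. Leaving a `<phrase>` tag open at the end of the prompt means the continuation starts inside a phrase, which makes a following `<object>` span likely. The helper names are hypothetical, and the `<grounding>` prefix is taken from the prompts earlier in this thread. Inserting the image embedding is left to the demo code and is not shown.

```python
def patch_token(index):
    """Format a patch index as a location token, e.g. <patch_index_0078>."""
    return f"<patch_index_{index:04d}>"


def grounded_phrase(phrase, top_left, bottom_right):
    """Wrap a phrase together with its two location tokens."""
    return (
        f"<phrase>{phrase}</phrase>"
        f"<object>{patch_token(top_left)}{patch_token(bottom_right)}</object>"
    )


def grounding_prompt(prefix, open_phrase=True):
    """Build the text part of a grounding prompt.

    With open_phrase=True the prompt ends in an unclosed <phrase> tag,
    so the model continues with a phrase it can then ground.
    """
    prompt = f"<grounding> {prefix}"
    if open_phrase:
        prompt += " <phrase>"
    return prompt


prompt = grounding_prompt(grounded_phrase("It", 78, 796) + " sits next to")
print(prompt)
```

The generated continuation can then be searched for the `<object>` span that follows the closing `</phrase>` tag.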

Additionally, as with many other LLMs, we enabled sampling in the demo and adjusted the sampling parameters to obtain the figures in the paper.
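
To illustrate why sampled outputs differ from run to run, here is a minimal sketch of temperature plus top-p (nucleus) sampling over a list of logits. The thread does not say which sampling parameters were used for the paper figures, so temperature and top-p are assumptions chosen because they are common. The function name and default values are hypothetical.

```python
import math
import random


def sample_top_p(logits, temperature=1.0, top_p=0.9, rng=random):
    """Sample one token index using temperature scaling and nucleus filtering."""
    scaled = [value / temperature for value in logits]
    # Softmax with the maximum subtracted for numerical stability.
    peak = max(scaled)
    weights = [math.exp(value - peak) for value in scaled]
    total = sum(weights)
    probs = [weight / total for weight in weights]

    # Keep the most likely tokens until their cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for index in order:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break

    # Draw from the kept tokens in proportion to their probabilities.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


rng = random.Random(0)  # fixed seed so a run can be repeated
print([sample_top_p([2.0, 1.0, 0.1], rng=rng) for _ in range(5)])
```

A small `top_p` or a low temperature concentrates the choice on the most likely token, while larger values allow more varied continuations, which is one reason a single run may not match a published example.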

Thanks again~

bryanoliveira commented 1 year ago

Hi! Thanks for the response! It makes a lot of sense to add <phrase> to the prompt.

But what would be the correct way to use multiple images in a prompt? How could I reproduce the chat examples in Figures 10 and 11?

Thank you in advance.