- If your task requires deep fusion of images and text, I think Method 1 is better. For a large number of unpaired images and texts, Method 2 is better. For Method 1, you can also try using the average of all image/text token hidden states (see the sketch after this list).
- For generating image and text features with modality-fused information, you can try Method 1.
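For reference, here is a rough sketch of the averaging variant of Method 1. It is not from the official code: the `model.beit3` forward signature `(textual_tokens, visual_tokens, text_padding_position)` and the output keys `"encoder_out"` / `"multiway_split_position"` are assumptions based on `beit3/modeling_finetune.py`, and `image`, `text_ids`, and `padding_mask` are assumed to be preprocessed already.

```python
import torch

# Rough, untested sketch of the averaging variant of Method 1.
# Assumptions: `model` is a BEiT3Wrapper-style module whose `model.beit3`
# forward accepts (textual_tokens, visual_tokens, text_padding_position) and
# returns a dict with "encoder_out" and "multiway_split_position", as in
# beit3/modeling_finetune.py; `padding_mask` is 1 at padded text positions.
@torch.no_grad()
def mean_pooled_features(model, image, text_ids, padding_mask):
    outputs = model.beit3(
        textual_tokens=text_ids,
        visual_tokens=image,
        text_padding_position=padding_mask,
    )
    x = outputs["encoder_out"]                  # [batch, seq_len, dim]
    split = outputs["multiway_split_position"]  # index of the first text token ([bos])
    # Average all image token states (vision tokens come first in the sequence).
    image_feat = x[:, :split, :].mean(dim=1)
    # Average only the non-padded text token states.
    text_states = x[:, split:, :]
    valid = (1 - padding_mask).unsqueeze(-1).to(text_states.dtype)
    text_feat = (text_states * valid).sum(dim=1) / valid.sum(dim=1).clamp(min=1e-6)
    return image_feat, text_feat
```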
For Method 1, is the token at multiway_split_position actually the [bos] token? Does it contain the necessary text information?
Yes, it is the [bos] token. If you use the pretrained BEiT-3 model without fine-tuning, I am not sure whether using the [bos] token or the average representation is better.
Thanks for your reply; we will try these methods later.
Dear authors, thanks for your fantastic work. I wonder how we can generate multi-modal representations with the pretrained BEiT-3 model?
Method 1: Input the image and caption simultaneously, take output[0, :] as the image representation and output[multiway_split_position, :] as the text representation (similar to https://github.com/microsoft/unilm/blob/master/beit3/modeling_finetune.py#L83).
Method 2: Input the image only and take output[0, :] as the image representation; then input the caption only and take output'[0, :] as the text representation (similar to https://github.com/microsoft/unilm/blob/master/beit3/modeling_finetune.py#L244).
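Concretely, something like the following untested sketch of both methods. It assumes the `model.beit3` encoder forward signature `(textual_tokens, visual_tokens, text_padding_position)` and the output keys `"encoder_out"` / `"multiway_split_position"` from `beit3/modeling_finetune.py`, with `image`, `text_ids`, and `padding_mask` already preprocessed:

```python
import torch

@torch.no_grad()
def method1_fused(model, image, text_ids, padding_mask):
    # Method 1: feed image and caption together, then split the fused sequence.
    outputs = model.beit3(
        textual_tokens=text_ids,
        visual_tokens=image,
        text_padding_position=padding_mask,
    )
    x = outputs["encoder_out"]                  # [batch, seq_len, dim]
    split = outputs["multiway_split_position"]  # position of the text [bos] token
    image_feat = x[:, 0, :]                     # image [CLS] token
    text_feat = x[:, split, :]                  # text [bos] token
    return image_feat, text_feat

@torch.no_grad()
def method2_separate(model, image, text_ids, padding_mask):
    # Method 2: encode each modality on its own and take each leading token.
    img_out = model.beit3(
        textual_tokens=None, visual_tokens=image, text_padding_position=None
    )
    txt_out = model.beit3(
        textual_tokens=text_ids, visual_tokens=None, text_padding_position=padding_mask
    )
    image_feat = img_out["encoder_out"][:, 0, :]  # image [CLS] token
    text_feat = txt_out["encoder_out"][:, 0, :]   # text [bos] token
    return image_feat, text_feat
```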