- If your task requires deep fusion of images and text, I think Method 1 is better. For a large number of unpaired images and texts, Method 2 is better. For Method 1, you can also try using the average of all image/text token hidden states (see the sketch after this list).
- For generating image and text features with modality-fused information, you can try Method 1.
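For reference, here is a rough sketch of the averaging variant of Method 1. It is not from the official code: the `model.beit3` forward signature `(textual_tokens, visual_tokens, text_padding_position)` and the output keys `"encoder_out"` / `"multiway_split_position"` are assumptions based on `beit3/modeling_finetune.py`, and `image`, `text_ids`, and `padding_mask` are assumed to be preprocessed already.

```python
import torch

# Rough, untested sketch of the averaging variant of Method 1.
# Assumptions: `model` is a BEiT3Wrapper-style module whose `model.beit3`
# forward accepts (textual_tokens, visual_tokens, text_padding_position) and
# returns a dict with "encoder_out" and "multiway_split_position", as in
# beit3/modeling_finetune.py; `padding_mask` is 1 at padded text positions.
@torch.no_grad()
def mean_pooled_features(model, image, text_ids, padding_mask):
    outputs = model.beit3(
        textual_tokens=text_ids,
        visual_tokens=image,
        text_padding_position=padding_mask,
    )
    x = outputs["encoder_out"]                  # [batch, seq_len, dim]
    split = outputs["multiway_split_position"]  # index of the first text token ([bos])
    # Average all image token states (vision tokens come first in the sequence).
    image_feat = x[:, :split, :].mean(dim=1)
    # Average only the non-padded text token states.
    text_states = x[:, split:, :]
    valid = (1 - padding_mask).unsqueeze(-1).to(text_states.dtype)
    text_feat = (text_states * valid).sum(dim=1) / valid.sum(dim=1).clamp(min=1e-6)
    return image_feat, text_feat
```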
For Method 1, is the token at multiway_split_position actually the [bos] token? Does it contain the necessary text information?
Yes, it is the [bos] token. If you use the pretrained BEiT-3 model without fine-tuning, I am not sure whether using the [bos] token or the average representation is better.
Thanks for your reply; we will try these methods later.
Dear authors, thanks for your fantastic work. I wonder how we can generate multi-modal representations with the pretrained BEiT-3 model?
Method 1: Input the image and caption simultaneously, take output[0, :] as the image representation and output[multiway_split_position, :] as the text representation (similar to https://github.com/microsoft/unilm/blob/master/beit3/modeling_finetune.py#L83).
Method 2: Input the image only and take output[0, :] as the image representation; then input the caption only and take output'[0, :] as the text representation (similar to https://github.com/microsoft/unilm/blob/master/beit3/modeling_finetune.py#L244).
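Concretely, something like the following untested sketch of both methods. It assumes the `model.beit3` encoder forward signature `(textual_tokens, visual_tokens, text_padding_position)` and the output keys `"encoder_out"` / `"multiway_split_position"` from `beit3/modeling_finetune.py`, with `image`, `text_ids`, and `padding_mask` already preprocessed:

```python
import torch

@torch.no_grad()
def method1_fused(model, image, text_ids, padding_mask):
    # Method 1: feed image and caption together, then split the fused sequence.
    outputs = model.beit3(
        textual_tokens=text_ids,
        visual_tokens=image,
        text_padding_position=padding_mask,
    )
    x = outputs["encoder_out"]                  # [batch, seq_len, dim]
    split = outputs["multiway_split_position"]  # position of the text [bos] token
    image_feat = x[:, 0, :]                     # image [CLS] token
    text_feat = x[:, split, :]                  # text [bos] token
    return image_feat, text_feat

@torch.no_grad()
def method2_separate(model, image, text_ids, padding_mask):
    # Method 2: encode each modality on its own and take each leading token.
    img_out = model.beit3(
        textual_tokens=None, visual_tokens=image, text_padding_position=None
    )
    txt_out = model.beit3(
        textual_tokens=text_ids, visual_tokens=None, text_padding_position=padding_mask
    )
    image_feat = img_out["encoder_out"][:, 0, :]  # image [CLS] token
    text_feat = txt_out["encoder_out"][:, 0, :]   # text [bos] token
    return image_feat, text_feat
```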