microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI

BEIT3: multiway transformer #1041

Closed (violet-sto closed this issue 1 year ago)

violet-sto commented 1 year ago

Describe BEIT3: Hi, I notice that you use a decoupled Multiway Transformer as the backbone architecture. However, in your paper (arXiv version), there are three experts (V-FFN, L-FFN, and VL-FFN). Does decoupling mean that you removed the VL-FFN?
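For reference, this is my reading of the paper's Multiway block, as a minimal sketch (not the official code; the hidden sizes, pre-norm layout, and class names are my own assumptions): a shared self-attention module, then a V-FFN for image tokens, an L-FFN for text tokens, and, in the top layers only, a VL-FFN for fused vision-language tokens.

```python
import torch
import torch.nn as nn


def ffn(dim: int, hidden: int) -> nn.Sequential:
    return nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))


class MultiwayBlock(nn.Module):
    """Sketch of the paper's Multiway layer: shared attention + modality experts."""

    def __init__(self, dim: int = 768, heads: int = 12, use_vl_ffn: bool = False):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)  # shared across modalities
        self.v_ffn = ffn(dim, 4 * dim)                                   # vision expert
        self.l_ffn = ffn(dim, 4 * dim)                                   # language expert
        self.vl_ffn = ffn(dim, 4 * dim) if use_vl_ffn else None          # fusion expert (top layers)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, num_image_tokens: int, fuse: bool = False) -> torch.Tensor:
        # Shared self-attention over the concatenated [image tokens | text tokens] sequence.
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x)
        if fuse and self.vl_ffn is not None:
            # Top layers route every token through the VL-FFN.
            return x + self.vl_ffn(h)
        # Lower layers route image tokens to V-FFN and text tokens to L-FFN.
        img, txt = h[:, :num_image_tokens], h[:, num_image_tokens:]
        return x + torch.cat([self.v_ffn(img), self.l_ffn(txt)], dim=1)
```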

wenhui0924 commented 1 year ago

Hi @violet-sto,

We have done some exploration of the model architecture and found that using the decoupled Multiway Transformer achieves better performance than our previous architecture. Besides removing the VL-FFN, we also decouple the shared self-attention module. We will update these experiments on arXiv later.

violet-sto commented 1 year ago

Thanks for your reply!

I'm not sure I fully understand the decoupled Multiway Transformer. Does this mean that for image-text pairs, you still apply V-FFN and L-FFN separately in the top layers, instead of the VL-FFN?

wenhui0924 commented 1 year ago

Please refer to the implementation in torchscale. Specifically, we use different parameters (including self-attention and FFN) for the image and text encoders, but the model can still perform deep fusion for vision-language tasks.
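Roughly, the idea is the following (a simplified sketch, not the torchscale code; the names, shapes, and pre-norm layout are illustrative): each modality has its own attention projections and FFN, but attention is still computed over the concatenated image and text tokens, which is where the fusion happens.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ModalitySplit(nn.Module):
    """Apply branch A to image tokens and branch B to text tokens."""

    def __init__(self, make_module):
        super().__init__()
        self.A, self.B = make_module(), make_module()  # A: vision branch, B: language branch

    def forward(self, x: torch.Tensor, split: int) -> torch.Tensor:
        return torch.cat([self.A(x[:, :split]), self.B(x[:, split:])], dim=1)


class DecoupledMultiwayBlock(nn.Module):
    def __init__(self, dim: int = 768, heads: int = 12):
        super().__init__()
        self.dim, self.heads = dim, heads
        proj = lambda: nn.Linear(dim, dim)
        ffn = lambda: nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.q, self.k, self.v, self.out = (ModalitySplit(proj) for _ in range(4))
        self.ffn = ModalitySplit(ffn)
        self.norm1 = ModalitySplit(lambda: nn.LayerNorm(dim))
        self.norm2 = ModalitySplit(lambda: nn.LayerNorm(dim))

    def forward(self, x: torch.Tensor, num_image_tokens: int) -> torch.Tensor:
        b, n, _ = x.shape
        s = num_image_tokens
        h = self.norm1(x, s)
        # Modality-specific q/k/v projections (decoupled parameters) ...
        q, k, v = (m(h, s).view(b, n, self.heads, -1).transpose(1, 2) for m in (self.q, self.k, self.v))
        # ... but a single attention over all image + text tokens: this is where fusion happens.
        a = F.scaled_dot_product_attention(q, k, v).transpose(1, 2).reshape(b, n, self.dim)
        x = x + self.out(a, s)
        # Modality-specific FFNs replace the V-FFN / L-FFN / VL-FFN experts.
        return x + self.ffn(self.norm2(x, s), s)
```

In other words, the second parameter set is only a separate copy of the weights; the cross-modal interaction comes from the attention being computed over the full concatenated sequence rather than from a dedicated VL-FFN.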

LIRENDA621 commented 1 year ago

> Please refer to the implementation in torchscale. Specifically, we use different parameters (including self-attention and FFN) for the image and text encoders, but the model can still perform deep fusion for vision-language tasks.

Inspired by this issue, I carefully debugged the BEiT-3 code for fine-tuning on the VQAv2 dataset. As you said, both the self-attention layers and the FFNs (the red box in my first screenshot) are decoupled, and the number of encoder layers (L-F) is 12. [screenshot of the fine-tuning code] However, the code does not seem to contain the fusion-encoder part, i.e., the red box in the figure from the paper. [screenshot of the paper figure] Could you tell me why the fusion encoder described in the paper is missing? I would be very grateful if you could answer my doubts!
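If I understand the replies above correctly, no separate fusion-encoder module is needed because the image and text tokens already attend to each other inside the shared encoder, so only a task head sits on top of the fused output. A hypothetical sketch of what that head could look like (class names and the 3129-answer output size are my assumptions, not the repo's fine-tuning code); please correct me if this is wrong:

```python
import torch
import torch.nn as nn


class VQAHead(nn.Module):
    """Hypothetical VQA head over the already-fused encoder output."""

    def __init__(self, dim: int = 768, num_answers: int = 3129):
        super().__init__()
        self.pooler = nn.Sequential(nn.Linear(dim, dim), nn.Tanh())
        self.classifier = nn.Linear(dim, num_answers)

    def forward(self, fused_tokens: torch.Tensor) -> torch.Tensor:
        # fused_tokens: (batch, num_image_tokens + num_text_tokens, dim) from the encoder.
        pooled = self.pooler(fused_tokens[:, 0])  # e.g. pool a [CLS]-like token
        return self.classifier(pooled)
```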

zcuncun commented 1 year ago

> Please refer to the implementation in torchscale. Specifically, we use different parameters (including self-attention and FFN) for the image and text encoders, but the model can still perform deep fusion for vision-language tasks.

Will you release models with shared attention?

Cppowboy commented 1 year ago

> Hi @violet-sto,
>
> We have done some exploration of the model architecture and found that using the decoupled Multiway Transformer achieves better performance than our previous architecture. Besides removing the VL-FFN, we also decouple the shared self-attention module. We will update these experiments on arXiv later.

So there is no VL-FFN in the released checkpoints? I noticed that the released checkpoints contain two sets of parameters, "A" and "B". Are the "A" parameters the vision-related parameters and the "B" parameters the language-related ones?
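To check this on my side, I listed the checkpoint keys and grouped them by branch name (the filename and the assumption that branches appear as ".A." / ".B." in the key names are mine; adjust as needed):

```python
import torch

# Hypothetical filename; use whichever released checkpoint you downloaded.
ckpt = torch.load("beit3_base_patch16_224.pth", map_location="cpu")
state = ckpt.get("model", ckpt)  # some checkpoints wrap the state dict under "model"

a_keys = [k for k in state if ".A." in k]                       # presumably the vision branch
b_keys = [k for k in state if ".B." in k]                       # presumably the language branch
shared = [k for k in state if ".A." not in k and ".B." not in k]

print(f"A-branch params: {len(a_keys)}  (e.g. {a_keys[:2]})")
print(f"B-branch params: {len(b_keys)}  (e.g. {b_keys[:2]})")
print(f"shared params:   {len(shared)}")
```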