Closed — violet-sto closed this issue 1 year ago
Hi @violet-sto,
We have done some exploration of the model architecture and found that the decoupled Multiway Transformer achieves better performance than our previous architecture. Besides removing the VL-FFN, we also decouple the shared self-attention module. We will update these experiments on arXiv later.
Thanks for your reply!
I'm not sure I fully understand the decoupled Multiway design. Does this mean that for image-text pairs, you still apply V-FFN and L-FFN separately in the top layers, instead of VL-FFN?
Please refer to the implementation in torchscale. Specifically, we use different parameters (including self-attention and FFN) for the image and text encoders, but the model can still perform deep fusion for vision-language tasks.
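To make "different parameters but still deep fusion" concrete, here is a minimal numpy sketch of how I understand the decoupled design (the function and parameter names such as `multiway_attention` and `split` are my own, not the actual torchscale API): each modality's token span gets its own Q/K/V projection weights, but attention is then computed over the joint image+text sequence, so every layer still fuses the two modalities.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multiway_attention(tokens, split, params_img, params_txt):
    """Decoupled Multiway self-attention (illustrative sketch only).

    tokens[:split] are image tokens, tokens[split:] are text tokens.
    Each span uses its own modality-specific Q/K/V weights, but the
    attention itself runs over the concatenated sequence -- that joint
    attention is where the vision-language 'deep fusion' happens.
    """
    def project(W_img, W_txt):
        # modality-specific projection, applied per span
        return np.concatenate(
            [tokens[:split] @ W_img, tokens[split:] @ W_txt], axis=0
        )

    q = project(params_img["Wq"], params_txt["Wq"])
    k = project(params_img["Wk"], params_txt["Wk"])
    v = project(params_img["Wv"], params_txt["Wv"])

    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return attn @ v  # every token attends to both modalities
```

The FFNs are decoupled the same way: the image span goes through one FFN and the text span through another, with no shared VL-FFN on top.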
Inspired by this issue, I carefully debugged the BEiT-3 code for fine-tuning on the VQAv2 dataset. I found that, as you said, both the self-attention layers and the FFNs (the red box in the figure below) are decoupled, and the number of encoder layers (L-F) is 12. However, the following code does not contain the fusion-encoder part, which is the red box in the figure below. Could you please tell me why the fusion encoder from the paper is missing? I would be very grateful if you could resolve my doubts!
Will you release models with shared attention?
So there is no VL-FFN in the released checkpoints? I noticed that the released checkpoints contain two sets of parameters, "A" and "B". Are the "A" parameters vision-related and the "B" parameters language-related?
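For what it's worth, the checkpoints can be inspected along exactly those lines. In torchscale's `MultiwayNetwork` the two duplicated branches are stored as submodules named `A` and `B`, so their parameters appear in the state dict with `.A.` / `.B.` in the key path. A small sketch (the helper name `split_multiway_keys` and the example keys below are mine, for illustration):

```python
def split_multiway_keys(keys):
    """Partition checkpoint keys into the two Multiway branches.

    torchscale's MultiwayNetwork keeps its duplicated submodules as
    attributes 'A' and 'B', so state-dict keys carry '.A.' / '.B.'.
    Per the discussion above, 'A' holds the vision parameters and
    'B' the language parameters.
    """
    branch_a = [k for k in keys if ".A." in k]  # vision branch
    branch_b = [k for k in keys if ".B." in k]  # language branch
    shared = [k for k in keys if ".A." not in k and ".B." not in k]
    return branch_a, branch_b, shared
```

Passing `torch.load(ckpt)["model"].keys()` (or however the released checkpoint nests its state dict) to this helper should show the A/B split directly.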
Describe BEIT3: Hi, I notice that you use the decoupled Multiway Transformer as the backbone architecture. However, in your paper (arXiv version), there are three experts (V-FFN, L-FFN, and VL-FFN). Does decoupling mean you removed the VL-FFN?