microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI
MIT License

BEiT-3 self-attention is not shared across modalities #1109

Open xinghua-qu opened 1 year ago

xinghua-qu commented 1 year ago

Hi,

I found that the BEiT-3 model implemented on top of torchscale does not match what the paper describes. In the multiway transformer, the self-attention layer should be shared across modalities. However, that is not the case in the implementation, as shown in the screenshot: there are two parallel self-attention layers (A and B) instead of a single shared one.

[screenshot of the torchscale implementation]
wenhui0924 commented 1 year ago

Hi @xinghua-qu,

Please refer to Section 11 of the BEiT-3 Supp.

We explored several architecture variants and found that the attention parameters can be decoupled across modalities while still preserving the ability to perform deep fusion, so the released models use this architecture.
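
The distinction can be sketched in a few lines of PyTorch. The block below is not the torchscale code; it is a minimal illustration, assuming tokens are concatenated as [vision | language] with a known split position, of how a multiway block differs when self-attention is shared versus decoupled per modality (with per-modality FFN experts in both cases). All class and argument names here are made up for the sketch.

```python
# Minimal multiway-block sketch (not the torchscale implementation).
# `share_attention=True` corresponds to the shared-attention design in the paper;
# `share_attention=False` corresponds to the decoupled (per-modality) attention
# described in the reply, where fusion still happens via attention over all tokens.
import torch
import torch.nn as nn


def make_ffn(dim, hidden):
    return nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))


class MultiwayBlock(nn.Module):
    def __init__(self, dim=768, heads=12, ffn_hidden=3072, share_attention=True):
        super().__init__()
        self.share_attention = share_attention
        self.attn_v = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Shared: language tokens reuse attn_v. Decoupled: they get their own attn_l.
        self.attn_l = self.attn_v if share_attention else nn.MultiheadAttention(
            dim, heads, batch_first=True
        )
        # Modality experts: separate FFNs for vision and language tokens.
        self.ffn_v = make_ffn(dim, ffn_hidden)
        self.ffn_l = make_ffn(dim, ffn_hidden)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x, split_pos):
        # x: (batch, seq, dim); tokens [:split_pos] are vision, [split_pos:] are language.
        h = self.norm1(x)
        if self.share_attention:
            # One attention over the full sequence with a single set of parameters.
            attn_out, _ = self.attn_v(h, h, h, need_weights=False)
        else:
            # Decoupled parameters: each modality queries with its own attention
            # module, but keys/values cover the full sequence, so cross-modal
            # fusion through attention is still possible.
            out_v, _ = self.attn_v(h[:, :split_pos], h, h, need_weights=False)
            out_l, _ = self.attn_l(h[:, split_pos:], h, h, need_weights=False)
            attn_out = torch.cat([out_v, out_l], dim=1)
        x = x + attn_out
        h = self.norm2(x)
        # Route each modality through its own FFN expert.
        x = x + torch.cat(
            [self.ffn_v(h[:, :split_pos]), self.ffn_l(h[:, split_pos:])], dim=1
        )
        return x


if __name__ == "__main__":
    tokens = torch.randn(2, 196 + 32, 768)        # 196 image patches + 32 text tokens
    block = MultiwayBlock(share_attention=False)  # decoupled variant from the reply
    print(block(tokens, split_pos=196).shape)     # torch.Size([2, 228, 768])
```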