salesforce / ALBEF

Code for ALBEF: a new vision-language pre-training method
BSD 3-Clause "New" or "Revised" License

Architecture of ALBEF #144

Open Asaad-Pak opened 3 months ago

Asaad-Pak commented 3 months ago

Hello, I would like to run some experiments with the ALBEF model. I reviewed your paper as well, but I am unable to understand why the first six layers of BERT-base are used as the text encoder and the last six layers as the multimodal encoder. Why wasn't the entire 12-layer BERT-base used for both the text encoder and the multimodal encoder? Your help in this regard would be greatly appreciated. @LiJunnan1992 @svc-scm @chenxwh

phphuc612 commented 2 months ago

I think you can refer to the official review of this paper for this architecture question (see Reviewer XRzR).

Asaad-Pak commented 2 months ago

@phphuc612 Yeah, I have read those reviews, and the one you are referring to is the following:

> Reviewer question: It is not specified why the text encoder is initialized with the first 6 layers of a pretrained BERT-base model while the multimodal encoder is initialized using the last 6 layers.
>
> Authors' response: We made a straightforward design choice where the text encoder and the multimodal encoder have the same number of layers, hence each of them inherits 6 layers from a BERT-base (12-layer) model. We leave it as future work to explore other architectural designs.

I understand that they wanted the text encoder and multimodal encoder to have the same architecture, but I thought there might be a more specific reason for doing it this way. Why split the 6 layers out of the same BERT-base? They could have used all 12 layers of BERT-base as the text encoder and another 12-layer BERT (the same checkpoint or a different one) as the multimodal encoder. Reusing the same BERT-base weights for both might make the model more prone to overfitting, in my view, whereas combining two different models, e.g. BERT-base as the text encoder and DistilBERT as the multimodal encoder, could be considered. Also, the multimodal encoder should arguably have more layers, since its task of modeling the complex interactions between image and text is harder than the text encoder's. What are your thoughts on these points? I would be happy to hear from you. :)
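For concreteness, here is a rough sketch of the 6/6 split being discussed, written against HuggingFace `transformers` rather than this repository's own model code. The layer-copy logic and variable names are illustrative only, and the cross-attention sub-layers would start from random initialization since BERT-base has no pretrained weights for them:

```python
# Minimal sketch (not the official ALBEF code) of splitting a 12-layer
# BERT-base into a 6-layer text encoder and a 6-layer fusion encoder.
from transformers import BertConfig, BertModel

full_bert = BertModel.from_pretrained("bert-base-uncased")   # 12 layers

# Text encoder: embeddings + the first 6 transformer layers.
text_encoder = BertModel.from_pretrained("bert-base-uncased",
                                         num_hidden_layers=6)

# Multimodal (fusion) encoder: a 6-layer BERT with cross-attention enabled;
# its self-attention / feed-forward weights are copied from layers 6-11.
fusion_cfg = BertConfig.from_pretrained(
    "bert-base-uncased",
    num_hidden_layers=6,
    add_cross_attention=True,
    is_decoder=True,  # HF needs this flag to build cross-attention layers;
                      # note it also makes self-attention causal, which
                      # differs from ALBEF's bidirectional fusion encoder
)
fusion_encoder = BertModel(fusion_cfg)
for i, layer in enumerate(fusion_encoder.encoder.layer):
    src = full_bert.encoder.layer[6 + i]
    layer.attention.load_state_dict(src.attention.state_dict())
    layer.intermediate.load_state_dict(src.intermediate.state_dict())
    layer.output.load_state_dict(src.output.state_dict())
```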

phphuc612 commented 2 months ago

Thanks for sharing your insights, @Asaad-Pak.

From my point of view, I focus more on the authors' contribution to the learning-paradigm design than on the architecture design: (1) "momentum distillation", used together with momentum encoders for the contrastive objective and for finding hard negative samples for image-text matching, and (2) their approach of aligning the multimodal latent spaces through contrastive learning as a base pretext task before fusing them for further tasks such as masked language modeling and image-text matching (a bare-bones sketch of the contrastive-alignment idea in (2) follows below).
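To make point (2) concrete, the alignment step could look roughly like the snippet below. The function and argument names are made up for illustration, and the momentum-encoder feature queue and soft distillation targets that ALBEF layers on top are omitted:

```python
import torch
import torch.nn.functional as F

def itc_loss(image_cls, text_cls, proj_i, proj_t, temp=0.07):
    """Sketch of a symmetric image-text contrastive (ITC) loss.

    image_cls: (B, D_v) [CLS] features from the vision encoder
    text_cls:  (B, D_t) [CLS] features from the 6-layer text encoder
    proj_i / proj_t: linear heads mapping both modalities to a shared space
    """
    zi = F.normalize(proj_i(image_cls), dim=-1)         # (B, d)
    zt = F.normalize(proj_t(text_cls), dim=-1)          # (B, d)

    sim_i2t = zi @ zt.t() / temp                        # image-to-text logits
    sim_t2i = zt @ zi.t() / temp                        # text-to-image logits
    targets = torch.arange(zi.size(0), device=zi.device)

    # Plain symmetric InfoNCE; ALBEF additionally uses a momentum-encoder
    # queue, soft targets from momentum distillation, and contrastive
    # similarities to mine hard negatives for image-text matching.
    return 0.5 * (F.cross_entropy(sim_i2t, targets) +
                  F.cross_entropy(sim_t2i, targets))
```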

If I were the authors, I would approach it the way people often do as a first attempt: pick an existing architecture (BERT) that offers, first, a module to encode text and, second, a fusion module (e.g. cross-attention), and apply only a slight modification (splitting the layers in half rather than tuning the split) so it can quickly be adapted to my purposes. This sidesteps the problem of architecture design and leaves more room to discuss the two points above. Indeed, they set aside a whole section for momentum distillation and, in particular, a section on the "Mutual Information Maximization Perspective". It also shows in the ablation study, which focuses on how their learning strategy impacts model performance. These are my thoughts on why they chose that architecture.

Back to the point of architecture design, I think your idea is worth investigating, because the text keeps being fed through the same architecture while the image acts as a kind of "conditional guiding" for the text through cross-attention (see the rough illustration below). However, we should note that fusion does not happen only inside the BERT layers; it is also indirectly shaped by the contrastive learning, so you may want to take this into account when making comparisons in any follow-up research.
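As a rough illustration of that "conditional guiding" path, reusing the hypothetical `text_encoder` / `fusion_encoder` from the earlier snippet (this is not ALBEF's actual forward pass):

```python
# Text hidden states act as queries; image patch embeddings are the
# keys/values of the cross-attention sub-layers in the fusion encoder.
import torch

B, n_img, n_txt, d = 2, 197, 30, 768             # e.g. ViT-B/16: 196 patches + [CLS]
image_embeds = torch.randn(B, n_img, d)           # stand-in for vision-encoder outputs
text_ids = torch.randint(0, 30522, (B, n_txt))    # stand-in for tokenized captions

text_hidden = text_encoder(input_ids=text_ids).last_hidden_state
fused = fusion_encoder.encoder(                   # call the layer stack directly to avoid re-embedding
    hidden_states=text_hidden,
    encoder_hidden_states=image_embeds,
).last_hidden_state                               # (B, n_txt, d) multimodal token features
```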