salesforce / ALBEF

Code for ALBEF: a new vision-language pre-training method
BSD 3-Clause "New" or "Revised" License

Missing text-only Transformer in visualization notebook #95

Open BennoKrojer opened 1 year ago

BennoKrojer commented 1 year ago

Hi! First of all, thank you for your work; it has been easy to use and performant so far. I am currently confused by the forward() method in your visualization notebook: https://github.com/salesforce/ALBEF/blob/main/visualization.ipynb

In all other ALBEF models, including the one refcoco.pth was trained with, the text_encoder is used in two stages, so that the text is first processed on its own. Here, however, the whole BERT text encoder receives the image embeddings from the very first layer:

    def old_forward(self, image, text):
        image_embeds = self.visual_encoder(image) 

        image_atts = torch.ones(image_embeds.size()[:-1],dtype=torch.long).to(image.device)

        output = self.text_encoder(text.input_ids, 
                                attention_mask = text.attention_mask,
                                encoder_hidden_states = image_embeds,
                                encoder_attention_mask = image_atts,      
                                return_dict = True,
                               )     

        vl_embeddings = output.last_hidden_state[:,0,:]
        vl_output = self.itm_head(vl_embeddings)   
        return vl_output

For example, in model_retrieval.py the mode is instead "text", which means only the first 6 layers are used:

    text_output = self.text_encoder(text.input_ids, attention_mask = text.attention_mask,
                                    return_dict = True, mode = 'text')

I tried the same visualization.ipynb but with a two-stage forward method for the text_encoder (sketched below), and it gives exactly the same results. Shouldn't your forward() method perform worse, since refcoco.pth was not trained with the first six layers receiving image tokens?
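For reference, this is roughly the two-stage variant I tried (a sketch from memory; the encoder_embeds and mode keyword arguments follow how model_retrieval.py calls the text_encoder, so the exact signature may differ slightly):

    def two_stage_forward(self, image, text):
        image_embeds = self.visual_encoder(image)
        image_atts = torch.ones(image_embeds.size()[:-1], dtype=torch.long).to(image.device)

        # Stage 1: text-only pass through the first six layers (no cross-attention)
        text_output = self.text_encoder(text.input_ids,
                                        attention_mask = text.attention_mask,
                                        return_dict = True,
                                        mode = 'text')

        # Stage 2: fusion pass through the last six layers, cross-attending to the image
        output = self.text_encoder(encoder_embeds = text_output.last_hidden_state,
                                   attention_mask = text.attention_mask,
                                   encoder_hidden_states = image_embeds,
                                   encoder_attention_mask = image_atts,
                                   return_dict = True,
                                   mode = 'fusion')

        vl_embeddings = output.last_hidden_state[:, 0, :]
        vl_output = self.itm_head(vl_embeddings)
        return vl_output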

Thank you! Benno

LiJunnan1992 commented 1 year ago

Hi, the text_encoder has three modes; the default is "multi_modal":

https://github.com/salesforce/ALBEF/blob/fb384204472feab2a85bd4f5790d7889c31672c9/models/xbert.py#L550
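Roughly, the mode just selects which range of encoder layers is run. Paraphrasing the logic around the linked line (exact variable names may differ; fusion_layer is 6 in the default config):

    # Paraphrased sketch of the layer selection in BertEncoder.forward (xbert.py)
    if mode == 'text':
        start_layer, output_layer = 0, self.config.fusion_layer                              # layers 0-5
    elif mode == 'fusion':
        start_layer, output_layer = self.config.fusion_layer, self.config.num_hidden_layers  # layers 6-11
    elif mode == 'multi_modal':
        start_layer, output_layer = 0, self.config.num_hidden_layers                         # all layers in one pass
    # Only the layers from fusion_layer onward have cross-attention, so the first
    # six layers do not look at encoder_hidden_states in any mode.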

BennoKrojer commented 1 year ago

Hi, thank you for your quick response! I saw that there are three modes, but the checkpoint "refcoco.pth" we are loading was finetuned with models/model_retrieval.py, according to the paper, and there the modes used are first "text" and then "fusion". In visualization.ipynb, however, it is "multi_modal".

So how can the weights in the first six layers of the text_encoder deal with image inputs? They were never asked to do that during training, right?

BennoKrojer commented 1 year ago

I replaced the forward() method in the visualization with one that uses "text" and then "fusion", expecting it to improve or at least change something. But the visualization outputs stayed exactly the same.

LiJunnan1992 commented 1 year ago

One forward pass with 'multi_modal' is the same as two forward passes with 'text' and then 'fusion'.
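If you want to convince yourself, a quick sanity check along these lines should show matching outputs (a sketch only; it assumes model, image_embeds, image_atts and the tokenized text are already prepared as in the notebook):

    with torch.no_grad():
        # One pass through all 12 layers (default mode = 'multi_modal')
        out_one = model.text_encoder(text.input_ids,
                                     attention_mask = text.attention_mask,
                                     encoder_hidden_states = image_embeds,
                                     encoder_attention_mask = image_atts,
                                     return_dict = True)

        # Two passes: layers 0-5 on text only, then layers 6-11 with cross-attention
        txt = model.text_encoder(text.input_ids,
                                 attention_mask = text.attention_mask,
                                 return_dict = True, mode = 'text')
        out_two = model.text_encoder(encoder_embeds = txt.last_hidden_state,
                                     attention_mask = text.attention_mask,
                                     encoder_hidden_states = image_embeds,
                                     encoder_attention_mask = image_atts,
                                     return_dict = True, mode = 'fusion')

    # The [CLS] embeddings (and hence the itm_head scores) should match, which is
    # consistent with the identical visualizations reported above.
    print(torch.allclose(out_one.last_hidden_state[:, 0, :],
                         out_two.last_hidden_state[:, 0, :]))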