Open BennoKrojer opened 1 year ago
Hi, the text_encoder has three modes. The default is "multi_modal"
Hi, thank you for your quick response! I found that there are three modes, but the checkpoint "refcoco.pth" we are loading was finetuned with models/model_retrieval.py, according to the paper. There the modes used are first "text" and then "fusion". However, in visualization.ipynb it is "multi_modal".
So how can the weights in the first six layers of the text_encoder handle image inputs? They were never asked to do that during training, right?
I replaced the forward() method in the visualization to use "text" and then "fusion", hoping it would improve or at least change something, but the visualization outputs stayed exactly the same.
One forward pass with 'multi_modal' is equivalent to two forward passes with 'text' followed by 'fusion'.
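This equivalence can be illustrated with a toy stand-in for ALBEF's text encoder (the layer functions and the fusion split at layer 6 are simplified assumptions, not ALBEF's actual code): the first six layers never see image features, so running all twelve in one 'multi_modal' pass is the same computation as a 'text' pass followed by a 'fusion' pass.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
text = rng.normal(size=(3, d))   # toy text features
img = rng.normal(size=(3, d))    # toy image features
Ws = [rng.normal(size=(d, d)) for _ in range(12)]  # per-layer weights
Wc = [rng.normal(size=(d, d)) for _ in range(6)]   # cross-attention stand-ins

def text_layer(x, w):
    # text-only layer: no image input at all
    return np.tanh(x @ w)

def fusion_layer(x, img, w, wc):
    # fusion layer: mixes in image features (toy cross-attention)
    return np.tanh(x @ w + img @ wc)

def forward(x, mode):
    # layers 0-5 run in 'text' and 'multi_modal'; layers 6-11 in 'fusion' and 'multi_modal'
    if mode in ("text", "multi_modal"):
        for w in Ws[:6]:
            x = text_layer(x, w)
    if mode in ("fusion", "multi_modal"):
        for w, wc in zip(Ws[6:], Wc):
            x = fusion_layer(x, img, w, wc)
    return x

one_pass = forward(text, "multi_modal")
two_pass = forward(forward(text, "text"), "fusion")
assert np.allclose(one_pass, two_pass)  # identical outputs
```

This also explains why swapping the visualization's forward() for the two-stage version changed nothing: the two code paths compute exactly the same function.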
Hi! First of all, thank you for your work; it has been easy to use and performs well so far. I am currently confused by the forward() method in your visualization: https://github.com/salesforce/ALBEF/blob/main/visualization.ipynb
In all other ALBEF models, including the one refcoco.pth was trained with, the text_encoder is usually used in two stages, so that the text is first processed alone. Here, however, the whole BERT text encoder receives image features from the very first layer:
For example, in model_retrieval.py the mode is "text" instead, which means only the first 6 layers are used:
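To make the mode-to-layers mapping concrete, here is a minimal stub (class and method names are mine, with the fusion split at layer 6 taken from the discussion above) that records which of the 12 BERT layers each mode runs; it is a sketch, not ALBEF's actual encoder:

```python
class StubTextEncoder:
    """Hypothetical stub mirroring ALBEF's three text_encoder modes."""
    FUSION_LAYER = 6   # assumed split point between text-only and fusion layers
    NUM_LAYERS = 12

    def layers_for(self, mode="multi_modal"):
        if mode == "text":
            # layers 0-5: text-only, no image features
            return list(range(self.FUSION_LAYER))
        if mode == "fusion":
            # layers 6-11: cross-attend to image features
            return list(range(self.FUSION_LAYER, self.NUM_LAYERS))
        # 'multi_modal': all 12 layers in a single pass
        return list(range(self.NUM_LAYERS))

enc = StubTextEncoder()
# 'text' then 'fusion' covers exactly the same layers as 'multi_modal'
assert enc.layers_for("text") + enc.layers_for("fusion") == enc.layers_for("multi_modal")
```

Under this mapping, the image features only ever enter at layer 6 and later, regardless of which mode is used to invoke the encoder.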
I tried the same visualization.ipynb but with a two-stage forward method for the text_encoder, and it gives exactly the same results. Shouldn't your forward() method perform worse, since refcoco.pth was not trained with the first six layers receiving image tokens?
Thank you! Benno