Since I'm not familiar with this area, another question: is the text encoder forwarded/backwarded twice, once for the VTM/VTC losses and once for the MLM loss?
I see that when calculating the MLM loss, the text encoder takes the video embedding as keys/values, so the text encoder is forwarded/backwarded a second time.
https://github.com/klauscc/VindLU/blob/30465487e8314fa1df45b1457b0313f25649e054/models/criterions.py#L228-L248
But in the code, the first 9 attention layers are self-attention and the last 3 are cross-attention. If the model is reused, the dimension of the video embedding should match that of the text embedding. If it doesn't (e.g., with BEiT-L as the visual encoder), should we add another projection layer to downsample the video embedding?
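To make the question concrete, here is a minimal sketch of the kind of projection I have in mind (hypothetical names in plain PyTorch, not the actual VindLU code), assuming BEiT-L features are 1024-d while the text encoder is BERT-base with a 768-d hidden size:

```python
import torch
import torch.nn as nn

text_dim, video_dim = 768, 1024      # BERT-base vs. BEiT-L hidden sizes (assumed)
batch, n_text, n_video = 2, 32, 196

# Hypothetical projection that downsamples the video embedding to the text width
# before it is used as keys/values in the cross-attention layers.
vision_proj = nn.Linear(video_dim, text_dim)

cross_attn = nn.MultiheadAttention(embed_dim=text_dim, num_heads=12, batch_first=True)

text_embeds = torch.randn(batch, n_text, text_dim)     # queries from the text encoder
video_embeds = torch.randn(batch, n_video, video_dim)  # outputs of the visual encoder

video_kv = vision_proj(video_embeds)                   # (batch, n_video, text_dim)
fused, _ = cross_attn(query=text_embeds, key=video_kv, value=video_kv)
print(fused.shape)  # torch.Size([2, 32, 768])
```

Is a projection like `vision_proj` what you would recommend here, or is there a better way to handle the dimension mismatch?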
Thanks for your help!
Sorry about the late reply.
Regarding before Spatial Attention vs. after FFN: they actually have only minor differences. In the before spatial attention scheme, if you move the temporal attention of the first layer to after the FFN of the last layer, it becomes the after FFN scheme. We also don't observe much difference between initializing the temporal attentions randomly and initializing them from the spatial self-attentions.

Thanks for your answer!
Thanks for your great work! I really appreciate the detailed experiments. However, I find some differences between the implementation and the paper:
The paper says the Temporal Attention is inserted before spatial attention, as in TimeSformer. But in the code, the temporal attention seems to be inserted after the FFN. Is it better?
https://github.com/klauscc/VindLU/blob/30465487e8314fa1df45b1457b0313f25649e054/models/backbones/beit/st_beit.py#L744-L755
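For reference, here is a minimal sketch of the two orderings being compared (hypothetical modules in plain PyTorch; LayerNorms and the TimeSformer-style reshaping between frames and patches are omitted, and this is not the actual st_beit.py code):

```python
import torch
import torch.nn as nn

class BlockTemporalBeforeSpatial(nn.Module):
    """Ordering described in the paper: temporal attention before spatial attention."""
    def __init__(self, dim, heads):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):                       # x: (batch, frames * patches, dim)
        x = x + self.temporal_attn(x, x, x)[0]  # temporal first
        x = x + self.spatial_attn(x, x, x)[0]   # then spatial
        x = x + self.ffn(x)                     # then FFN
        return x

class BlockTemporalAfterFFN(nn.Module):
    """Ordering the code seems to use: temporal attention after the FFN."""
    def __init__(self, dim, heads):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        x = x + self.spatial_attn(x, x, x)[0]   # spatial first
        x = x + self.ffn(x)                     # then FFN
        x = x + self.temporal_attn(x, x, x)[0]  # temporal last
        return x

# Stacking either block N times gives almost the same overall sequence of operations:
# [T, S, F] x N vs. [S, F, T] x N differ only in where the very first / very last
# temporal attention sits, which matches the explanation in the reply above.
x = torch.randn(2, 8 * 196, 768)  # (batch, frames * patches, dim)
print(BlockTemporalBeforeSpatial(768, 12)(x).shape)
print(BlockTemporalAfterFFN(768, 12)(x).shape)
```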