klauscc / VindLU


Two questions about the implementation #3

Closed Andy1621 closed 1 year ago

Andy1621 commented 1 year ago

Thanks for your great work! I really appreciate the detailed experiments. However, I found some differences between the implementation and the paper:

  1. As mentioned in https://github.com/klauscc/VindLU/issues/1, the MVM loss is not used. Does MVM really help?
  2. In the original paper, temporal attention is inserted before spatial attention, as in TimeSformer. But in the code, temporal attention seems to be inserted after the FFN (see the sketch below). Is that better? https://github.com/klauscc/VindLU/blob/30465487e8314fa1df45b1457b0313f25649e054/models/backbones/beit/st_beit.py#L744-L755
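
For concreteness, here is a minimal sketch of the two placements, assuming a pre-norm divided space-time block; the class and helper names are illustrative, not the actual `st_beit.py` modules:

```python
import torch
import torch.nn as nn

class STBlock(nn.Module):
    """Minimal sketch of one space-time transformer block; `mode` switches
    where temporal attention is applied (illustrative, not the repo's code)."""

    def __init__(self, dim=768, heads=12, mode="after_ffn"):
        super().__init__()
        self.mode = mode  # "before_spatial" (TimeSformer-style) or "after_ffn"
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_t, self.norm_s, self.norm_f = (nn.LayerNorm(dim) for _ in range(3))
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def _temporal(self, x):  # attend over the T axis at each spatial position
        b, t, n, c = x.shape
        xt = x.permute(0, 2, 1, 3).reshape(b * n, t, c)
        y = self.norm_t(xt)
        xt = xt + self.temporal_attn(y, y, y)[0]
        return xt.reshape(b, n, t, c).permute(0, 2, 1, 3)

    def _spatial(self, x):  # attend over the N patches within each frame
        b, t, n, c = x.shape
        xs = x.reshape(b * t, n, c)
        y = self.norm_s(xs)
        xs = xs + self.spatial_attn(y, y, y)[0]
        return xs.reshape(b, t, n, c)

    def forward(self, x):  # x: (B, T, N, C)
        if self.mode == "before_spatial":  # paper figure: TA -> SA -> FFN
            x = self._temporal(x)
        x = self._spatial(x)
        x = x + self.ffn(self.norm_f(x))
        if self.mode == "after_ffn":       # the code's placement: SA -> FFN -> TA
            x = self._temporal(x)
        return x
```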
Andy1621 commented 1 year ago

Since I'm not familiar with this area, another question: is the text encoder forwarded/backwarded twice, once for the VTM/VTC losses and once for the MLM loss? I see that when calculating the MLM loss, the text encoder takes the video embedding as keys/values and is forwarded/backwarded again. https://github.com/klauscc/VindLU/blob/30465487e8314fa1df45b1457b0313f25649e054/models/criterions.py#L228-L248 In the code, the first 9 attention layers are self-attention and the last 3 are cross-attention. If the model is reused, the dimension of the video embedding should match that of the text embedding. If it doesn't (e.g., with BEiT-L as the visual encoder), should we add another projection layer to downsample the video embedding?
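
To make the two passes concrete, here is a hedged sketch of such a dual-use text encoder, assuming a 12-layer stack whose last 3 layers add cross-attention to video; all names and shapes are illustrative, not VindLU's actual API:

```python
import torch
import torch.nn as nn

class SketchTextEncoder(nn.Module):
    """Illustrative sketch (not VindLU's API): layers 0-8 are self-attention
    only; layers 9-11 additionally cross-attend to video when it is given."""

    def __init__(self, dim=768, heads=12, n_layers=12, n_fusion=3):
        super().__init__()
        self.self_layers = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True)
            for _ in range(n_layers)
        )
        self.cross_attns = nn.ModuleList(
            nn.MultiheadAttention(dim, heads, batch_first=True)
            for _ in range(n_fusion)
        )
        self.n_fusion = n_fusion

    def forward(self, text_emb, video_emb=None):
        n_self_only = len(self.self_layers) - self.n_fusion
        for i, layer in enumerate(self.self_layers):
            text_emb = layer(text_emb)
            # In the last `n_fusion` layers, cross-attend text -> video.
            # video_emb must already have the text encoder's width
            # (this is where the dimension-matching question arises).
            if video_emb is not None and i >= n_self_only:
                ca = self.cross_attns[i - n_self_only]
                text_emb = text_emb + ca(text_emb, video_emb, video_emb)[0]
        return text_emb

enc = SketchTextEncoder()
text = torch.randn(2, 20, 768)         # embeddings of the unmasked caption
masked_text = torch.randn(2, 20, 768)  # embeddings with [MASK]ed tokens
video = torch.randn(2, 50, 768)        # video embeddings (keys/values)

fused = enc(text, video)               # pass 1: feeds the VTM head
mlm_hidden = enc(masked_text, video)   # pass 2: feeds the MLM head
```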

Thanks for your help!

klauscc commented 1 year ago

Sorry about the late reply.

  1. MVM: yes. MVM slightly improves performance (~1%). However, it slows down training by about 40%: adding MVM requires one more forward pass per iteration, i.e., feeding the masked video as input to predict the masked tokens (a rough sketch follows after this list). Considering the tradeoff between accuracy and computational budget, we didn't use MVM in our final model. We will update the draft soon.
  2. We observe similar performance when placing the temporal attention before spatial attention or after the FFN; the two schemes actually differ only slightly. If, in the before-spatial-attention scheme, you move the first layer's temporal attention to after the last layer's FFN, it becomes the after-FFN scheme. We also don't observe much difference between initializing the temporal attentions randomly and initializing them from the spatial self-attentions.
  3. Yes. During MLM the text encoder is forwarded again, but with the masked text as input. For MVM we also need to forward the video encoder again, and forwarding the video encoder is much more expensive than forwarding the text encoder.
  4. Yes, we need to add a projection layer to make the dimensions of the video and text embeddings the same (see the second sketch below).
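
A rough, self-contained sketch of the extra forward pass that MVM costs per iteration; the module names, masking scheme, and regression target here are all illustrative, and the actual MVM objective may differ:

```python
import torch
import torch.nn as nn

video_encoder = nn.Linear(32, 32)  # stand-in for the real video backbone
mvm_head = nn.Linear(32, 32)       # stand-in for the masked-token predictor

video = torch.randn(4, 10, 32)     # (batch, video tokens, dim)
mask = torch.rand(4, 10) < 0.5     # which video tokens are masked out

video_emb = video_encoder(video)             # forward 1: feeds VTC/VTM/MLM
masked_video = video.masked_fill(mask.unsqueeze(-1), 0.0)
masked_emb = video_encoder(masked_video)     # forward 2: only needed for MVM

# Predict the original (unmasked) features at the masked positions.
mvm_loss = nn.functional.mse_loss(
    mvm_head(masked_emb)[mask], video_emb.detach()[mask]
)
```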
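
And a minimal sketch of the projection mentioned in point 4, assuming a BEiT-L-sized visual width of 1024 and a BERT-base text width of 768 (names illustrative):

```python
import torch
import torch.nn as nn

video_dim, text_dim = 1024, 768
vision_proj = nn.Linear(video_dim, text_dim)  # single linear projection

video_emb = torch.randn(2, 50, video_dim)  # (batch, video tokens, C_v)
video_kv = vision_proj(video_emb)          # now (2, 50, text_dim); usable
                                           # as keys/values in cross-attention
```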
Andy1621 commented 1 year ago

Thanks for your answer!