Closed: Andy1621 closed this issue 1 year ago.
I see that when calculating the MLM loss, the text encoder takes the video embedding as keys/values and is forwarded/backwarded a second time: https://github.com/jayleicn/singularity/blob/bf4a86ec7506565d1f6805ee1612aa6029592776/models/model_pretrain.py#L22-L52

Would it be possible to forward the text encoder (the first 9 layers) only once? Or did that drop performance in your experiments?
For ITC and ITM, the input text is not masked; for MLM, it is. The first 9 layers therefore produce different outputs in the two cases. It would be possible to use the same masked version for all 3 losses, but we haven't experimented with it. You are welcome to try it out; we'd love to see how it goes.
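To make the two-pass structure concrete, here is a minimal sketch (the function and variable names are hypothetical placeholders, not the repo's actual API): the shared self-attention layers have to run once on the unmasked text and once on the masked text, because their outputs depend on the input tokens.

```python
# Hypothetical illustration of why the first 9 layers run twice per batch:
# ITC/ITM consume the *unmasked* text, while MLM consumes the *masked* text,
# so the shared self-attention layers see different inputs each time.

def forward_losses(text_layers, fusion_layers, video_embed,
                   text_ids, masked_text_ids, attn_mask):
    # Pass 1: unmasked text for the contrastive (ITC) and matching (ITM) losses.
    unmasked_hidden = text_layers(text_ids, attn_mask)            # layers 1-9, self-attn
    itc_itm_hidden = fusion_layers(unmasked_hidden, video_embed)  # layers 10-12, cross-attn

    # Pass 2: masked text for MLM. Pass 1 cannot be reused here, because
    # the token embeddings (and hence all 9 layer outputs) differ.
    masked_hidden = text_layers(masked_text_ids, attn_mask)
    mlm_hidden = fusion_layers(masked_hidden, video_embed)
    return itc_itm_hidden, mlm_hidden
```

Sharing a single masked pass across all 3 losses (as asked above) would collapse the two calls to `text_layers` into one, at the cost of feeding masked text to ITC/ITM.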
Besides, in the code the first 9 attention layers are self-attention and the last 3 are cross-attention. If the model is reused, the dimension of the video embedding should match that of the text embedding. If it doesn't (e.g., with BEiT-L as the visual encoder), should we add another projection layer to downsample the video embedding?
Yes, if the video embedding dimension is not the same as the text embedding dimension, a projection layer is required.
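For concreteness, a minimal sketch of such a projection, assuming BEiT-L's 1024-d visual features and a BERT-base (768-d) text encoder; the module name and dimensions are illustrative, not the repo's actual code:

```python
import torch.nn as nn

# BEiT-L outputs 1024-d patch features, while a BERT-base text encoder
# expects 768-d keys/values in its cross-attention layers, so a linear
# layer bridges the gap before fusion.
vision_width, text_width = 1024, 768
vision_proj = nn.Linear(vision_width, text_width)

# video_embed: (batch, num_tokens, 1024) -> (batch, num_tokens, 768)
# video_embed = vision_proj(video_embed)
```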
Thanks for your reply! I have tried using masked input for all 3 losses, and the results are worse at the beginning of training.
You might need to tune the hyper-parameters, since the defaults are set for our original training strategy.
Thanks for your suggestion!