Closed: Andy1621 closed this issue 1 year ago.
I see that when calculating the MLM loss, the text encoder takes the video embedding as keys/values and is forwarded/backwarded a second time: https://github.com/jayleicn/singularity/blob/bf4a86ec7506565d1f6805ee1612aa6029592776/models/model_pretrain.py#L22-L52

Would it be possible to forward the text encoder (the first 9 layers) only once? Or did that drop performance in your experiments?
For ITC and ITM, the input text is not masked; for MLM, it is. The first 9 layers therefore produce different outputs in the two cases. It would be possible to use the same masked version for all 3 losses, but we haven't experimented with it. You are welcome to try it out; we'd love to see how it goes.
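To make the two-pass structure concrete, here is a minimal sketch (the function and variable names are hypothetical placeholders, not the repo's actual API): the shared self-attention layers have to run once on the unmasked text and once on the masked text, because their outputs depend on the input tokens.

```python
# Hypothetical illustration of why the first 9 layers run twice per batch:
# ITC/ITM consume the *unmasked* text, while MLM consumes the *masked* text,
# so the shared self-attention layers see different inputs each time.

def forward_losses(text_layers, fusion_layers, video_embed,
                   text_ids, masked_text_ids, attn_mask):
    # Pass 1: unmasked text for the contrastive (ITC) and matching (ITM) losses.
    unmasked_hidden = text_layers(text_ids, attn_mask)            # layers 1-9, self-attn
    itc_itm_hidden = fusion_layers(unmasked_hidden, video_embed)  # layers 10-12, cross-attn

    # Pass 2: masked text for MLM. Pass 1 cannot be reused here, because
    # the token embeddings (and hence all 9 layer outputs) differ.
    masked_hidden = text_layers(masked_text_ids, attn_mask)
    mlm_hidden = fusion_layers(masked_hidden, video_embed)
    return itc_itm_hidden, mlm_hidden
```

Sharing a single masked pass across all 3 losses (as asked above) would collapse the two calls to `text_layers` into one, at the cost of feeding masked text to ITC/ITM.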
Besides, in the code the first 9 attention layers are self-attention and the last 3 are cross-attention. If the model is reused, the dimension of the video embedding should match that of the text embedding. If it doesn't (e.g., with BEiT-L as the visual encoder), should we add another projection layer to downsample the video embedding?
Yes, if the video embedding dimension is not the same as the text embedding dimension, a projection layer is required.
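For concreteness, a minimal sketch of such a projection, assuming BEiT-L's 1024-d visual features and a BERT-base (768-d) text encoder; the module name and dimensions are illustrative, not the repo's actual code:

```python
import torch.nn as nn

# BEiT-L outputs 1024-d patch features, while a BERT-base text encoder
# expects 768-d keys/values in its cross-attention layers, so a linear
# layer bridges the gap before fusion.
vision_width, text_width = 1024, 768
vision_proj = nn.Linear(vision_width, text_width)

# video_embed: (batch, num_tokens, 1024) -> (batch, num_tokens, 768)
# video_embed = vision_proj(video_embed)
```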
Thanks for your reply! I have tried using masked input for all 3 losses, and the results are worse at the beginning of training.
You might need to tune the hyper-parameters, since the defaults are set for our original training strategy.
Thanks for your suggestion!