nikky4D closed 3 years ago
Sorry about the confusion; you could read "we remove" as "we do not use". The projection is linear during training and evaluation.
Looks like we're missing a preposition in that sentence. The speculation was that nonlinear projections might be effective only in self-supervised methods but not in contrastive multimodal settings like CLIP.
The loss function takes the same form as ConVIRT by Zhang et al. (2020).
Thanks for the update. One last question to clarify: is your linear function the weight matrices, i.e. the W_i and W_t variables in the pseudocode?
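For readers following the thread, here is a minimal sketch of what "a linear projection followed by a symmetric contrastive loss" can look like. It is not the authors' code. The names `I_f`, `T_f`, `W_i`, and `W_t` follow the pseudocode being discussed; the helper functions, the fixed `temperature=0.07`, and the equal weighting of the two loss directions are assumptions made for illustration.

```python
import math

def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def l2_normalize(rows):
    """Scale each row to unit Euclidean length (rows must be non-zero)."""
    out = []
    for row in rows:
        norm = math.sqrt(sum(x * x for x in row))
        out.append([x / norm for x in row])
    return out

def cross_entropy(logits, labels):
    """Mean softmax cross-entropy; labels[i] is the target column of row i."""
    total = 0.0
    for row, label in zip(logits, labels):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[label]
    return total / len(logits)

def symmetric_contrastive_loss(I_f, T_f, W_i, W_t, temperature=0.07):
    """Hypothetical helper: linear projection + symmetric contrastive loss."""
    # Linear projection only: one matrix multiply per modality,
    # with no hidden layer and no nonlinearity.
    I_e = l2_normalize(matmul(I_f, W_i))
    T_e = l2_normalize(matmul(T_f, W_t))
    # Pairwise cosine similarities, scaled by an assumed fixed temperature.
    T_e_T = [list(col) for col in zip(*T_e)]
    logits = [[s / temperature for s in row] for row in matmul(I_e, T_e_T)]
    # The i-th image is paired with the i-th text.
    labels = list(range(len(logits)))
    logits_T = [list(col) for col in zip(*logits)]
    loss_i = cross_entropy(logits, labels)    # image -> text direction
    loss_t = cross_entropy(logits_T, labels)  # text -> image direction
    return (loss_i + loss_t) / 2
```

In this sketch, `W_i` and `W_t` are the only parameters between the encoder features and the embedding space, which is what "linear projection" means here. A nonlinear projection head would instead insert at least one hidden layer with an activation before the normalization step.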
Thank you very much. I appreciate it.
I'm confused as to your training loss and setup.
For the setup, you say:
Do you use a nonlinear projection in pretraining and then remove it after training, as in Chen et al. (2020b), replacing it with a linear projection? Or do you use a linear projection in pretraining and keep it after training? Also, can you explain the speculation you mention above? I don't understand what you mean there.
For the loss: can you clarify whether the loss used in training has the same form as the loss function in Zheng et al., 2021?