openai / CLIP

CLIP (Contrastive Language-Image Pretraining): predict the most relevant text snippet given an image
MIT License

Clarifying the training setup #42

Closed nikky4D closed 3 years ago

nikky4D commented 3 years ago

I'm confused as to your training loss and setup.

For the setup, you say:

We remove the non-linear projection between the representation and the contrastive embedding space, a change which was introduced by Bachman et al. (2019) and popularized by Chen et al. (2020b). We use only a linear projection to map from each encoder’s representation to the multi-modal embedding space. We did not notice a difference in training efficiency between the two versions and speculate that non-linear projections may be co-adapted with details of current image only self supervised representation learning methods.

Do you use the non-linear projection in pretraining and then remove it after training, as in Chen, 2020b, replacing it with a linear projection? Or do you use a linear projection in pretraining and keep that linear projection after training? Also, can you explain the speculation mentioned above? I don't understand what you mean there.

For the loss: Can you confirm that the loss used in training takes the same form as the loss function in Zheng et al., 2021?

jongwook commented 3 years ago

Sorry about the confusion; you could read "we remove" as "we do not use". The projection is linear during training and evaluation.

Looks like we're missing a preposition in that sentence. The speculation was that nonlinear projections might be effective only in self-supervised methods but not in contrastive multimodal settings like CLIP.

The loss function takes the same form as ConVIRT by Zhang et al. (2020).
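For concreteness, the loss form being referred to can be sketched as a symmetric cross-entropy over cosine-similarity logits, as in the paper's NumPy-style pseudocode. This is a minimal illustrative sketch, not the repository's training code: the function names are invented here, and the fixed `temperature` stands in for CLIP's learned temperature parameter.

```python
import numpy as np

def log_softmax(x, axis):
    # numerically stable log-softmax
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over an [n, n] similarity matrix.

    Sketch of the ConVIRT-style loss form discussed above; in CLIP the
    temperature is a learned parameter, fixed here for illustration.
    """
    # L2-normalize so dot products become cosine similarities
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # [n, n] logits; matched image/text pairs sit on the diagonal
    logits = image_emb @ text_emb.T / temperature

    # cross-entropy in both directions (image->text and text->image)
    loss_img_to_txt = -np.mean(np.diag(log_softmax(logits, axis=1)))
    loss_txt_to_img = -np.mean(np.diag(log_softmax(logits, axis=0)))
    return (loss_img_to_txt + loss_txt_to_img) / 2
```

The two cross-entropy terms treat each row (image queries over texts) and each column (text queries over images) as an n-way classification problem whose correct class is the diagonal entry.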

nikky4D commented 3 years ago

Thanks for the update. One last question to clarify: your linear function is just a weight matrix, i.e. the `W_i` and `W_t` variables in the pseudocode?

jongwook commented 3 years ago

Yes! That corresponds to here, here, and here in the model code.
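In other words, the projection per modality is a plain bias-free matrix multiply. A minimal NumPy sketch of that step (the dimensions below are illustrative, not CLIP's actual sizes, and `I_f`/`T_f` are random stand-ins for the encoder outputs):

```python
import numpy as np

rng = np.random.default_rng(0)

# illustrative dimensions (not CLIP's actual sizes)
d_img, d_txt, d_emb, n = 64, 32, 16, 8

# the linear projections: plain weight matrices, no bias, no nonlinearity
W_i = 0.02 * rng.normal(size=(d_img, d_emb))
W_t = 0.02 * rng.normal(size=(d_txt, d_emb))

# stand-ins for the image-encoder and text-encoder outputs
I_f = rng.normal(size=(n, d_img))
T_f = rng.normal(size=(n, d_txt))

# map each encoder's representation into the shared multimodal space
I_e = I_f @ W_i
T_e = T_f @ W_t
```

Both modalities end up in the same `d_emb`-dimensional space, which is what makes the image-text similarity matrix in the contrastive loss well-defined.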

nikky4D commented 3 years ago

Thank you very much. I appreciate it.