Hi
Thank you very much for sharing the code for this amazing work.
I have some naïve questions regarding the design choices for the ReCo loss.
1) Is there any specific reason you put Lines 130-136 under `torch.no_grad()`?
I'd appreciate it if you could share the insight behind that decision.
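For context, my (possibly wrong) understanding of `torch.no_grad()` is simply that anything computed inside the block is detached from the autograd graph and so never contributes to the backward pass, e.g.:

```python
import torch

# Minimal illustration (not the ReCo code itself): tensors produced
# inside a torch.no_grad() block do not track gradients.
x = torch.randn(3, requires_grad=True)
with torch.no_grad():
    y = x * 2   # y.requires_grad is False: detached from the graph
z = x * 2       # outside the block: z.requires_grad is True
```

So I assume the intent is that those lines should not receive gradients, but I would like to confirm.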
2) Based on your experiments/experience, how does performance change if the temperature is made smaller (e.g. 0.05) or larger (e.g. 1-2)?
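To make sure I understand where the temperature enters, here is a toy InfoNCE-style loss I have in mind (hypothetical shapes and names, not your exact implementation): the temperature divides all similarity logits before the softmax, so a smaller value sharpens the distribution over negatives.

```python
import torch
import torch.nn.functional as F

def info_nce(query, positive, negatives, temperature=0.5):
    """Toy InfoNCE sketch: query/positive are (N, D),
    negatives is (N, K, D); all assumed L2-normalized."""
    pos_logit = (query * positive).sum(dim=1, keepdim=True)   # (N, 1)
    neg_logit = torch.einsum('nd,nkd->nk', query, negatives)  # (N, K)
    # Temperature scales every logit, controlling softmax sharpness.
    logits = torch.cat([pos_logit, neg_logit], dim=1) / temperature
    labels = torch.zeros(query.size(0), dtype=torch.long)     # positive at index 0
    return F.cross_entropy(logits, labels)
```

My guess is that a very small temperature would make training focus heavily on the hardest negatives, but I am curious what you observed in practice.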
3) What is the strategy behind choosing the number of negative samples and queries? E.g. you use 512 negatives and 256 queries. What happens if we change those, and is there a rule of thumb we should keep in mind when tuning these numbers?
4) When I use a fully supervised model, the ReCo loss still improves the results. My question is: is the projection layer necessary in the fully supervised setting? Could you explain the reasoning in either case? Specifically, I am not sure why we still need `self.representation` in the fully supervised case, rather than using the features just before the classifier (i.e. `x = self.resnet_layer4(x)`).
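To make question 4 concrete, this is the contrast I have in mind (a hypothetical sketch; the layer names and dimensions are my guesses based on the repo, not your exact architecture):

```python
import torch
import torch.nn as nn

class SegHead(nn.Module):
    """Sketch of the two options I am comparing for the contrastive features."""
    def __init__(self, in_dim=512, num_classes=21, rep_dim=256):
        super().__init__()
        self.classifier = nn.Conv2d(in_dim, num_classes, kernel_size=1)
        # Option A: a separate non-linear projection head for the ReCo loss.
        self.representation = nn.Sequential(
            nn.Conv2d(in_dim, 256, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, rep_dim, kernel_size=1),
        )

    def forward(self, x):
        logits = self.classifier(x)
        rep = self.representation(x)  # Option A: projected features
        # Option B would instead feed x itself (the backbone features
        # before the classifier) into the contrastive loss.
        return logits, rep
```

I.e., in the fully supervised case, is Option A still preferable to Option B, and why?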
Thanks a lot in advance for your insight and explanation :)