Closed seyeeet closed 2 years ago
First of all, the equation and code are consistent.
In more details, in the code, we have all_feat, which is 256x513x256, so for each sample the first one is the positive and the rest are the negatives. The seg_logits then will have 256x513 dimension. We then compute the similarity via cosine_similarity (which I think is different from the contrastive loss in eq 1).
Up to here, your understanding is correct.
Cosine similarity is the same as a normalised dot product; it is a way to compute the distance between two vectors. Cosine similarity is typically used because it gives a bounded score from -1 (completely opposite direction) to 1 (same direction); this corresponds to the formulation: r_q \cdot c_k / \tau.
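As a quick check, here is a minimal numpy sketch (the function name is mine) showing that cosine similarity is just the dot product of the normalised vectors, bounded in [-1, 1]:

```python
import numpy as np

def cosine_similarity(a, b):
    # Normalise each vector to unit length, then take the dot product.
    return np.dot(a / np.linalg.norm(a), b / np.linalg.norm(b))

r_q = np.array([1.0, 2.0, 3.0])
c_k = np.array([2.0, 4.0, 6.0])   # same direction as r_q, different magnitude

print(cosine_similarity(r_q, c_k))    # ≈ 1.0 (same direction)
print(cosine_similarity(r_q, -c_k))   # ≈ -1.0 (opposite direction)
```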
Cross entropy is then the same as the negative log-likelihood loss: cross_entropy(pred, gt) = - gt * log(pred). Here pred and gt are both probability distributions, which we obtain via softmax. F.cross_entropy has the softmax operation built in, which is where the -log( exp(..) / (exp(..) + exp(..)) ) term comes from. I would suggest diving further into the PyTorch documentation for a better understanding.
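To see the equivalence concretely, here is a small numpy sketch (logit values made up) of the softmax-then-negative-log computation that F.cross_entropy performs internally for a class-index target:

```python
import numpy as np

def cross_entropy_with_softmax(logits, target):
    # Numerically stable log-softmax, as F.cross_entropy applies internally.
    shifted = logits - logits.max()
    log_prob = shifted - np.log(np.exp(shifted).sum())
    return -log_prob[target]

logits = np.array([2.0, 0.5, -1.0])   # similarity scores: positive pair first
loss = cross_entropy_with_softmax(logits, 0)

# The same value written out as -log( exp(l_0) / sum_k exp(l_k) )
manual = -np.log(np.exp(logits[0]) / np.exp(logits).sum())
print(np.isclose(loss, manual))
```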
Thank you very much for the explanation. Can you please tell me which part of the code corresponds to the numerator of Equation 1, and which part corresponds to the denominator?
That is the softmax embedded in the cross_entropy function: https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html
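A tiny numpy sketch (similarity values made up, temperature tau assumed) makes the mapping explicit: with target index 0, the numerator of Eq. 1 is the exponentiated positive similarity and the denominator is the sum over the positive plus all negatives, which is exactly the softmax ratio inside cross_entropy:

```python
import numpy as np

tau = 0.5
# One pixel's cosine similarities to 1 positive + 4 negatives (made-up numbers).
sims = np.array([0.9, 0.1, -0.3, 0.2, 0.0])

numerator = np.exp(sims[0] / tau)        # positive pair, index 0
denominator = np.exp(sims / tau).sum()   # positive + all negatives
eq1_loss = -np.log(numerator / denominator)

# cross_entropy(sims / tau, target=0) computes the same ratio via softmax.
softmax = np.exp(sims / tau) / np.exp(sims / tau).sum()
ce_loss = -np.log(softmax[0])
print(np.isclose(eq1_loss, ce_loss))
```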
oh I see, that makes things very clear, thank you!
I am kind of new to contrastive learning, and although I completely follow the math, I get confused when I look at the code (by code I mean contrastive-loss code in general, not only yours :) ).
So in the paper, Eq. 1 is the ReCo loss, which is the pixel-wise contrastive loss, but in the code here I cannot understand why it is computed differently from the way Eq. 1 is written. Is this a common thing in practice?
In more details, in the code, we have `all_feat`, which is 256x513x256, so for each sample the first one is the positive and the rest are the negatives. The `seg_logits` then will have 256x513 dimensions. We then compute the similarity via `cosine_similarity` (which I think is different from the contrastive loss in Eq. 1). Then, in the next step, based on `F.cross_entropy`, we want all of them to have label zeros, because they have to be the same as the positive one. My confusion is why we don't use Eq. 1 as it is written in the paper, and why we use `F.cross_entropy` instead. I feel this is not exactly the same as Eq. 1. Can you please help me understand the relation?
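To make the pipeline described above concrete, here is a numpy sketch of the shape flow; small dimensions stand in for the actual 256x513x256 tensors, and the temperature tau is an assumption on my part:

```python
import numpy as np

rng = np.random.default_rng(0)
B, K, D, tau = 4, 6, 8, 0.5   # in the repo: B=256 queries, K=513 (1 pos + 512 neg), D=256 dims

anchor = rng.normal(size=(B, 1, D))    # query features r_q
all_feat = rng.normal(size=(B, K, D))  # index 0 = positive, 1..K-1 = negatives

# Cosine similarity along the feature dim -> seg_logits of shape (B, K).
num = (anchor * all_feat).sum(-1)
den = np.linalg.norm(anchor, axis=-1) * np.linalg.norm(all_feat, axis=-1)
seg_logits = num / den

# Cross-entropy with all-zero labels: softmax over K, take -log prob of index 0.
logits = seg_logits / tau
logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
loss = -log_softmax[:, 0].mean()   # matches Eq. 1 averaged over queries
print(seg_logits.shape, float(loss))
```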