jeromerony / dml_cross_entropy

Code for the paper "A unifying mutual information view of metric learning: cross-entropy vs. pairwise losses" (ECCV 2020 - Spotlight)
https://arxiv.org/abs/2003.08983
BSD 3-Clause "New" or "Revised" License

Link between the center loss and conditional cross entropy #9

Closed · needylove closed 2 years ago

needylove commented 2 years ago

Dear authors,

Thanks for your wonderful work. I really like it, yet I am confused about why the center loss can be interpreted as a conditional cross-entropy between \hat{Z} and \bar{Z}.

Might I have your kind reply? Thanks.

[image attachment: the relevant equation from the paper]
mboudiaf commented 2 years ago

Hey @needylove,

Thanks for your interest. That's a good question. It's been a while, but I think the following proof is correct.

[image attachment: proof]
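In short (this is a sketch of the argument, not a verbatim transcription of the image), assuming as in the paper that the centers c_k parametrize a Gaussian model \bar{Z} | Y=k ~ N(c_k, I):

```latex
% Conditional cross-entropy between the feature distribution \hat{Z}
% and the model distribution \bar{Z}, given Y:
\mathcal{H}(\hat{Z}; \bar{Z} \mid Y)
  = -\sum_k p(Y{=}k) \, \mathbb{E}_{\hat{Z} \mid Y=k}\big[\log q(\hat{Z} \mid Y{=}k)\big]
% With the Gaussian model, -\log q(z \mid Y{=}k) = \tfrac{1}{2}\|z - c_k\|^2 + \text{const}, so:
  = \sum_k p(Y{=}k) \, \mathbb{E}_{\hat{Z} \mid Y=k}\big[\tfrac{1}{2}\|\hat{Z} - c_k\|^2\big] + \text{const}
% Monte-Carlo estimate over a batch of N samples, with p(Y{=}k) \approx |\{i : y_i = k\}| / N:
  \approx \frac{1}{2N} \sum_k \sum_{i : y_i = k} \|z_i - c_k\|^2 + \text{const}
% i.e. the center loss, up to the additive constant and the 1/N factor.
```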

Let me know if anything seems unclear. Best.

needylove commented 2 years ago

Dear @mboudiaf,

Thanks for your kind reply! Just one more question: why is sampling from the feature distribution equal to the sum over all the z_i with y_i = k? Since we do not know the concrete distribution of \hat{Z} | Y, we may not be able to sample \hat{Z} directly. Besides, \sum_{i : y_i = k} seems to cover all the data belonging to class k, rather than a sampled part.

Thanks. Best.

mboudiaf commented 2 years ago

Hi @needylove,

We do not know the true density of \hat{Z} | Y=k, but we can sample from it (this simply corresponds to extracting features for images from class k). The sum over "i" should be understood as "you sum over all the indices i in the current batch of samples such that y_i = k". The expectation is therefore replaced by this sum through a Monte-Carlo empirical estimate. Is it clearer? Best!
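To make this concrete, here is a minimal PyTorch sketch of that Monte-Carlo estimate (the names and signature are illustrative, not the actual code of this repo):

```python
import torch

def center_loss(features, labels, centers):
    """Monte-Carlo estimate of E_{Z|Y=k}[ 1/2 ||Z - c_k||^2 ], averaged over the batch.

    features: (N, d) batch of embeddings z_i = f(x_i)
    labels:   (N,)   class indices y_i
    centers:  (K, d) class centers c_k
    """
    # Gather the center c_{y_i} for each sample; the sum over all i in the
    # batch with y_i = k is implicit in indexing the centers by the labels.
    diffs = features - centers[labels]           # (N, d)
    return 0.5 * diffs.pow(2).sum(dim=1).mean()  # (1/N) * sum_i 1/2 ||z_i - c_{y_i}||^2

# Usage: with a batch drawn i.i.d. from p(X, Y), this batch average replaces
# the expectation over \hat{Z} | Y (up to the additive constant).
features = torch.randn(32, 128)
labels = torch.randint(0, 10, (32,))
centers = torch.randn(10, 128)
loss = center_loss(features, labels, centers)
```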

needylove commented 2 years ago

Dear @mboudiaf,

Much clearer with your kind help. Yet, I am still missing the link between the distribution of the input X and \hat{Z} | Y=k. It seems a batch of samples is drawn from the distribution of X, and although each sampled x corresponds to a \hat{z}, I do not understand why we can treat that \hat{z} as a sample from \hat{Z} | Y=k. In other words, suppose x' \in X has a higher probability of being sampled, so the corresponding \hat{z}' also has a higher probability of being sampled; yet \hat{z}' may not have a high probability under the distribution \hat{Z} | Y=k.

Thanks! Best.