Closed · auniquesun closed this issue 1 year ago
@auniquesun, check out CLIP's paper (https://arxiv.org/pdf/2103.00020.pdf), page 5, top left corner. You can think of it this way: it's the classic InfoNCE loss. Within a batch of samples, you want the model to maximize the probability of the corresponding pairs; each corresponding pair is the positive sample, and all other samples in the batch are the negatives.
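To make the equivalence concrete, here is a minimal NumPy sketch (not the repo's actual code, which uses PyTorch) showing that InfoNCE over a batch of paired embeddings is exactly cross-entropy on the similarity matrix with labels `arange(N)` — the diagonal entries are the positive pairs:

```python
import numpy as np

def log_softmax(x, axis=-1):
    # numerically stable log-softmax
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

rng = np.random.default_rng(0)
N, d = 4, 8  # batch size, embedding dim (illustrative values)
a = rng.normal(size=(N, d))  # e.g. point cloud embeddings
b = rng.normal(size=(N, d))  # e.g. image embeddings, row i pairs with a[i]
a /= np.linalg.norm(a, axis=1, keepdims=True)
b /= np.linalg.norm(b, axis=1, keepdims=True)
logits = a @ b.T / 0.07  # temperature-scaled cosine similarities

# (1) InfoNCE written out: -log p(positive), positives on the diagonal
lsm = log_softmax(logits, axis=1)
info_nce = -np.mean(np.diag(lsm))

# (2) cross-entropy with labels = arange(N): picks the same diagonal entries
labels = np.arange(N)
cross_entropy = -np.mean(lsm[np.arange(N), labels])

assert np.allclose(info_nce, cross_entropy)
```

In PyTorch this is why `F.cross_entropy(logits, torch.arange(N))` implements the contrastive objective: the "class" for sample `i` is simply its paired sample `i` in the other modality.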
According to the paper, ULIP uses cross-modal contrastive loss to align point cloud features and image/text representations.
However, after reading the code, it seems you use cross-entropy loss to define the loss function, e.g., the class `ULIPWithImageLoss` in `models/losses.py`:
https://github.com/salesforce/ULIP/blob/e3f61ab758b9f485a6c9b0394ecced59773393e0/models/losses.py#L48-L50
I wonder why the cross-entropy loss defined here can be treated as a contrastive loss? According to my understanding, they have different equations and use cases (cross-entropy loss for supervised learning, contrastive loss for unsupervised learning).
It would be appreciated if you could provide more clarification on this!