moein-shariatnia / OpenAI-CLIP

Simple implementation of OpenAI CLIP model in PyTorch.

Problems with Loss Function #16

Open cem-sirin opened 5 months ago

cem-sirin commented 5 months ago

Hi there,

I was searching for how people implement CLIP and found this repo. Here are the problems/differences with the loss function, based on the CLIP paper:

1) If you don't add L2 normalization before the dot product, you won't get cosine similarity. Raw dot products are an unbounded quantity, unlike cosine similarity.

2) They use an identity matrix as the targets (if you want to think in terms of a one-hot matrix).

3) They also use symmetric cross entropy, which is a linear function of (standard) cross entropy and reverse cross entropy. This is quite important, since the reverse cross entropy part provides the negative sampling that is essential for contrastive learning. If you don't have such a mechanism, everything will collapse (you can search for "dimensional collapse" if you want to learn more).
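
For reference, here is a minimal PyTorch sketch of what the paper's pseudocode describes (the fixed temperature here is just a placeholder; CLIP actually learns it as a parameter):

```python
import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, temperature=0.07):
    # (1) L2-normalize so the dot product becomes a bounded cosine similarity
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # pairwise cosine similarities scaled by temperature, shape (batch, batch)
    logits = image_emb @ text_emb.T / temperature

    # (2) hard targets: the i-th image matches the i-th text (identity matrix)
    targets = torch.arange(logits.size(0), device=logits.device)

    # (3) cross entropy in both directions, averaged
    loss_i = F.cross_entropy(logits, targets)    # image -> text
    loss_t = F.cross_entropy(logits.T, targets)  # text -> image
    return (loss_i + loss_t) / 2
```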

Aaand, on a final note: the goal of CLIP's contrastive learning is not only to align the latent spaces. The assumption is that, via contrastive learning, you will get encodings that are richer in context, which is open to debate. You mention that you trained your model for 24 minutes on the GPU, and I think it's probable that your model achieves latent space alignment and nothing further. Moreover, if you had trained longer, you would have faced dimensional collapse due to your loss configuration.

Again, thanks to OpenAI for not writing the loss function out clearly and only providing cryptic pseudo-code. I am quite surprised that this repo is used in conference proceedings and has ~400 stars. It kinda shows that a sizable portion of the ML community is just copy-pasting code and hoping that stuff will eventually work.

YichaoCai1 commented 5 months ago

@cem-sirin I partly agree with your opinion. Regarding the label term for calculating the cross entropy, both the CLIP paper and the OpenCLIP code (https://github.com/mlfoundations/open_clip/blob/main/src/open_clip/loss.py#L92) employ hard targets. Moreover, recent work (https://openreview.net/forum?id=U_2kuqoTcB) elucidates the theoretical underpinnings of CLIP's effectiveness. Although the owner of this repo provides an explanation for the loss in the README, I remain unconvinced by the rationale for modifying it.

cem-sirin commented 5 months ago

Hi @YichaoCai1, I suppose you are referring to the paragraph that starts with "Here's why I didn't use a simpler approach..." as the explanation for the loss function.

In his explanation, he gives his reason for not using hard targets and instead creating (let's say) soft targets: so that repeated images in the same batch are not penalized. The problem is that these targets are now part of the backward propagation, which was not the case for the hard targets. I can't imagine such a thing working. Have you run a long enough training session, @YichaoCai1? I am curious whether it works.
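
To make the point concrete, here is a small illustrative sketch (not the repo's exact code) of why targets built from the embeddings themselves end up in the backward pass unless they are detached:

```python
import torch
import torch.nn.functional as F

batch, dim, temperature = 8, 64, 1.0
img_emb = torch.randn(batch, dim, requires_grad=True)
txt_emb = torch.randn(batch, dim, requires_grad=True)

# Soft targets derived from the embeddings themselves
sim = (img_emb @ img_emb.T + txt_emb @ txt_emb.T) / 2
soft_targets = F.softmax(sim / temperature, dim=-1)
print(soft_targets.requires_grad)  # True: gradients will flow through the targets

# Hard targets, by contrast, are constants with respect to the model
hard_targets = torch.arange(batch)
# (soft_targets.detach() would similarly remove them from the graph)
```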

Also (I only skimmed the article, so I may be wrong), I don't think the loss function they present is meant to replicate CLIP's loss function; they only refer to CLIP in the introduction. Anyway, I can share the loss from a CLIP-like model I was implementing for something else, once I translate it to PyTorch, so that people can point out which exact lines they disagree with.

leo4life2 commented 3 months ago

Hey @cem-sirin, thanks for posting this. I was confused and was wondering whether this repo's implementation of the loss function would really work as well.

About your comment on symmetric cross entropy: doesn't this repo implement symmetric cross entropy, since it calculates the loss in both directions?

```python
texts_loss = cross_entropy(logits, targets, reduction='none')       # per-text loss term
images_loss = cross_entropy(logits.T, targets.T, reduction='none')  # per-image loss term
loss = (images_loss + texts_loss) / 2.0  # shape: (batch_size)
```

YichaoCai1 commented 3 months ago

Hi @cem-sirin,

Apologies for my delayed response. I haven't executed the code in this repository, as the modifications presented have not convinced me of their efficacy. Additionally, the paper I mentioned serves as further evidence to support my opinion that adhering to the original implementation of the symmetric InfoNCE loss is essential, a principle that, IMHO, is not followed in this repository.

wyh196646 commented 3 months ago

The loss function implemented in this repo really does have some problems; the model does not work.