openai / CLIP

CLIP (Contrastive Language-Image Pretraining): predict the most relevant text snippet given an image

Input Negative Examples #344


cucumberguagua commented 1 year ago

In Figure 1 of the CLIP paper, all input pairs on the diagonal of the similarity matrix are treated as positive examples, and all other (text, image) pairs are treated as negatives. Can we feed in an input matrix with custom positive and negative examples (where positives are not necessarily only on the diagonal)? Also, may I ask where in the code the contrastive loss is calculated and optimized?

jongwook commented 1 year ago

Can we feed in an input matrix with custom positive and negative examples (where positives are not necessarily only on the diagonal)?

It's possible, as long as you feed the data and labels accordingly.

Also, may I ask where in the code the contrastive loss is calculated and optimized?

This repository doesn't have it, since it only provides inference code, but see #83 and this HuggingFace blog for more guidance on training/fine-tuning CLIP.
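For reference, the fine-tuning recipe in #83 boils down to roughly the following sketch (model, images, texts, device, and the optimizer step are assumed to be set up elsewhere):

import torch
import torch.nn as nn

loss_img = nn.CrossEntropyLoss()
loss_txt = nn.CrossEntropyLoss()

# images: (N, 3, 224, 224) preprocessed images; texts: (N, 77) tokenized captions
logits_per_image, logits_per_text = model(images, texts)  # both (N, N)

# with exactly one matching text per image, the positives sit on the diagonal
ground_truth = torch.arange(len(images), dtype=torch.long, device=device)

loss = (loss_img(logits_per_image, ground_truth) +
        loss_txt(logits_per_text, ground_truth)) / 2
loss.backward()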

cucumberguagua commented 1 year ago

Thanks! In #83, the labels (ground_truth) are fed in as a constant "diagonal" vector, ground_truth = torch.arange(len(images), dtype=torch.long, device=device), and the raw logits_per_image and logits_per_text are fed into the total loss function. If we have our own labels/ground-truth vector, is something like the code below the correct way to feed the labels into the loss function?

# full N x N similarity matrices from CLIP
logits_per_image, logits_per_text = self.model(feats, sent)
# softmax over each row, then keep only the diagonal (text_i, image_i) scores
logit_per_text_softmax = logits_per_text.softmax(dim=-1)
logit_text = torch.diagonal(logit_per_text_softmax)
logit_per_image_softmax = logits_per_image.softmax(dim=-1)
logit_image = torch.diagonal(logit_per_image_softmax)
# average the image and text loss terms against the per-pair labels
loss = (loss_img(logit_image, label) + loss_text(logit_text, label)) / 2

jongwook commented 1 year ago

It appears the code still assumes that the diagonal entries are the correct labels (via torch.diagonal). You could consider flattening the logits and using binary cross-entropy over every positive/negative pair, or creating a custom cross-entropy loss whose target probability distribution is uniform over the positive labels.
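A minimal sketch of the first option, assuming target is an N x N float tensor (a name introduced here, not from the repo) with 1.0 at every matching (image_i, text_j) entry and 0.0 elsewhere:

import torch
import torch.nn.functional as F

# target[i, j] = 1.0 if image_i and text_j match, else 0.0  (N x N, float)
logits_per_image, logits_per_text = model(images, texts)  # both (N, N)

loss = (F.binary_cross_entropy_with_logits(logits_per_image, target) +
        F.binary_cross_entropy_with_logits(logits_per_text, target.t())) / 2

One caveat: CLIP's learned logit_scale was tuned for the softmax formulation, so a sigmoid/BCE loss may need its own scaling or bias term.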

cucumberguagua commented 1 year ago

Yes, I converted both logit_per_text_softmax and logit_per_image_softmax to their diagonals on purpose, since our label is a vector. E.g., if labels = [1, 0, 0, 0, 1, 1], where each entry indicates whether (text_i, image_i) match, then since the original logits_per_image and logits_per_text are both 6x6 matrices, I only selected the diagonal, which represents the predicted similarity scores between (text_i, image_i), and threw out the (text_i, image_j) pairs. Is this a correct understanding?

jongwook commented 1 year ago

For contrastive learning, you would be comparing all possible pairs of texts and images (6x6 = 36 pairs in your case), not just the 6 pairs that happen to share the same index. The idea is to learn more by using all possible pairs (i.e., by not throwing out the (text_i, image_j) pairs), and for that you need 6x6 labels. These are diagonal in the default setting, but they can take any form.
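To keep the softmax formulation with non-diagonal positives, a sketch of the second option (assuming label_matrix is the 6x6 binary matrix described above, with label_matrix[i, j] = 1.0 when image_i and text_j match, and every row and column containing at least one positive) is a cross entropy against row-normalized soft targets:

import torch
import torch.nn.functional as F

# normalize each row so the target distribution is uniform over that row's positives
target_img = label_matrix / label_matrix.sum(dim=1, keepdim=True)
target_txt = label_matrix.t() / label_matrix.t().sum(dim=1, keepdim=True)

logits_per_image, logits_per_text = model(images, texts)

# soft-target cross entropy, averaged over the image->text and text->image directions
loss_i = -(target_img * F.log_softmax(logits_per_image, dim=-1)).sum(dim=1).mean()
loss_t = -(target_txt * F.log_softmax(logits_per_text, dim=-1)).sum(dim=1).mean()
loss = (loss_i + loss_t) / 2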

cucumberguagua commented 1 year ago

So in the 6x6 binary label matrix, if there are ones (positive examples) on off-diagonal entries (i.e., a single row has multiple columns with positive examples), do we need to separate these positive examples into different batches to avoid contradicting pairs, as in this comment?

jongwook commented 1 year ago

What vinson2233 suggested in that comment is simpler, and it might be the easier route because you can keep using the same loss formulation as long as you load the batches correctly (i.e., without having to worry about binary or custom cross-entropy functions).
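For completeness, the batch-construction idea can be as simple as making sure no image appears twice in a batch, so the usual diagonal labels remain valid. A sketch (diagonal_friendly_batches is a hypothetical helper, not code from that comment):

def diagonal_friendly_batches(pairs, batch_size):
    """Greedily place (image_id, caption) pairs into batches so that no image
    appears twice in the same batch; each batch then has a one-to-one
    image/text correspondence, and torch.arange(len(batch)) labels stay valid."""
    batches, seen_per_batch = [], []
    for image_id, caption in pairs:
        for batch, seen in zip(batches, seen_per_batch):
            if len(batch) < batch_size and image_id not in seen:
                batch.append((image_id, caption))
                seen.add(image_id)
                break
        else:
            # no existing batch can take this pair; start a new one
            batches.append([(image_id, caption)])
            seen_per_batch.append({image_id})
    return batches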