cucumberguagua opened this issue 1 year ago
In Figure 1 of the CLIP paper, all input pairs are treated as positive examples (the diagonal of the matrix), and all unseen (text, image) pairs are treated as negatives. Can we feed in an input matrix with custom positive and negative examples (where the positives are not necessarily on the diagonal only)?
It's possible, as long as you feed the data and labels accordingly.
Also, may I ask where in the code the contrastive loss is calculated and optimized?
This repository doesn't have it since it only provides inference code, but see #83 and this HuggingFace blog for more guidance on training/fine-tuning CLIP.
Thanks! In #83, the labels (`ground_truth`) are fed in as a constant "diagonal vector":

```python
ground_truth = torch.arange(len(images), dtype=torch.long, device=device)
```

and the raw `logits_per_image` and `logits_per_text` are fed into the total loss function.
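Putting it together, the loss setup described there looks roughly like this (a minimal sketch; `model`, `images`, `texts`, and `device` are placeholders and not taken verbatim from #83):

```python
import torch
import torch.nn as nn

loss_img = nn.CrossEntropyLoss()
loss_txt = nn.CrossEntropyLoss()

# Both logit matrices have shape (batch_size, batch_size)
logits_per_image, logits_per_text = model(images, texts)

# Index i is the correct class for row i, i.e. the "diagonal" labels
ground_truth = torch.arange(len(images), dtype=torch.long, device=device)

total_loss = (loss_img(logits_per_image, ground_truth)
              + loss_txt(logits_per_text, ground_truth)) / 2
```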
If we have our own labels/ground-truth vector, is something like the code below the correct way to feed labels into the loss function?
```python
logits_per_image, logits_per_text = self.model(feats, sent)
logit_per_text_softmax = logits_per_text.softmax(dim=-1)
logit_text = torch.diagonal(logit_per_text_softmax)
logit_per_image_softmax = logits_per_image.softmax(dim=-1)
logit_image = torch.diagonal(logit_per_image_softmax)
loss = (loss_img(logit_image, label) + loss_text(logit_text, label)) / 2
```
It appears the code is still assuming the diagonal entries are the correct labels (from `torch.diagonal`). You could consider flattening the logits and using binary cross entropy for every positive/negative pair, or creating a custom cross entropy function where the target probability distribution is uniform over the positive labels.
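For the first option, since binary cross entropy is applied element-wise, the matrix doesn't even need explicit flattening. A minimal sketch, assuming `label_matrix` is an N×N float tensor of 0s and 1s marking the positive (image_i, text_j) pairs (that tensor is an assumption, not something from your code):

```python
import torch.nn.functional as F

def binary_pairwise_loss(logits_per_image, logits_per_text, label_matrix):
    # Treat every (image, text) cell as an independent binary classification:
    # positive pairs should get high logits, negative pairs low logits.
    loss_i = F.binary_cross_entropy_with_logits(logits_per_image, label_matrix)
    loss_t = F.binary_cross_entropy_with_logits(logits_per_text, label_matrix.t())
    return (loss_i + loss_t) / 2
```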
Yes, I took the diagonals of both `logit_per_text_softmax` and `logit_per_image_softmax` on purpose, since our `label` is a vector. For example, if `labels = [1, 0, 0, 0, 1, 1]`, where each entry indicates whether (text_i, image_i) match, and the original `logits_per_image` and `logits_per_text` are both 6×6 matrices, then I only selected the diagonal, which represents the predicted similarity scores between (text_i, image_i), and threw out the (text_i, image_j) pairs. Is this a correct understanding?
For contrastive learning, you would be comparing all possible pairs of texts and images (6×6 = 36 pairs, in your case), not just the 6 pairs that happen to have the same index. The idea is to learn more from all possible pairs (i.e. by not throwing out the (text_i, image_j) pairs), and for that you'd need 6×6 labels. These are diagonal in the default setting, but they can be in any form.
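For the custom cross entropy variant mentioned earlier, a sketch might look like this, assuming a 6×6 (more generally N×N) 0/1 `label_matrix` in which every row and column has at least one positive (again, `label_matrix` is an assumption, not part of your code):

```python
import torch.nn.functional as F

def soft_target_contrastive_loss(logits_per_image, logits_per_text, label_matrix):
    label_matrix = label_matrix.float()

    # Normalize each row of the 0/1 label matrix into a probability
    # distribution that is uniform over that row's positive entries.
    targets_i = label_matrix / label_matrix.sum(dim=1, keepdim=True)
    targets_t = label_matrix.t() / label_matrix.t().sum(dim=1, keepdim=True)

    # Cross entropy between the model's softmax over each row and the soft targets
    loss_i = -(targets_i * F.log_softmax(logits_per_image, dim=1)).sum(dim=1).mean()
    loss_t = -(targets_t * F.log_softmax(logits_per_text, dim=1)).sum(dim=1).mean()
    return (loss_i + loss_t) / 2
```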
So in the 6×6 binary label matrix, if there are ones (positive examples) on non-diagonal entries (i.e. a row has multiple columns with positive examples), then we need to separate these positive examples into different batches to avoid contradicting pairs, as in this comment?
What vinson2233 suggested in the comment is simpler, and it might be easier to do it that way because you'd be able to keep using the same loss formulation as long as you load the batches correctly (i.e. without having to worry about binary or custom cross-entropy functions).
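Roughly, "loading the batches correctly" means sampling so that no image (or text) appears twice inside one batch, so the diagonal labels stay valid. A hypothetical sketch of such a batching step (illustrative only, not code from the linked comment):

```python
import random

def conflict_free_batches(pairs, batch_size):
    """pairs: list of (image_id, text_id) positive pairs; an image may have
    several matching texts. Fill batches greedily so that no image_id or
    text_id repeats within a batch; conflicting pairs go to later batches."""
    remaining = list(pairs)
    random.shuffle(remaining)
    batches = []
    while remaining:
        batch, used_img, used_txt, deferred = [], set(), set(), []
        for img, txt in remaining:
            if len(batch) < batch_size and img not in used_img and txt not in used_txt:
                batch.append((img, txt))
                used_img.add(img)
                used_txt.add(txt)
            else:
                deferred.append((img, txt))
        batches.append(batch)
        remaining = deferred
    return batches
```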