Closed: airogachev closed this issue 9 months ago
Fine-tuning with a contrastive loss doesn't usually work by default (with the clip or siglip loss). See other issues on the same topic
On Sun, Nov 19, 2023, 10:29 Alexander Rogachev wrote:
I tune ViT-B-32 using the SigLip implementation. The loss value decreases, but if I check the quality of img-to-text matching on COCO/CrossModal, r@k for any k decreases to zero. And I can't figure out whether it happens due to some issues during training or whether I have some problems with the evaluation.
1. Recall calculations are performed using scores = emb1 @ emb2.T. Basically, it seems that scale and bias should not affect the results, as we take the top_k results based on scores (see the sanity-check sketch below). But I'm a bit confused by the discussion here: #716 https://github.com/mlfoundations/open_clip/issues/716
2. If I change the way we calculate loss and logits from the neighbour_exchange_bidir_with_grad approach to the same idea that is used for the vanilla loss:
def get_ground_truth(self, device, dtype, num_logits, negative_only=False) -> torch.Tensor:
    # -1 everywhere; +1 on the diagonal (the positive pairs) unless negative_only.
    labels = -torch.ones((self.world_size * num_logits, self.world_size * num_logits), device=device, dtype=dtype)
    if not negative_only:
        labels = 2 * torch.eye(self.world_size * num_logits, device=device, dtype=dtype) + labels
    return labels

def get_logits(self, image_features, text_features, logit_scale, logit_bias=None):
    if self.world_size > 1:
        all_image_features, all_text_features = gather_features(
            image_features, text_features,
            False, self.gather_with_grad, self.rank, self.world_size, self.use_horovod)
        logits = logit_scale * all_image_features @ all_text_features.T
    else:
        logits = logit_scale * image_features @ text_features.T
    if logit_bias is not None:
        logits += logit_bias
    return logits

def _loss(self, image_features, text_features, logit_scale, logit_bias=None, negative_only=False):
    logits = self.get_logits(image_features, text_features, logit_scale, logit_bias)
    labels = self.get_ground_truth(
        image_features.device,
        image_features.dtype,
        image_features.shape[0],
        negative_only=negative_only,
    )
    loss = -F.logsigmoid(labels * logits).sum() / image_features.shape[0]
    return loss
the quality doesn't decrease, but the results are no better than with the vanilla loss.
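(As a quick illustration of what get_ground_truth produces: for world_size * num_logits = 3, the label matrix is +1 on the diagonal and -1 elsewhere.)

import torch

labels = 2 * torch.eye(3) - torch.ones(3, 3)
print(labels)
# tensor([[ 1., -1., -1.],
#         [-1.,  1., -1.],
#         [-1., -1.,  1.]])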
So, I'd like to understand whether the quality evaluation procedure should be changed or whether there are some issues with the multi-GPU setting in the case of SigLip.
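(To double-check point 1 above, here is a minimal self-contained sketch, not from the repo, showing that a positive scale and a constant bias cannot change top-k retrieval; recall_at_k and all tensors are illustrative.)

import torch
import torch.nn.functional as F

def recall_at_k(scores, k):
    # For each row, check whether the matching column (same index) is in the top-k.
    topk = scores.topk(k, dim=1).indices
    targets = torch.arange(scores.shape[0]).unsqueeze(1)
    return (topk == targets).any(dim=1).float().mean().item()

emb1 = F.normalize(torch.randn(100, 512), dim=-1)
emb2 = F.normalize(torch.randn(100, 512), dim=-1)

scores = emb1 @ emb2.T
shifted = 10.0 * scores - 10.0  # any positive scale and any constant bias

# x -> a * x + b with a > 0 preserves the ordering of every row, so recall is identical:
assert recall_at_k(scores, 5) == recall_at_k(shifted, 5)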
See other issues on the same topic
Could you please provide any references that you mention? It's not that clear why img2text and text2img matching, performed on data that is not related to the tuning, should be so dramatically bad compared to the results I achieve with the vanilla loss. I don't face the same problem if I start from pretrained weights with the vanilla loss.
For example #740
Thanks. The weight-decay-related trick from the original paper should definitely be applied in practice; I'll try experimenting with it and report the results. However, the tuned version with WD doesn't seem to be as bad as mine. :^(
Just to verify the problem with the similarity manually, I used this notebook to calculate probs with the tuned model and visualize them.
Calculating it like this:
similarity = text_features.cpu().numpy() @ image_features.cpu().numpy().T
gives the following results
If I change the way I calculate similarity according to https://github.com/mlfoundations/open_clip/issues/716
similarity = torch.sigmoid(text_features.cpu() @ image_features.cpu().T * model.logit_scale.exp().detach() + model.logit_bias.detach()).numpy()
all the results are near zero
Thus, it seems to me that there are some problems related to the fitting procedure, not to the evaluation procedure.
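(A side note on the evaluation half of this: sigmoid(scale * s + bias) is strictly increasing in s for a positive scale, so both formulas above must rank candidates identically; a quick check with made-up features, using the paper's scale 10 and bias -10.)

import torch
import torch.nn.functional as F

text_features = F.normalize(torch.randn(8, 512), dim=-1)
image_features = F.normalize(torch.randn(8, 512), dim=-1)

raw = text_features @ image_features.T
probs = torch.sigmoid(raw * 10.0 - 10.0)

# The transform is elementwise monotone, so per-row rankings coincide
# even though every probability is far below 0.5:
assert torch.equal(raw.argsort(dim=1), probs.argsort(dim=1))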
P.S. It's not clear why we use model.logit_scale.exp() with exp(), as we don't apply exp() inside the loss calculations.
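(My guess: the model stores the scale in log space and the loss is handed the already-exponentiated value; a sketch of that assumption, with the SigLIP paper's init values t' = log 10 and b = -10.)

import numpy as np
import torch
import torch.nn as nn

# The raw parameter lives in log space (nicer to optimize); applying .exp()
# recovers the multiplicative temperature that actually scales the logits.
logit_scale = nn.Parameter(torch.ones([]) * np.log(10.0))
logit_bias = nn.Parameter(torch.ones([]) * -10.0)

print(logit_scale.exp().item())  # ~10.0, what the logits get multiplied by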
I tried to print the losses during training and noticed the following. Here is a straightforward snippet that prints the per-exchange losses:
res = []
for i in range(self.world_size - 1):
    # Receive the next batch of text features from the left neighbour
    # (they originate from rank (self.rank - 1 - i) % self.world_size).
    text_features_from_left = neighbour_exchange_with_grad(
        left_rank, right_rank, text_features_to_right)
    l = self._loss(
        image_features,
        text_features_from_left,
        logit_scale,
        logit_bias,
        negative_only=True,
    )
    loss += l
    res.append(l.item())  # track each negative-only term separately
    text_features_to_right = text_features_from_left
print(res, 'Rank = ', self.rank)
And here are the results that I got:
[354.88958740234375, 354.88958740234375, 354.88958740234375] Rank = 2
-------------------------------
[5677.8037109375, 354.88958740234375, 354.88958740234375] Rank = 1
-------------------------------
[354.88958740234375, 354.88958740234375, 354.88958740234375] Rank = 0
-------------------------------
[5458.8837890625, 5226.5947265625, 4893.4765625] Rank = 3
-------------------------------
[4701.7587890625, 5425.28125, 354.8894958496094] Rank = 2
-------------------------------
[5149.31640625, 354.8894958496094, 354.8894958496094] Rank = 1
-------------------------------
[354.8894958496094, 354.8894958496094, 354.8894958496094] Rank = 0
-------------------------------
[5870.8134765625, 4951.73388671875, 5743.1943359375] Rank = 3
So, basically, I would expect the negative-only losses to differ between exchange steps, but the exact same value keeps appearing, and always on the same devices, which seems confusing.
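(For context, the printing loop above sits inside SigLipLoss.forward, roughly as below based on my reading of the repo, so with 4 GPUs each rank prints world_size - 1 = 3 entries, one per exchange; the rank arithmetic is my reconstruction and may not match the source exactly.)

# Ring exchange: every rank repeatedly sends its current text features to the
# right neighbour and receives from the left one, so after world_size - 1
# steps it has used every other rank's texts as negatives exactly once.
right_rank = (self.rank + 1) % self.world_size
left_rank = (self.rank - 1 + self.world_size) % self.world_size
text_features_to_right = text_features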
Looks like I found the reason:
Loss : [7671.5458984375, 354.891357421875]
features = [array([ 0.01470058, 0.01489793, 0.01489793, -0.02232978, -0.00825857,
0.01698093, -0.00424272, -0.0476504 , -0.01017203, 0.00046863],
dtype=float32), array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], dtype=float32)]
Rank = 1
Loss : [ 354.891357421875, 354.891357421875]
features = [array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], dtype=float32), array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], dtype=float32)]
Rank = 0
-------------------------------
Loss : [9788.48046875, 7287.185546875]
features = [array([ 0.00865454, -0.0133395 , 0.03401694, -0.04332845, -0.00871346,
-0.00443785, 0.0194423 , 0.02505944, 0.0317088 , 0.0160445 ],
dtype=float32), array([ 0.01470058, 0.01489793, 0.01489793, -0.02232978, -0.00825857,
0.01698093, -0.00424272, -0.0476504 , -0.01017203, 0.00046863],
dtype=float32)]
Rank = 2
We face the same loss in cases when we have "zeroed" features. I guess this happens because the tensor is initialized with zeros here.
But I'm not sure why the function itself returns zeros and places them on the GPUs after the exchange.
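(For reference, this is roughly what neighbour_exchange looks like in open_clip's loss.py, as far as I can tell; note that the receive buffer starts out as zeros, so if the underlying send/recv silently fails, those zeros are exactly what comes back.)

import torch
import torch.distributed as dist

def neighbour_exchange(from_rank, to_rank, tensor, group=None):
    # Buffer for the incoming tensor -- stays all-zero until irecv fills it.
    tensor_recv = torch.zeros_like(tensor)
    send_op = dist.P2POp(dist.isend, tensor, to_rank, group=group)
    recv_op = dist.P2POp(dist.irecv, tensor_recv, from_rank, group=group)
    reqs = dist.batch_isend_irecv([send_op, recv_op])
    for req in reqs:
        req.wait()
    return tensor_recv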
Just to sum up, it looks like the problem is related to neighbour_exchange returning zeros, but I have no idea so far how to fix it.
The issue was introduced by torch, and it can be solved by updating torch to 1.13 😢
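(A minimal defensive check one could add before training, assuming the failure mode above; the error wording is just illustrative.)

import torch
from packaging import version

# The batched isend/irecv exchange fails silently on affected torch versions,
# so fail fast instead of training against zeroed negatives.
if version.parse(torch.__version__) < version.parse("1.13"):
    raise RuntimeError(
        f"SigLip multi-GPU loss needs torch >= 1.13, found {torch.__version__}")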