Closed: airogachev closed this issue 9 months ago
Fine-tuning with a contrastive loss doesn't usually work by default (with the clip or siglip loss). See other issues on the same topic
On Sun, Nov 19, 2023, 10:29 Alexander Rogachev wrote:
I tune ViT-B-32 using the SigLip implementation. The loss value decreases, but if I check the quality of img-to-text matching on COCO/CrossModal, r@k for any k decreases to zero. And I can't figure out whether it happens due to some issues during training or whether I have some problems with the evaluation.
1. Recall calculations are performed using scores = emb1 @ emb2.T. Basically, it seems that scale and bias should not affect the results, as we take the top_k results based on scores (see the sanity-check sketch below). But I'm a bit confused by the discussion here: #716 https://github.com/mlfoundations/open_clip/issues/716
2. If I change the way we calculate loss and logits from the neighbour_exchange_bidir_with_grad approach to the same idea that is used for the vanilla loss:
def get_ground_truth(self, device, dtype, num_logits, negative_only=False) -> torch.Tensor:
    # -1 everywhere; +1 on the diagonal (the positive pairs) unless negative_only.
    labels = -torch.ones((self.world_size * num_logits, self.world_size * num_logits), device=device, dtype=dtype)
    if not negative_only:
        labels = 2 * torch.eye(self.world_size * num_logits, device=device, dtype=dtype) + labels
    return labels

def get_logits(self, image_features, text_features, logit_scale, logit_bias=None):
    if self.world_size > 1:
        all_image_features, all_text_features = gather_features(
            image_features, text_features,
            False, self.gather_with_grad, self.rank, self.world_size, self.use_horovod)
        logits = logit_scale * all_image_features @ all_text_features.T
    else:
        logits = logit_scale * image_features @ text_features.T
    if logit_bias is not None:
        logits += logit_bias
    return logits

def _loss(self, image_features, text_features, logit_scale, logit_bias=None, negative_only=False):
    logits = self.get_logits(image_features, text_features, logit_scale, logit_bias)
    labels = self.get_ground_truth(
        image_features.device,
        image_features.dtype,
        image_features.shape[0],
        negative_only=negative_only,
    )
    loss = -F.logsigmoid(labels * logits).sum() / image_features.shape[0]
    return loss
the quality doesn't decrease, but the results are no better than with the vanilla loss.
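(As a quick illustration of what get_ground_truth produces: for world_size * num_logits = 3, the label matrix is +1 on the diagonal and -1 elsewhere.)

import torch

labels = 2 * torch.eye(3) - torch.ones(3, 3)
print(labels)
# tensor([[ 1., -1., -1.],
#         [-1.,  1., -1.],
#         [-1., -1.,  1.]])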
So, I'd like to understand whether the quality evaluation procedure should be changed or whether there are some issues with the multi-GPU setting in the case of SigLip.
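(To double-check point 1 above, here is a minimal self-contained sketch, not from the repo, showing that a positive scale and a constant bias cannot change top-k retrieval; recall_at_k and all tensors are illustrative.)

import torch
import torch.nn.functional as F

def recall_at_k(scores, k):
    # For each row, check whether the matching column (same index) is in the top-k.
    topk = scores.topk(k, dim=1).indices
    targets = torch.arange(scores.shape[0]).unsqueeze(1)
    return (topk == targets).any(dim=1).float().mean().item()

emb1 = F.normalize(torch.randn(100, 512), dim=-1)
emb2 = F.normalize(torch.randn(100, 512), dim=-1)

scores = emb1 @ emb2.T
shifted = 10.0 * scores - 10.0  # any positive scale and any constant bias

# x -> a * x + b with a > 0 preserves the ordering of every row, so recall is identical:
assert recall_at_k(scores, 5) == recall_at_k(shifted, 5)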
See other issues on the same topic
Could you please provide any references that you mention? It's not that clear why img2text and text2img matching, performed on data that is not related to the tuning, should be so dramatically bad compared to the results I achieve with the vanilla loss. I don't face the same problem if I start from pretrained weights with the vanilla loss.
For example #740
Thanks. The weight-decay-related trick from the original paper should definitely be applied in practice; I'll try experimenting with it and report the results. However, the tuned version with WD doesn't seem to be as bad as mine. :^(
Just to verify the problem with the similarity manually, I used this notebook to calculate probs with the tuned model and visualize them.
Calculating it like this:
similarity = text_features.cpu().numpy() @ image_features.cpu().numpy().T
gives the following results
If I change the way I calculate similarity according to https://github.com/mlfoundations/open_clip/issues/716
similarity = torch.sigmoid(text_features.cpu() @ image_features.cpu().T * model.logit_scale.exp().detach() + model.logit_bias.detach()).numpy()
all the results are near zero
Thus, it seems to me that there are some problems related to the fitting procedure, not to the evaluation procedure.
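(A side note on the evaluation half of this: sigmoid(scale * s + bias) is strictly increasing in s for a positive scale, so both formulas above must rank candidates identically; a quick check with made-up features, using the paper's scale 10 and bias -10.)

import torch
import torch.nn.functional as F

text_features = F.normalize(torch.randn(8, 512), dim=-1)
image_features = F.normalize(torch.randn(8, 512), dim=-1)

raw = text_features @ image_features.T
probs = torch.sigmoid(raw * 10.0 - 10.0)

# The transform is elementwise monotone, so per-row rankings coincide
# even though every probability is far below 0.5:
assert torch.equal(raw.argsort(dim=1), probs.argsort(dim=1))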
P.S. It's not clear why we use model.logit_scale.exp() with exp(), as we don't apply exp() inside the loss calculations.
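(My guess: the model stores the scale in log space and the loss is handed the already-exponentiated value; a sketch of that assumption, with the SigLIP paper's init values t' = log 10 and b = -10.)

import numpy as np
import torch
import torch.nn as nn

# The raw parameter lives in log space (nicer to optimize); applying .exp()
# recovers the multiplicative temperature that actually scales the logits.
logit_scale = nn.Parameter(torch.ones([]) * np.log(10.0))
logit_bias = nn.Parameter(torch.ones([]) * -10.0)

print(logit_scale.exp().item())  # ~10.0, what the logits get multiplied by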
I tried to print the losses during training and noticed the following. Here is a straightforward snippet that prints the per-exchange losses:
res = []
for i in range(self.world_size - 1):
    # Receive the next batch of text features from the left neighbour
    # (they originate from rank (self.rank - 1 - i) % self.world_size).
    text_features_from_left = neighbour_exchange_with_grad(
        left_rank, right_rank, text_features_to_right)
    l = self._loss(
        image_features,
        text_features_from_left,
        logit_scale,
        logit_bias,
        negative_only=True,
    )
    loss += l
    res.append(l.item())  # track each negative-only term separately
    text_features_to_right = text_features_from_left
print(res, 'Rank = ', self.rank)
And here are the results that I got:
[354.88958740234375, 354.88958740234375, 354.88958740234375] Rank = 2
-------------------------------
[5677.8037109375, 354.88958740234375, 354.88958740234375] Rank = 1
-------------------------------
[354.88958740234375, 354.88958740234375, 354.88958740234375] Rank = 0
-------------------------------
[5458.8837890625, 5226.5947265625, 4893.4765625] Rank = 3
-------------------------------
[4701.7587890625, 5425.28125, 354.8894958496094] Rank = 2
-------------------------------
[5149.31640625, 354.8894958496094, 354.8894958496094] Rank = 1
-------------------------------
[354.8894958496094, 354.8894958496094, 354.8894958496094] Rank = 0
-------------------------------
[5870.8134765625, 4951.73388671875, 5743.1943359375] Rank = 3
So, basically, I would expect the negative-only losses to differ between exchange steps, but the exact same value keeps appearing, and always on the same devices, which seems confusing.
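(For context, the printing loop above sits inside SigLipLoss.forward, roughly as below based on my reading of the repo, so with 4 GPUs each rank prints world_size - 1 = 3 entries, one per exchange; the rank arithmetic is my reconstruction and may not match the source exactly.)

# Ring exchange: every rank repeatedly sends its current text features to the
# right neighbour and receives from the left one, so after world_size - 1
# steps it has used every other rank's texts as negatives exactly once.
right_rank = (self.rank + 1) % self.world_size
left_rank = (self.rank - 1 + self.world_size) % self.world_size
text_features_to_right = text_features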
Looks like I found the reason:
Loss : [7671.5458984375, 354.891357421875]
features = [array([ 0.01470058, 0.01489793, 0.01489793, -0.02232978, -0.00825857,
0.01698093, -0.00424272, -0.0476504 , -0.01017203, 0.00046863],
dtype=float32), array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], dtype=float32)]
Rank = 1
Loss : [ 354.891357421875, 354.891357421875]
features = [array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], dtype=float32), array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], dtype=float32)]
Rank = 0
-------------------------------
Loss : [9788.48046875, 7287.185546875]
features = [array([ 0.00865454, -0.0133395 , 0.03401694, -0.04332845, -0.00871346,
-0.00443785, 0.0194423 , 0.02505944, 0.0317088 , 0.0160445 ],
dtype=float32), array([ 0.01470058, 0.01489793, 0.01489793, -0.02232978, -0.00825857,
0.01698093, -0.00424272, -0.0476504 , -0.01017203, 0.00046863],
dtype=float32)]
Rank = 2
We face the same loss in cases when we have "zeroed" features. I guess this happens because the tensor is initialized with zeros here.
But I'm not sure why the function itself returns zeros and places them on the GPUs after the exchange.
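(For reference, this is roughly what neighbour_exchange looks like in open_clip's loss.py, as far as I can tell; note that the receive buffer starts out as zeros, so if the underlying send/recv silently fails, those zeros are exactly what comes back.)

import torch
import torch.distributed as dist

def neighbour_exchange(from_rank, to_rank, tensor, group=None):
    # Buffer for the incoming tensor -- stays all-zero until irecv fills it.
    tensor_recv = torch.zeros_like(tensor)
    send_op = dist.P2POp(dist.isend, tensor, to_rank, group=group)
    recv_op = dist.P2POp(dist.irecv, tensor_recv, from_rank, group=group)
    reqs = dist.batch_isend_irecv([send_op, recv_op])
    for req in reqs:
        req.wait()
    return tensor_recv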
Just to sum up, it looks like the problem is related to neighbour_exchange returning zeros, but I have no idea so far how to fix it.
The issue was introduced by torch, and it can be solved by updating torch to 1.13 😢
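(A minimal defensive check one could add before training, assuming the failure mode above; the error wording is just illustrative.)

import torch
from packaging import version

# The batched isend/irecv exchange fails silently on affected torch versions,
# so fail fast instead of training against zeroed negatives.
if version.parse(torch.__version__) < version.parse("1.13"):
    raise RuntimeError(
        f"SigLip multi-GPU loss needs torch >= 1.13, found {torch.__version__}")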