mlfoundations / open_clip

An open source implementation of CLIP.

SigLIP logits #716

Closed: Gasp34 closed this 11 months ago

Gasp34 commented 11 months ago

Hello,

I am having trouble computing the logits using the SigLIP model. I am trying the following:

import torch
from PIL import Image
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms('ViT-SO400M-14-SigLIP', pretrained='webli')
tokenizer = open_clip.get_tokenizer('ViT-SO400M-14-SigLIP')

image = preprocess(Image.open("CLIP.png")).unsqueeze(0)
text = tokenizer(["a diagram", "a dog", "a cat"])

with torch.no_grad(), torch.cuda.amp.autocast():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # L2-normalize both embeddings
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    # SigLIP logits: scaled cosine similarity plus the learned bias
    logits = model.logit_scale.exp() * image_features @ text_features.T + model.logit_bias
    # elementwise sigmoid for per-pair probabilities
    probs = 1/(1 + torch.exp(-logits))
print(logits, probs)

This gives: tensor([[ -1.2487, -8.8779, -10.9327]]) tensor([[2.2292e-01, 1.3942e-04, 1.7864e-05]])

But I think something is wrong, because 0.22 seems too low. What is the right way to do this?

gabrielilharco commented 11 months ago

Hi @Gasp34, you may want to compute the probabilities like this:

probs = logits.softmax(dim=-1)

This gives me:

tensor([[9.9945e-01, 4.8584e-04, 6.2240e-05]])

Gasp34 commented 11 months ago

I thought the purpose of SigLIP was to avoid softmax. Since SigLIP is trained with the sigmoid loss, I want to predict the same probabilities it has learned to predict.

Am I using the scale and bias correctly?

gabrielilharco commented 11 months ago

Ah, you're right. Not sure why the probs are so low, then. Maybe @rwightman has thoughts.

gabrielilharco commented 11 months ago

@Gasp34 FWIW, I get much higher probs with your code for this image https://cdn.openai.com/multimodal-neurons/assets/apple/apple-ipod.jpg and the prompts 'an apple', 'a picture of an apple', 'an ipod', 'an apple with a note saying "ipod"' (the probs are 0.2782, 0.8341, 0.5088, 0.9996).

Gasp34 commented 11 months ago

Interesting! Maybe the text needs to be much more precise to get a good score.

rwightman commented 11 months ago

You're going to see lower probs with sigmoid; with softmax the probs are forced to sum to 1. With sigmoid each probability is independent, so the 'uncertainty' will be much more evident. I also feel it's generally more meaningful than softmax for zero-shot: with softmax, if you pick 3 classes that don't apply to the picture at all, one of them could still end up at 99% because they have to sum to 1! You get no indication that the model isn't really sure and doesn't think any of those 3 apply.

FYI, the examples on the HF hub were updated to use the sigmoid prob calculation: https://huggingface.co/timm/ViT-SO400M-14-SigLIP-384#with-openclip
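
For reference, a condensed sketch of that sigmoid calculation, reusing the model, image_features, and text_features from the snippet at the top of this thread (already L2-normalized); the softmax line is only there to contrast the two:

# sigmoid scoring: each image/text pair is judged independently,
# so the probabilities do not have to sum to 1 across the prompts
logits = image_features @ text_features.T * model.logit_scale.exp() + model.logit_bias
sigmoid_probs = torch.sigmoid(logits)    # independent per-pair probabilities
softmax_probs = logits.softmax(dim=-1)   # forced to sum to 1 across the prompts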

Gasp34 commented 11 months ago

Thanks for the answer! I feel that softmax is a problem for exactly this reason! Also, with softmax, adding a new class will change the scores of the other classes...
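
A toy illustration of that point (made-up logit values, not actual model outputs): appending a class changes every softmax probability, but leaves the sigmoid probabilities of the existing classes untouched.

import torch

logits = torch.tensor([2.0, -1.0])
extended = torch.tensor([2.0, -1.0, 1.5])  # same two logits plus one new class

print(logits.softmax(dim=-1))     # tensor([0.9526, 0.0474])
print(extended.softmax(dim=-1))   # tensor([0.6038, 0.0301, 0.3662]) -- first two changed
print(torch.sigmoid(logits))      # tensor([0.8808, 0.2689])
print(torch.sigmoid(extended))    # tensor([0.8808, 0.2689, 0.8176]) -- first two unchanged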