huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Sigmoid instead of softmax used in Documentation and autopipeline for siglip #31064

Closed rishabh063 closed 3 months ago

rishabh063 commented 4 months ago

System Info

Documentation link: https://huggingface.co/docs/transformers/en/model_doc/siglip

Reproduction

Sample code:

from PIL import Image
import requests
from transformers import AutoProcessor, AutoModel
import torch

model = AutoModel.from_pretrained("google/siglip-base-patch16-224")
processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-224")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

texts = ["a photo of 2 cats", "a photo of 2 dogs"]
# important: we pass `padding=max_length` since the model was trained with this
inputs = processor(text=texts, images=image, padding="max_length", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

logits_per_image = outputs.logits_per_image
probs = torch.sigmoid(logits_per_image) # these are the probabilities
print(f"{probs[0][0]:.1%} that image 0 is '{texts[0]}'")

Expected behavior

This line should be replaced

probs = torch.sigmoid(logits_per_image) # these are the probabilities

with

probs = logits_per_image.softmax(dim=1)

NielsRogge commented 4 months ago

Hi,

SigLIP applies sigmoid instead of softmax, that's where the name comes from ;)
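
To illustrate the difference, a minimal sketch (with made-up logits, not actual SigLIP outputs):

import torch

# Hypothetical logits for 1 image and 2 candidate texts.
logits_per_image = torch.tensor([[10.0, -5.0]])

# SigLIP: each image-text pair is scored independently with a sigmoid.
sigmoid_probs = torch.sigmoid(logits_per_image)    # roughly [[1.00, 0.007]]

# CLIP-style: softmax normalizes across the candidate texts.
softmax_probs = logits_per_image.softmax(dim=-1)   # roughly [[1.00, 0.00]]

print(sigmoid_probs.sum(dim=-1))  # not constrained to sum to 1
print(softmax_probs.sum(dim=-1))  # always sums to 1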

rishabh063 commented 4 months ago

Isn't that just for training? For inference, when getting probabilities, softmax would represent them better.

Alternatively, you can remove that comment or just make sure the values sum to 100%.

It was annoying to see the probabilities not adding up to 100%.

amyeroberts commented 4 months ago

@rishabh063 Would you like to open a PR to make this change?

rishabh063 commented 4 months ago

With the softmax one?

NielsRogge commented 4 months ago

Applying a softmax to a model trained with sigmoid is technically not allowed - see this thread for some info from the authors.

cc @merveenoyan who had a trick to normalize the probabilities. But with sigmoid, you need to interpret them for each image-text pair independently (just like in multi-label classification)
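
For example, a rough sketch of the multi-label style of interpretation (made-up logits and an illustrative threshold, not an official recommendation):

import torch

# Made-up logits for 1 image and 3 candidate texts.
logits_per_image = torch.tensor([[2.3, -1.7, 0.4]])
probs = torch.sigmoid(logits_per_image)

threshold = 0.5  # illustrative cut-off only
for idx, p in enumerate(probs[0]):
    match = "match" if p.item() > threshold else "no match"
    print(f"text {idx}: p={p.item():.3f} -> {match}")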

amyeroberts commented 4 months ago

@rishabh063 I think updating the comment - just to indicate the values won't add up to 1 - would make most sense
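
Something along these lines (just a suggestion for the wording, not the final PR text):

probs = torch.sigmoid(logits_per_image) # probabilities of each individual image-text pair; they will not sum to 1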

rishabh063 commented 4 months ago

The thread shared by @NielsRogge has an example by @merveenoyan of re-scaling these values.

I guess that's the best approach. One of you should make the PR and include a short explanation as well.

Most people are familiar with CLIP but not with SigLIP.
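
For reference, one way such a re-scaling could look (this is only my guess at what the linked example does, and, as noted below, the result is not a calibrated distribution):

import torch

# Made-up logits for 1 image and 2 candidate texts.
logits_per_image = torch.tensor([[2.3, -1.7]])
sigmoid_probs = torch.sigmoid(logits_per_image)

# Divide by the sum so the values add up to 1 for display purposes.
rescaled = sigmoid_probs / sigmoid_probs.sum(dim=-1, keepdim=True)
print(rescaled.sum(dim=-1))  # tensor([1.])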

amyeroberts commented 4 months ago

@rishabh063 Yes, that's true. You'll notice that later in the discussion @rwightman points out that this re-scaling is unlikely to be properly calibrated. Rather than make an assumption about the data/task (cf. this comment), it's better just to flag to users that the outputs might not add up to 1.

github-actions[bot] commented 3 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.