huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

google/siglip-so400m-patch14-384 inference output mismatch with pipeline output #30951

Closed aliencaocao closed 4 months ago

aliencaocao commented 5 months ago

System Info

Who can help?

@amyeroberts

Reproduction

Using sample code from https://huggingface.co/google/siglip-so400m-patch14-384:

from PIL import Image
import requests
from transformers import AutoProcessor, AutoModel
import torch

model = AutoModel.from_pretrained("google/siglip-so400m-patch14-384")
processor = AutoProcessor.from_pretrained("google/siglip-so400m-patch14-384")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

texts = ["a photo of 2 cats", "a photo of 2 dogs"]
inputs = processor(text=texts, images=image, padding="max_length", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

logits_per_image = outputs.logits_per_image
probs = torch.sigmoid(logits_per_image) # these are the probabilities
print(f"{probs[0][0]:.1%} that image 0 is '{texts[0]}'")

The output does not match the pipeline approach:

from transformers import pipeline
from PIL import Image
import requests

# load pipe
image_classifier = pipeline(task="zero-shot-image-classification", model="google/siglip-so400m-patch14-384")

# load image
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

# inference
outputs = image_classifier(image, candidate_labels=["2 cats", "a plane", "a remote"])
outputs = [{"score": round(output["score"], 4), "label": output["label"] } for output in outputs]
print(outputs)

The difference is massive. The pipeline approach seems to give the right results, which also align with the Inference API.

Expected behavior

Correct result for the manual approach

jla524 commented 5 months ago

I found that the pipeline uses a prompt, which adds "This is a photo of " before every candidate label.

For example, when we pass in the label "2 cats", the pipeline converts it to "This is a photo of 2 cats".
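
As a minimal sketch of that expansion, assuming the pipeline's default `hypothesis_template` is "This is a photo of {}." (note the trailing period; check this against your installed version). The helper name `expand_labels` is hypothetical and not part of transformers:

```python
# Assumption: this mirrors the default `hypothesis_template` of the
# zero-shot-image-classification pipeline; verify it for your version.
DEFAULT_TEMPLATE = "This is a photo of {}."


def expand_labels(candidate_labels, template=DEFAULT_TEMPLATE):
    """Hypothetical helper: fill each candidate label into the prompt template."""
    return [template.format(label) for label in candidate_labels]


# The same labels that were passed to the pipeline in the report above.
texts = expand_labels(["2 cats", "a plane", "a remote"])
print(texts)
```

The pipeline call also accepts a `hypothesis_template` argument, so going the other way should work too: passing something like `hypothesis_template="a photo of {}"` would make the pipeline build the same texts as the model card example.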

We can modify the code to match this behaviour with the manual approach, and we'll get the same results:

from PIL import Image
import requests
from transformers import AutoProcessor, AutoModel
import torch

model = AutoModel.from_pretrained("google/siglip-so400m-patch14-384")
processor = AutoProcessor.from_pretrained("google/siglip-so400m-patch14-384")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

labels = ["2 cats", "2 dogs"]
texts = [f"this is a photo of {label}" for label in labels]
inputs = processor(text=texts, images=image, padding="max_length", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

logits_per_image = outputs.logits_per_image
probs = torch.sigmoid(logits_per_image) # these are the probabilities
print(f"{probs[0][0]:.2%} that '{texts[0]}'")  # 50.89% that 'this is a photo of 2 cats'
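
Note that the sigmoid is applied to each logit separately, so a label's score depends only on its own text and not on the other candidates. A minimal sketch with made-up logits (the numbers are illustrative, not model outputs, and `sigmoid` here is a plain-Python stand-in for `torch.sigmoid`):

```python
import math


def sigmoid(x):
    """Plain-Python stand-in for torch.sigmoid on a single value."""
    return 1.0 / (1.0 + math.exp(-x))


# Made-up logits for one image against two candidate texts.
logits_a = [0.04, -9.0]
# Same first text, different second text (again made-up values).
logits_b = [0.04, -3.5]

probs_a = [sigmoid(v) for v in logits_a]
probs_b = [sigmoid(v) for v in logits_b]

# The first score is the same in both runs, and the scores need not sum to 1.
print(f"{probs_a[0]:.2%} vs {probs_b[0]:.2%}")
print(sum(probs_a))
```

So the mismatch should come down to the prompt text alone: once the manual texts match what the pipeline builds, the per-label scores are expected to line up.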

aliencaocao commented 5 months ago

Wow that requires some docs...

amyeroberts commented 5 months ago

Hi @aliencaocao, thanks for raising this issue!

This is in the docs, however it isn't super obvious. If you'd like to open a PR to update the example for the pipeline to highlight this, I'd be very happy to review.

@NielsRogge Could you fix the example on the siglip page, as you'll have permissions to open and merge the PR there?

aliencaocao commented 5 months ago

> If you'd like to open a PR to update the example for the pipeline to highlight this I'd be very happy to review.

Sure, I will do it soon. Thanks for pointing that out.

aliencaocao commented 5 months ago

PR made