Closed aliencaocao closed 4 months ago
I found that the pipeline uses a prompt template that prepends "This is a photo of " to every candidate label.
For example, when we pass in the label "2 cats", the pipeline converts it to "This is a photo of 2 cats".
We can modify the manual approach to match this behaviour, and then we get the same results:
from PIL import Image
import requests
from transformers import AutoProcessor, AutoModel
import torch

model = AutoModel.from_pretrained("google/siglip-so400m-patch14-384")
processor = AutoProcessor.from_pretrained("google/siglip-so400m-patch14-384")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Bare labels; the prompt prefix is added below, as the pipeline does
labels = ["2 cats", "2 dogs"]
texts = [f"this is a photo of {label}" for label in labels]
inputs = processor(text=texts, images=image, padding="max_length", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

logits_per_image = outputs.logits_per_image
probs = torch.sigmoid(logits_per_image)  # these are the probabilities
print(f"{probs[0][0]:.2%} that '{texts[0]}'")  # 50.89% that 'this is a photo of 2 cats'
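To make the label expansion explicit, here is a minimal sketch of it on its own. It assumes the pipeline's default `hypothesis_template` is "This is a photo of {}." and that it is filled in with `str.format`. `build_prompts` and `DEFAULT_TEMPLATE` are hypothetical names used for illustration, not part of transformers.

```python
# Hypothetical helper: mirrors how the zero-shot image classification
# pipeline is assumed to turn candidate labels into full prompts.
DEFAULT_TEMPLATE = "This is a photo of {}."  # assumed pipeline default


def build_prompts(labels, template=DEFAULT_TEMPLATE):
    """Return one prompt per label by filling the template."""
    return [template.format(label) for label in labels]


prompts = build_prompts(["2 cats", "2 dogs"])
print(prompts)
```

The resulting strings can be passed as `text=` to the processor in the manual approach. On the pipeline side, a different template can be supplied through the `hypothesis_template` argument if the default prompt is not wanted.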
Wow that requires some docs...
Hi @aliencaocao, thanks for raising this issue!
This is in the docs, however it isn't super obvious. If you'd like to open a PR to update the example for the pipeline to highlight this, I'd be very happy to review.
@NielsRogge Could you fix the example on the siglip page, as you'll have permissions to open and merge the PR there?
> If you'd like to open a PR to update the example for the pipeline to highlight this I'd be very happy to review.
Sure, I will do it soon, thanks for pointing that out.
PR made
System Info
transformers version: 4.41.0
Who can help?
@amyeroberts
Reproduction
Using sample code from https://huggingface.co/google/siglip-so400m-patch14-384:
The output mismatches with the pipeline approach:
The difference is massive. The pipeline approach seems to give the right results, which also align with the Inference API.
Expected behavior
Correct result for the manual approach