thomas-woodruff opened 1 year ago
I am surprised because both methods seem to use the same transformation, but I'll take a look! thanks!!
This looks like a similar issue:
import requests
from io import BytesIO

from PIL import Image
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("patrickjohncyh/fashion-clip")
processor = CLIPProcessor.from_pretrained("patrickjohncyh/fashion-clip")

image = requests.get('https://sc04.alicdn.com/kf/Ha258d067f6ff4af687a73b1b18b07333w/233027149/Ha258d067f6ff4af687a73b1b18b07333w.jpg').content
image = Image.open(BytesIO(image))

inputs = processor(text=['paperbag waist', 'waist band', 'drawstring waist'],
                   images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1)
print(probs)
test_captions = ['drawstring waist', 'paperbag waist', 'waist band']
test_img_path = 'paperbag_waist.jpg'
#display_images([test_img_path])
fclip.zero_shot_classification([test_img_path], test_captions)
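For reference, fclip in the snippet above is assumed to be this repo's FashionCLIP wrapper, created roughly like this (import path and constructor taken from the repo README; adjust if your setup differs):

from fashion_clip.fashion_clip import FashionCLIP

# instantiate the wrapper around the "fashion-clip" weights
fclip = FashionCLIP('fashion-clip')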
The probs generated here and by the Hugging Face hosted inference UI seem to be different: https://huggingface.co/patrickjohncyh/fashion-clip. I believe both should ideally output the same probabilities for the same input image? Are they both using the latest v2 models?
Both of the above methods wrongly classify the image as 'drawstring waist', but it's correctly identified in the HF hosted inference API.
Hi @anilsathyan7!
I am not sure how the UI computes the score; in the meantime, I have run your example on both the original HF API and our internal wrapper and the results are more or less the same. Take a look:
img_url = "https://sc04.alicdn.com/kf/Ha258d067f6ff4af687a73b1b18b07333w/233027149/Ha258d067f6ff4af687a73b1b18b07333w.jpg"
image = requests.get(img_url).content
image = Image.open(BytesIO(image))

inputs = processor(text=['paperbag waist', 'waist band', 'drawstring waist'],
                   images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1)
print(probs)
>>> [0.1976, 0.0051, 0.7973]
import numpy as np
import torch

test_captions = ['paperbag waist', 'waist band', 'drawstring waist']
test_img_path = 'paperbag_waist.jpg'

images = [test_img_path]
texts = test_captions

# we create image embeddings and text embeddings
image_embeddings = fclip.encode_images(images, batch_size=32)
text_embeddings = fclip.encode_text(texts, batch_size=32)

# we normalize the embeddings to unit norm (so that we can use dot product instead of cosine similarity to do comparisons)
image_embeddings = image_embeddings / np.linalg.norm(image_embeddings, ord=2, axis=-1, keepdims=True)
text_embeddings = text_embeddings / np.linalg.norm(text_embeddings, ord=2, axis=-1, keepdims=True)

# note that we need to include logit scaling to get the same output the default Hugging Face model gives us
logit_scaling = fclip.model.logit_scale.exp().item()
torch.tensor(image_embeddings.dot(text_embeddings.T) * logit_scaling).softmax(dim=1)
>>> [0.1976, 0.0051, 0.7972]
Which are reasonably similar scores.
@vinid Ok, that's strange. The hosted API clearly classifies the image as 'paperbag waist' with a probability of 0.943. That's a large difference, and the 'Hosted inference API' output is actually correct. What could be the reason for this?
It's an effect of prompting: by default, the pipeline component (which the UI uses) formats each label with the template "This is a photo of {}." See here.
test_img_path = 'paperbag_waist.jpg'
test_captions = ['This is a photo of paperbag waist.', 'This is a photo of waist band.', 'This is a photo of drawstring waist.']

images = [test_img_path]
texts = test_captions

# we create image embeddings and text embeddings
image_embeddings = fclip.encode_images(images, batch_size=32)
text_embeddings = fclip.encode_text(texts, batch_size=32)

# we normalize the embeddings to unit norm (so that we can use dot product instead of cosine similarity to do comparisons)
image_embeddings = image_embeddings / np.linalg.norm(image_embeddings, ord=2, axis=-1, keepdims=True)
text_embeddings = text_embeddings / np.linalg.norm(text_embeddings, ord=2, axis=-1, keepdims=True)

# include the logit scaling so the scores match what the Hugging Face model returns
logit_scaling = fclip.model.logit_scale.exp().item()
torch.tensor(image_embeddings.dot(text_embeddings.T) * logit_scaling).softmax(dim=1)
>>> [0.6159, 0.0288, 0.3552]
(You have some typos in your screenshot; you should remove the stray ' characters.)
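For completeness, the same prompting behaviour can be reproduced with the transformers zero-shot-image-classification pipeline, whose hypothesis_template argument defaults to "This is a photo of {}." (a rough sketch, reusing the local image path from your example):

from transformers import pipeline

# build the zero-shot classification pipeline on top of FashionCLIP
classifier = pipeline("zero-shot-image-classification", model="patrickjohncyh/fashion-clip")

# the hypothesis_template below is the pipeline's default, written out explicitly
results = classifier(
    "paperbag_waist.jpg",
    candidate_labels=["paperbag waist", "waist band", "drawstring waist"],
    hypothesis_template="This is a photo of {}.",
)
print(results)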
@vinid Thanks a lot ... Even changing the full stop in the caption completely changes the result. Prompt engineering!! 😅
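For example, scoring the same captions with and without the trailing full stop side by side makes the sensitivity easy to see (a sketch reusing the recipe above; the caption lists are just illustrative):

import numpy as np
import torch

captions_with_period = ['This is a photo of paperbag waist.', 'This is a photo of waist band.', 'This is a photo of drawstring waist.']
captions_without_period = [c.rstrip('.') for c in captions_with_period]

image_embeddings = fclip.encode_images(['paperbag_waist.jpg'], batch_size=32)
image_embeddings = image_embeddings / np.linalg.norm(image_embeddings, ord=2, axis=-1, keepdims=True)
logit_scaling = fclip.model.logit_scale.exp().item()

for captions in (captions_with_period, captions_without_period):
    text_embeddings = fclip.encode_text(captions, batch_size=32)
    text_embeddings = text_embeddings / np.linalg.norm(text_embeddings, ord=2, axis=-1, keepdims=True)
    probs = torch.tensor(image_embeddings.dot(text_embeddings.T) * logit_scaling).softmax(dim=1)
    print(captions, probs)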
Great find! I was just thinking the same thing and was pleasantly surprised to stumble onto this insightful thread. In my time using FashionCLIP, I did find the "photo of" trick works quite well but I didn't know that was the reason for the discrepancy. Thx all!
Hello there,
I was looking into the difference in performance between the Hugging Face implementation of FashionCLIP and this repo, which wraps around the former.
I noticed there's a discrepancy between the image embeddings produced by the two approaches. Having dug into it, it looks like the cause is that in this repo the images are put into a Hugging Face Dataset here before being passed to the model.
The code below illustrates the discrepancy:
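A rough sketch of the kind of comparison described, assuming fclip is an instance of this repo's FashionCLIP wrapper and using the variable names mentioned below (the exact preprocessing details are assumptions):

import numpy as np
import torch
from datasets import Dataset, Image as HFImage
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("patrickjohncyh/fashion-clip")
processor = CLIPProcessor.from_pretrained("patrickjohncyh/fashion-clip")

image_paths = ['paperbag_waist.jpg']

# 1) Hugging Face model fed PIL images directly (no Dataset involved)
pil_images = [Image.open(p) for p in image_paths]
with torch.no_grad():
    inputs = processor(images=pil_images, return_tensors="pt")
    hf_wo_embeddings = model.get_image_features(**inputs).numpy()

# 2) Hugging Face model fed images routed through a Hugging Face Dataset first
ds = Dataset.from_dict({"image": image_paths}).cast_column("image", HFImage())
with torch.no_grad():
    inputs = processor(images=[example["image"] for example in ds], return_tensors="pt")
    hf_ds_embeddings = model.get_image_features(**inputs).numpy()

# 3) this repo's wrapper, which builds a Dataset internally
fc_embeddings = fclip.encode_images(image_paths, batch_size=32)

# compare the three sets of embeddings
print(np.abs(hf_ds_embeddings - fc_embeddings).max())
print(np.abs(hf_wo_embeddings - fc_embeddings).max())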
In the above code, the embeddings produced by passing the images through a Dataset, hf_ds_embeddings, are the same as those produced by this repo, fc_embeddings. The embeddings produced without using a Dataset, hf_wo_embeddings, are slightly different. I imagine that putting the images into the dataset is implicitly applying some transformation or pre-processing.
Just wanted to flag this, thanks!