patrickjohncyh / fashion-clip

FashionCLIP is a CLIP-like model fine-tuned for the fashion domain.
MIT License

Discrepancy between Hugging Face and fashion-clip #14

Open thomas-woodruff opened 1 year ago

thomas-woodruff commented 1 year ago

Hello there,

I was looking into the difference in performance between the Hugging Face implementation of FashionCLIP and this repo, which wraps around the former.

I noticed there's a discrepancy between the image embeddings produced by the two approaches. Having dug into it, it looks like the cause is that in this repo the images are put into a Hugging Face Dataset here before being passed to the model.

The below code illustrates the discrepancy:

from transformers import CLIPProcessor, CLIPModel
from fashion_clip.fashion_clip import FashionCLIP
import torch
from datasets import Dataset

model_name = "patrickjohncyh/fashion-clip"
model = CLIPModel.from_pretrained(model_name)
processor = CLIPProcessor.from_pretrained(model_name)

# `images` is assumed to be a list of PIL images
batch_size = 32

def get_image_embeddings_without_dataset(images):
    # preprocess the images and embed them directly with the Hugging Face model
    inputs = processor(images=images, return_tensors='pt')

    with torch.no_grad():
        embeddings = model.get_image_features(**inputs)

    return embeddings.numpy()

def pass_images_through_data(images):
    # round-trip the images through a Hugging Face Dataset, as this repo does
    dataset = Dataset.from_dict({'image': images})
    images = dataset['image']
    return images

def get_image_embeddings_with_dataset(images):
    images = pass_images_through_data(images)
    return get_image_embeddings_without_dataset(images)

hf_ds_embeddings = get_image_embeddings_with_dataset(images)
hf_wo_embeddings = get_image_embeddings_without_dataset(images)

fclip = FashionCLIP('fashion-clip')
fc_embeddings = fclip.encode_images(images, batch_size=batch_size)

In the above code, the embeddings produced by passing the images through a Dataset, hf_ds_embeddings, are the same as those produced by this repo, fc_embeddings. The embeddings produced without using a Dataset, hf_wo_embeddings, are slightly different.

I imagine that putting the images into the dataset is implicitly applying some transformation or pre-processing.
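A quick way to check this (a rough sketch reusing `images`, `pass_images_through_data` and the embedding variables from the snippet above) is to compare the pixel data before and after the Dataset round trip, and to measure how far apart the two sets of embeddings are:

import numpy as np

# compare raw pixels before and after the Dataset round trip
original = np.asarray(images[0])
round_tripped = np.asarray(pass_images_through_data(images)[0])
print("pixels identical:", original.shape == round_tripped.shape and np.array_equal(original, round_tripped))

# compare the two sets of embeddings numerically
print("max absolute difference:", np.abs(hf_ds_embeddings - hf_wo_embeddings).max())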

Just wanted to flag this, thanks!

vinid commented 1 year ago

I am surprised because both methods seem to use the same transformation, but I'll take a look! thanks!!

anilsathyan7 commented 1 year ago

This looks like a similar issue:

import requests
from PIL import Image
from io import BytesIO
from transformers import CLIPProcessor, CLIPModel
from fashion_clip.fashion_clip import FashionCLIP

model = CLIPModel.from_pretrained("patrickjohncyh/fashion-clip")
processor = CLIPProcessor.from_pretrained("patrickjohncyh/fashion-clip")
fclip = FashionCLIP('fashion-clip')

image = requests.get('https://sc04.alicdn.com/kf/Ha258d067f6ff4af687a73b1b18b07333w/233027149/Ha258d067f6ff4af687a73b1b18b07333w.jpg').content
image = Image.open(BytesIO(image))

inputs = processor(text=['paperbag waist', 'waist band', 'drawstring waist'],
                   images=image, return_tensors="pt", padding=True)

outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1)
print(probs)

test_captions = ['drawstring waist', 'paperbag waist', 'waist band']
test_img_path = 'paperbag_waist.jpg'
# display_images([test_img_path])
fclip.zero_shot_classification([test_img_path], test_captions)

The probs generated here and by the Hugging Face hosted inference UI seem to be different: https://huggingface.co/patrickjohncyh/fashion-clip. I believe both should ideally output the same probabilities for the same input image? Are they both using the latest v2 models?

Both of the above methods wrongly classify the image as 'drawstring waist', but it is correctly identified by the HF hosted inference API.

(screenshot: output of the Hugging Face hosted inference widget)

vinid commented 1 year ago

Hi @anilsathyan7!

I am not sure how the UI computes the score; in the meantime, I have run your example on both the original HF API and our internal wrapper and the results are more or less the same. Take a look:

# model and processor as loaded in the snippet above
img_url = "https://sc04.alicdn.com/kf/Ha258d067f6ff4af687a73b1b18b07333w/233027149/Ha258d067f6ff4af687a73b1b18b07333w.jpg"
image = requests.get(img_url).content
image = Image.open(BytesIO(image))

inputs = processor(text=['paperbag waist', 'waist band', 'drawstring waist'],
                   images=image, return_tensors="pt", padding=True)

outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1)
print(probs)

>>> [0.1976, 0.0051, 0.7973]


test_captions = ['paperbag waist', 'waist band', 'drawstring waist']
test_img_path = 'paperbag_waist.jpg'

images = [test_img_path]
texts = test_captions

# we create image embeddings and text embeddings
image_embeddings = fclip.encode_images(images, batch_size=32)
text_embeddings = fclip.encode_text(texts, batch_size=32)

# we normalize the embeddings to unit norm (so that we can use dot product instead of cosine similarity to do comparisons)
image_embeddings = image_embeddings/np.linalg.norm(image_embeddings, ord=2, axis=-1, keepdims=True)
text_embeddings = text_embeddings/np.linalg.norm(text_embeddings, ord=2, axis=-1, keepdims=True)

# note that we need to include logit scaling to get the same output the default hugging face model gives us
logit_scaling = fclip.model.logit_scale.exp().item()
torch.tensor(image_embeddings.dot(text_embeddings.T) * logit_scaling).softmax(dim=1)

>>> [0.1976, 0.0051, 0.7972]

These are reasonably similar scores.

anilsathyan7 commented 1 year ago

@vinid Ok, that's strange. The hosted API clearly classifies the image as 'paperbag waist', with a probability of 0.943. That's a large difference, and the 'Hosted inference API' output is actually correct. What could be the reason for this?

vinid commented 1 year ago

It's an effect of prompting: by default, the pipeline component (which the UI uses) wraps each label in the template "This is a photo of {}." See here.


test_img_path = 'paperbag_waist.jpg'
test_captions = ['This is a photo of paperbag waist.', 'This is a photo of waist band.', 'This is a photo of drawstring waist.']

images = [test_img_path]
texts = test_captions

# we create image embeddings and text embeddings
image_embeddings = fclip.encode_images(images, batch_size=32)
text_embeddings = fclip.encode_text(texts, batch_size=32)

# we normalize the embeddings to unit norm (so that we can use dot product instead of cosine similarity to do comparisons)
image_embeddings = image_embeddings/np.linalg.norm(image_embeddings, ord=2, axis=-1, keepdims=True)
text_embeddings = text_embeddings/np.linalg.norm(text_embeddings, ord=2, axis=-1, keepdims=True)

logit_scaling = fclip.model.logit_scale.exp().item()
torch.tensor(image_embeddings.dot(text_embeddings.T) * logit_scaling).softmax(dim=1)

>>> [0.6159, 0.0288, 0.3552]

(You have some typos in your screenshot; you should remove the stray ' characters.)
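For reference, here is a rough sketch of how the same prompting can be reproduced with the transformers zero-shot-image-classification pipeline, which applies a template of this kind to each label by default (assuming the local paperbag_waist.jpg image used above):

from transformers import pipeline

# the zero-shot image classification pipeline templates each candidate label
classifier = pipeline("zero-shot-image-classification", model="patrickjohncyh/fashion-clip")

result = classifier(
    "paperbag_waist.jpg",  # local image path, as in the examples above
    candidate_labels=["paperbag waist", "waist band", "drawstring waist"],
    hypothesis_template="This is a photo of {}.",
)
print(result)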

anilsathyan7 commented 1 year ago

@vinid Thanks a lot! Even changing the full stop in the caption completely changes the result. Prompt engineering!! 😅
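For anyone who wants to see this sensitivity directly, here is a rough sketch reusing the fclip wrapper and the image from above, comparing the probabilities with and without the trailing full stop (the exact numbers will vary):

import numpy as np
import torch

captions_with_period = ['This is a photo of paperbag waist.', 'This is a photo of waist band.', 'This is a photo of drawstring waist.']
captions_without_period = [c.rstrip('.') for c in captions_with_period]

image_embeddings = fclip.encode_images(['paperbag_waist.jpg'], batch_size=32)
image_embeddings = image_embeddings / np.linalg.norm(image_embeddings, ord=2, axis=-1, keepdims=True)

logit_scaling = fclip.model.logit_scale.exp().item()

# score the same image against both caption variants
for captions in (captions_with_period, captions_without_period):
    text_embeddings = fclip.encode_text(captions, batch_size=32)
    text_embeddings = text_embeddings / np.linalg.norm(text_embeddings, ord=2, axis=-1, keepdims=True)
    probs = torch.tensor(image_embeddings.dot(text_embeddings.T) * logit_scaling).softmax(dim=1)
    print(captions, probs)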

dalphajw commented 1 year ago

Great find! I was just thinking the same thing and was pleasantly surprised to stumble onto this insightful thread. In my time using FashionCLIP, I did find the "photo of" trick works quite well but I didn't know that was the reason for the discrepancy. Thx all!