patrickjohncyh / fashion-clip

FashionCLIP is a CLIP-like model fine-tuned for the fashion domain.
MIT License
290 stars 34 forks

Logit scaling #12

Closed thomas-woodruff closed 1 year ago

thomas-woodruff commented 1 year ago

Hello there,

In the HuggingFace implementation, the dot product of the text and image embeddings is scaled by logit_scale: https://github.com/huggingface/transformers/blob/main/src/transformers/models/clip/modeling_clip.py#L1152

In this repo it seems that there is no such scaling: https://github.com/patrickjohncyh/fashion-clip/blob/08be0d9416c743c599e7ef76cd00b72627b25270/fashion_clip/fashion_clip.py#L218

Which is the correct way?

Thanks, Tom

vinid commented 1 year ago

Hello!

Logit scaling is used during training; the code we have here is just for model usage.

FashionCLIP was trained with logit scaling.
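
For context, here is a minimal sketch of roughly how a learnable logit scale enters a CLIP-style symmetric contrastive loss. This is a simplified illustration, not the actual FashionCLIP training code; `image_embeds` and `text_embeds` stand in for L2-normalized batch embeddings.

import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds, text_embeds, logit_scale):
    # image_embeds, text_embeds: (batch, dim), assumed L2-normalized
    # logit_scale: exp() of the learnable log-temperature parameter
    logits_per_image = logit_scale * image_embeds @ text_embeds.t()
    logits_per_text = logits_per_image.t()
    targets = torch.arange(image_embeds.size(0))  # i-th image matches i-th text
    loss_i = F.cross_entropy(logits_per_image, targets)
    loss_t = F.cross_entropy(logits_per_text, targets)
    return (loss_i + loss_t) / 2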

thomas-woodruff commented 1 year ago

Thanks for your reply.

My understanding is that because the model was trained with logit scaling and a cross-entropy loss, we should apply the logit scaling at inference time if we care about the probabilities being well calibrated. If we only care about the ranking, e.g. finding the most likely label for an image, the logit scaling is not necessary.

Is that correct?

vinid commented 1 year ago

You are right in the sense that scaling will give you probabilities distributed similarly to how the model was trained. I don't know your use case, but I expect the probabilities to be batch dependent; they could still be informative for classification, though (the ranking is going to be the same, but the distribution is going to be affected by the scaling).

FWIW, you can find the logit scale of the model here; note that our logit scaling follows OpenAI's work, and it was a trainable parameter in our setup.
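
To make that concrete, here is a minimal sketch (with made-up toy similarity values, assuming the HuggingFace checkpoint) showing that applying the scale leaves the ranking untouched while sharpening the softmax distribution:

import torch
from transformers import CLIPModel

model = CLIPModel.from_pretrained("patrickjohncyh/fashion-clip")
logit_scale = model.logit_scale.exp().item()  # the trained scale, stored as a log value

# toy cosine similarities between one image and three candidate labels
unscaled = torch.tensor([0.28, 0.25, 0.10])
scaled = unscaled * logit_scale

print(unscaled.softmax(dim=0))               # close to uniform
print(scaled.softmax(dim=0))                 # much more peaked
print(unscaled.argmax() == scaled.argmax())  # ranking unchanged -> tensor(True)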

thomas-woodruff commented 1 year ago

My use case is zero-shot multiclass classification. Rather than just getting the most likely class label, I'd like to have the probability of each class. So I think in this case it's more appropriate to use the logit scaling - what do you think?

What do you mean when you say you expect the probabilities to be batch dependent?

For posterity, here's how to get the scaled and unscaled logits using HuggingFace:

from transformers import CLIPProcessor, CLIPModel
import torch

model_name = "patrickjohncyh/fashion-clip"
model = CLIPModel.from_pretrained(model_name)
processor = CLIPProcessor.from_pretrained(model_name)

classes = [ ... ]  # list of class label strings
images = [ ... ]   # list of PIL images

inputs = processor(
    text=classes, images=images, return_tensors="pt", padding=True
)

outputs = model(**inputs)

# logits returned by the model, already multiplied by the logit scale
scaled_logits_per_text = outputs.logits_per_text

# raw cosine similarities (the embeddings are L2-normalized by the model)
unscaled_logits_per_text = torch.matmul(outputs.text_embeds, outputs.image_embeds.t())

# applying the logit scale manually reproduces logits_per_text
logit_scale = model.logit_scale.exp()
manually_scaled_logits_per_text = torch.matmul(outputs.text_embeds, outputs.image_embeds.t()) * logit_scale
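
And if the goal is per-class probabilities for each image, a minimal follow-up (continuing the snippet above) is to softmax the scaled logits over the class dimension:

# logits_per_text has shape (num_classes, num_images); normalize over classes
probs_per_image = scaled_logits_per_text.softmax(dim=0)

# equivalently, outputs.logits_per_image.softmax(dim=-1) gives (num_images, num_classes)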

Thanks!

vinid commented 1 year ago

> What do you mean when you say you expect the probabilities to be batch dependent?

Sorry, I think I worded this in the wrong way. What I mean is that for every image, the label distribution you are going to get will depend on which labels are going to be considered.

E.g., when classifying a shoe, the distribution is going to be different if your list of labels was ["shoe", "skirt"] vs ["shoe", "red shoe"].

I am also not 100% sure that the model can be effectively used for multi-label classification, but this might depend on the task you are actually trying to solve. Nonetheless, the model was trained using a contrastive loss that kind of associates an object with a single description.

Can you rewrite your task as a series of binary tasks (e.g., long sleeves vs short sleeves)?
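
For what it's worth, here is a minimal sketch of that suggestion (the attribute names and label pairs below are made-up examples for illustration), scoring each binary attribute as its own two-label zero-shot task:

import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("patrickjohncyh/fashion-clip")
processor = CLIPProcessor.from_pretrained("patrickjohncyh/fashion-clip")

# hypothetical binary attributes; each pair is scored independently
binary_tasks = {
    "sleeve length": ["long sleeves", "short sleeves"],
    "neckline": ["v-neck", "crew neck"],
}

def score_attribute(image, labels):
    inputs = processor(text=labels, images=[image], return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_image: (1, num_labels); softmax over the two candidate labels
    return dict(zip(labels, outputs.logits_per_image.softmax(dim=-1)[0].tolist()))

Here `image` is assumed to be a PIL image; each attribute then yields its own two-way probability distribution, so the label-set dependence mentioned above is confined to each pair.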