openai / CLIP

CLIP (Contrastive Language-Image Pretraining): predict the most relevant text snippet given an image

Using CLIP in transfer learning for multilabel classification #334

Closed EachOneChew closed 11 months ago

EachOneChew commented 1 year ago

First of all I apologize if this is the wrong place to be posting a question like this. If that is the case, please let me know (and point me towards where I should go).

When applied to zero-shot learning, CLIP considers an image and assigns to it the most relevant text prompt. As stated in the paper, this is a multinomial logistic regression problem, i.e. a type of multiclass classification.

I am hoping to adapt CLIP to perform multilabel classification, wherein CLIP assigns all relevant text prompts when given an image and a set of text prompts. Another way of framing the problem is to treat it as multiple binary classification problems: for each text prompt, decide whether or not to assign it to the image.

To this end I am simply taking the unscaled logits output by CLIP (produced by taking the cosine similarity between image and text embeddings). Instead of applying softmax over the unscaled logits, I assign to the image all text prompts whose similarity score exceeds a chosen threshold.
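A minimal sketch of what I mean, using the openai/clip package (the prompts, image path, and threshold value are just placeholders that would need tuning):

import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

prompts = ["a photo of a cat", "a photo of a dog", "a photo of grass"]  # placeholder prompts
threshold = 0.25  # placeholder; tune on a validation set

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
tokens = clip.tokenize(prompts).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(tokens)

# normalize so the dot product is the cosine similarity (the unscaled logits)
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
similarities = (image_features @ text_features.T).squeeze(0)

# multilabel: keep every prompt above the threshold instead of softmax + argmax
predicted = [p for p, s in zip(prompts, similarities.tolist()) if s > threshold]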

Doing so has gotten me decent results so far, but I would like to ask for the input of people more knowledgeable than me. Is it safe to assume that the cosine similarities between image and text embeddings are meaningful indicators of their relatedness?

A slightly related question: I have only ever observed positive cosine similarities (or logits_per_image) output by CLIP. I was under the impression that logit scores should fall in the range (-1, 1), or (-100, 100) after multiplying by logit_scale. Why is this?

Thank you!

EachOneChew commented 1 year ago

Coming back to give an update on this: the strategy I described worked very well. I did some fine-tuning using the original CLIP objective, which gave a large improvement on multilabel classification (ignoring the problem of in-batch false negatives).
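For anyone curious, by the original objective I mean the symmetric contrastive loss from the paper; a rough sketch (the embeddings are assumed L2-normalized, and the names are placeholders):

import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds, text_embeds, logit_scale):
    # rows are images, columns are texts; embeddings assumed L2-normalized
    logits_per_image = logit_scale * image_embeds @ text_embeds.t()
    logits_per_text = logits_per_image.t()
    # each image's positive is the text at the same batch index,
    # so unrelated-but-matching pairs become in-batch false negatives
    labels = torch.arange(image_embeds.size(0), device=image_embeds.device)
    return (F.cross_entropy(logits_per_image, labels)
            + F.cross_entropy(logits_per_text, labels)) / 2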

Regarding the positive logits, here is a paper explaining the behaviour.

fm1320 commented 12 months ago

This is a great approach. Do you have any code or learnings to share?

talrejanikhil commented 11 months ago

I am also interested in knowing how CLIP can be used for multilabel classification

talrejanikhil commented 11 months ago

@EachOneChew could you share a code sample showing how you achieved this?

EachOneChew commented 11 months ago

@talrejanikhil @fm1320 Hi, I unfortunately do not have access to the code I wrote. Here are some points that may help:

That's all I remember, good luck with your projects 👍

abhijitherekar commented 7 months ago

Hi @EachOneChew, thanks for the help with the training code, but when you say the following:

I used CLIP ViT-L/14 and trained with the bottom half of layers on both encoders frozen (iirc one of the two transformers has more layers, so you freeze more layers on that one). If you don't do this you get memory issues unless you use a distributed loss.

Can you point me to the training code you used to freeze the layers of the CLIP model? Also, can you explain how you reached the conclusion that those layers should be frozen?

Is it mentioned in some official paper?

Thanks

EachOneChew commented 7 months ago

@abhijitherekar there is no official reference to freezing layers, because OpenAI trained with a distributed loss. Freezing layers was an adjustment I made myself to account for memory limitations on individual devices, and I saw good results with it.

To freeze a layer, set requires_grad to False on its parameters. For example, to freeze every parameter in the model:

for param in model.parameters():
    param.requires_grad = False

Look through the model parameters yourself to determine which to freeze.
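For instance, here is a rough sketch of freezing the bottom half of the transformer blocks in each encoder, using the attribute names from the openai/clip implementation (other wrappers, e.g. Hugging Face, name these differently):

import clip

model, _ = clip.load("ViT-L/14")

def freeze_bottom_half(transformer):
    # transformer.resblocks is an nn.Sequential of residual attention blocks
    blocks = transformer.resblocks
    for block in list(blocks)[: len(blocks) // 2]:
        for param in block.parameters():
            param.requires_grad = False

freeze_bottom_half(model.visual.transformer)  # vision encoder (24 blocks for ViT-L/14)
freeze_bottom_half(model.transformer)         # text encoder (12 blocks)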

miguelalba96 commented 6 months ago

@EachOneChew How did you determine the threshold for similarities? I noticed that in general CLIP similarities are very low between image-text pairs but relatively high between similar text-text and image-image pairs, even after fine-tuning on massive data. As a quick test, take a pre-trained model from HF, an image of a tiger, and the caption "a photo of a tiger"; the cosine similarity will be around 0.3.
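Something like this (a rough sketch with the Hugging Face transformers API; the checkpoint name and image path are just placeholders):

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("tiger.jpg")
inputs = processor(text=["a photo of a tiger"], images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# image_embeds and text_embeds come out L2-normalized, so the dot product is the cosine similarity
cosine = (outputs.image_embeds @ outputs.text_embeds.T).item()
print(cosine)  # typically around 0.3 even for a well-matched pair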

If you have fine-tuned CLIP, have you checked during training how the similarity between images and their corresponding texts increases? For me it hasn't gone above 0.31-0.32 when fine-tuning with a distributed loss on massive fashion data, so I stopped paying attention to it and now only monitor zero-shot accuracy like most people online (that does increase).

For multi-label I take the top-k predictions, and if the difference between the largest and second-largest probability is small I output both.
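Roughly like this (a sketch; probs is whatever probability distribution you compute over the labels, and the margin is a placeholder to tune):

import torch

def top_labels_with_margin(probs: torch.Tensor, margin: float = 0.05):
    # keep the top prediction, plus the runner-up when the gap between them is small
    values, indices = probs.topk(2)
    selected = [indices[0].item()]
    if (values[0] - values[1]).item() < margin:
        selected.append(indices[1].item())
    return selected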