mlfoundations / open_clip

An open source implementation of CLIP.

Is there a way to do multi-label classification with CLIP? #334

Closed justlike-prog closed 3 weeks ago

justlike-prog commented 1 year ago

The concrete use case is as follows. I have the classes baby, child, teen, and adult. My idea was to use the similarity between text and image features (for the text features I used the prompt 'there is at least one (c) in the photo', with c being one of the 4 classes).

I went through quite a lot of examples, but I am running into the issue that the similarity scores often vary widely for a fixed class, and/or similar classes (like baby and child) produce nearly identical scores. For the similarity score I use the cosine similarity multiplied by 2.5 to stretch it into the interval [0, 1], as is done in the CLIPScore paper.
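For reference, a minimal sketch of that rescaling (the clamping of negative cosines to zero follows the CLIPScore paper; the function name is mine):

```python
import torch
import torch.nn.functional as F

def clip_score(image_features: torch.Tensor, text_features: torch.Tensor) -> torch.Tensor:
    """Rescaled cosine similarity as in the CLIPScore paper:
    2.5 * max(cos, 0). In practice CLIP cosines rarely exceed ~0.4,
    which is why the result usually lands in [0, 1]."""
    cos = F.cosine_similarity(image_features, text_features, dim=-1)
    return 2.5 * cos.clamp(min=0)
```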

Setting a threshold in that sense doesn't seem possible.

Does anyone have an idea for this? I feel quite stuck here and don't know how I should proceed.

mitchellnw commented 1 year ago

not sure if it would work, but have you by any chance looked at using captions like "this is a photo of a " + ", ".join(subset), where subset iterates over all subsets of your current classes? then you'd have 2^4 = 16 classes instead of 4
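A minimal sketch of that subset enumeration (the caption template and the wording for the empty subset are assumptions, not something mitchellnw specified):

```python
from itertools import combinations

classes = ["baby", "child", "teen", "adult"]

# All 2^4 = 16 subsets of the label set, each turned into one caption.
captions = []
for r in range(len(classes) + 1):
    for subset in combinations(classes, r):
        if subset:
            captions.append("this is a photo of a " + ", ".join(subset))
        else:
            # Wording for the empty subset (no target class) is a guess.
            captions.append("this is a photo")

print(len(captions))  # 16
```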

AmericanPresidentJimmyCarter commented 1 year ago

I am attempting this now, training on captions with multiple labels and then querying with single labels, and it works pretty badly compared to any normal multi-label classifier.

{'f1': 0.08291136675917679, 'precision': 0.07481833065257353, 'recall': 0.10588978264912757}

If I figure this out I will let you know.

Msalehi237 commented 1 year ago

Take a look at this paper: "DualCoOp: Fast Adaptation to Multi-Label Recognition with Limited Annotations"

I struggled with this problem for a while and this approach is working for me.

travellingsasa commented 8 months ago

@AmericanPresidentJimmyCarter did you find a way to improve the multi-label performance?

AmericanPresidentJimmyCarter commented 7 months ago

No, I just trained multilabel classifiers instead and those worked.
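A minimal sketch of what such a multi-label classifier could look like (assumed, not the commenter's actual code): a linear head on frozen CLIP image embeddings, trained with per-class sigmoid + binary cross-entropy instead of softmax, so each label is scored independently and per-class thresholds are meaningful.

```python
import torch
import torch.nn as nn

num_classes = 4   # e.g. baby, child, teen, adult
embed_dim = 512   # ViT-B/32 image embedding size

head = nn.Linear(embed_dim, num_classes)
criterion = nn.BCEWithLogitsLoss()

image_features = torch.randn(8, embed_dim)               # stand-in for frozen CLIP features
targets = torch.randint(0, 2, (8, num_classes)).float()  # multi-hot labels

logits = head(image_features)
loss = criterion(logits, targets)
loss.backward()

probs = torch.sigmoid(logits)  # independent per-class probabilities
preds = (probs > 0.5).int()    # thresholding now works per class
```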

miguelalba96 commented 7 months ago

@travellingsasa

You can use some sort of anti-text or placeholder text to do multi-label classification. For example, if your objective is checking whether "red" is present in an image of a dress, use:

["a red dress", "a dress"]

That will give you a probability distribution, and you take the zero index.
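A minimal sketch of that two-prompt trick using the standard open_clip zero-shot API (the model/pretrained tags and the image path are placeholders):

```python
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")

image = preprocess(Image.open("dress.jpg")).unsqueeze(0)  # hypothetical image
texts = tokenizer(["a red dress", "a dress"])  # target prompt + anti-text

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(texts)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

p_red = probs[0, 0].item()  # the "zero index": P("a red dress" vs "a dress")
```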

AmericanPresidentJimmyCarter commented 7 months ago

> @travellingsasa
>
> You can use some sort of anti-text or placeholder text to do multi-label classification. For example, if your objective is checking whether "red" is present in an image of a dress, use:
>
> ["a red dress", "a dress"]
>
> That will give you a probability distribution, and you take the zero index.

How does that work? If the image contains neither, your result will be essentially random: the softmax forces a distribution over the two prompts whether or not either one matches. I think it only works if you have a multi-label classifier to identify a dress in the first place.
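A tiny illustration of that failure mode (the similarity numbers are made up): low similarities for both prompts still normalize to a confident-looking distribution.

```python
import torch

# Hypothetical raw cosine similarities for an image that matches
# neither "a red dress" nor "a dress" (e.g. a photo of a cat).
sims = torch.tensor([[0.02, 0.03]])

# After the usual 100x logit scaling, softmax still sums to 1,
# so tiny noise in the similarities becomes a decisive answer.
probs = (100.0 * sims).softmax(dim=-1)
print(probs)  # tensor([[0.2689, 0.7311]]) -- signal-free but confident
```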