openai / CLIP

CLIP (Contrastive Language-Image Pretraining): predict the most relevant text snippet given an image
MIT License

Paper: Surveillance section #34

Closed · nikky4D closed this issue 3 years ago

nikky4D commented 3 years ago

Thank you for your paper. It is very interesting. I especially like the additional sections on Broader Impact and Limitations as they are very detailed.

I have a question on the surveillance section 7.2: Could you expand on how you set up the coarse- and fine-grained classification? I don't understand what the "close" text description is.

If I understand correctly, in section 7.2 you collect 515 CCTV images and obtain ground-truth captions by hand-captioning them. Then, given a CCTV image and 6 different input text descriptions, the CLIP model predicts the text that most closely matches the image. In addition, you sometimes include a "close" text description.

So, how is the "close" text description different from the 6 given options and the ground truth? Does the "close" text contain an element of the ground truth, and is that why the model keeps choosing it?
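
For reference, here is a minimal sketch of the zero-shot matching I'm describing above, using the standard `clip` package. The filename and the candidate captions are hypothetical stand-ins, not the actual prompts from section 7.2:

```python
# Sketch of CLIP zero-shot matching: score one image against several
# candidate text descriptions and pick the highest-scoring one.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical CCTV frame and candidate captions (placeholders only).
image = preprocess(Image.open("cctv_frame.jpg")).unsqueeze(0).to(device)
captions = [
    "an empty parking lot",
    "a parking lot with cars",
    "a person walking through a parking lot",
    "a delivery truck at a loading dock",
    "a crowded street at night",
    "an office lobby",
]
text = clip.tokenize(captions).to(device)

with torch.no_grad():
    # Encode both modalities; logits_per_image holds the similarity of the
    # image to each caption, so the argmax is the zero-shot prediction.
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

for caption, p in zip(captions, probs[0]):
    print(f"{p:.3f}  {caption}")
```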

nikky4D commented 3 years ago

Figured it out from the Colab. Thanks.