openai / CLIP

CLIP (Contrastive Language-Image Pretraining): predict the most relevant text snippet given an image
MIT License

Paper: Surveillance section #34

Closed · nikky4D closed this issue 3 years ago

nikky4D commented 3 years ago

Thank you for your paper. It is very interesting. I especially like the additional sections on Broader Impact and Limitations as they are very detailed.

I have a question on the surveillance section 7.2: Could you expand on how you set up the coarse- and fine-grained classification? I don't understand what the "close" text description is.

If I understand correctly, in section 7.2 you collect 515 CCTV images and obtain ground-truth captions by hand-captioning them. Then, given a CCTV image and 6 different input text descriptions, the CLIP model predicts the text that most closely matches the image. In addition, you sometimes include a "close" text description.

So, how is the "close" text description different from the 6 given options and the ground truth? Does the "close" text contain an element of the ground truth, and is that why the model keeps choosing it?
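
For reference, here is a minimal sketch of the zero-shot matching I'm describing above, using the standard `clip` package. The filename and the candidate captions are hypothetical stand-ins, not the actual prompts from section 7.2:

```python
# Sketch of CLIP zero-shot matching: score one image against several
# candidate text descriptions and pick the highest-scoring one.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical CCTV frame and candidate captions (placeholders only).
image = preprocess(Image.open("cctv_frame.jpg")).unsqueeze(0).to(device)
captions = [
    "an empty parking lot",
    "a parking lot with cars",
    "a person walking through a parking lot",
    "a delivery truck at a loading dock",
    "a crowded street at night",
    "an office lobby",
]
text = clip.tokenize(captions).to(device)

with torch.no_grad():
    # Encode both modalities; logits_per_image holds the similarity of the
    # image to each caption, so the argmax is the zero-shot prediction.
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

for caption, p in zip(captions, probs[0]):
    print(f"{p:.3f}  {caption}")
```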

nikky4D commented 3 years ago

Figured it out from the Colab. Thanks.