Thank you for your paper. It is very interesting. I especially like the additional sections on Broader Impact and Limitations as they are very detailed.
I have a question about the surveillance experiments in Section 7.2: could you expand on how you set up the coarse- and fine-grained classification? I don't understand what the "close" text description is.
If I understood correctly, in Section 7.2 you collect 515 CCTV images and obtain ground-truth captions by hand-captioning them. Then, given a CCTV image and 6 candidate text descriptions, the CLIP model predicts the text that most closely matches the image. In addition, you sometimes include a "close" text description.
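To make sure I'm reading the setup correctly, here is a minimal sketch of my mental model of that zero-shot classification step, written against the open-source CLIP package; the checkpoint name, image path, and caption strings are my own placeholders, not taken from the paper:

```python
import torch
import clip
from PIL import Image

# Placeholder checkpoint and device; the paper may use a different backbone.
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# One hand-captioned CCTV frame (placeholder path).
image = preprocess(Image.open("cctv_frame.jpg")).unsqueeze(0).to(device)

# The 6 candidate class descriptions; these strings are invented for illustration.
captions = [
    "an empty parking lot",
    "a parking lot with parked cars",
    "a street corner with pedestrians",
    "an office lobby",
    "a loading dock with a truck",
    "a stairwell",
]
text = clip.tokenize(captions).to(device)

with torch.no_grad():
    # CLIP scores the image against every candidate caption;
    # the highest-scoring caption is taken as the prediction.
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1)

predicted_caption = captions[probs.argmax().item()]
print(predicted_caption)
```

In particular, I'm trying to understand what the extra "close" description looks like when it is added to a candidate list like the one above.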
So how does the "close" text description differ from the 6 given options and from the ground truth? Does the "close" text contain an element of the ground-truth caption, and is that why the model keeps choosing it?