locuslab / T-MARS

Code for T-MARS data filtering
https://tmars-clip.github.io
MIT License

Robustness to typographic attacks? #1

Closed: vishaal27 closed this issue 1 year ago

vishaal27 commented 1 year ago

Hello,

Thanks so much for releasing your great paper. I enjoyed reading it, and kudos on the utility analysis in Section 7; it presents a clean picture of the paper's original motivation. Since models trained with T-MARS (and the other intersecting baselines) learn more "visually salient" representations due to the text-masking process, it seems likely that this could be an effective strategy for mitigating typographic attacks: because models trained on the filtered dataset place less weight on learning to do OCR, perhaps they learn to ignore text superimposed on images. I am curious whether you have run any such robustness checks against typographic attacks, or any inference-time analysis comparing the original models with the models trained on the filtered data, on samples that contain text in the images.

SachinG007 commented 1 year ago

Hi @vishaal27, yes, the model trained on the filtered subset showed much less variation in CLIP score depending on whether text was present in the image, compared to standard CLIP models. We haven't formally checked robustness to typographic attacks, but intuitively we agree that models trained on the T-MARS-filtered subset should be robust to such attacks.
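
For reference, here is a minimal sketch of the kind of inference-time check described above: compare the image-caption CLIP score before and after overlaying an unrelated word on the image. It assumes an off-the-shelf open_clip ViT-B/32 checkpoint; the image path, caption, and attack word are hypothetical placeholders, and a T-MARS-trained checkpoint could be swapped in for comparison.

```python
# Sketch of a typographic-attack sensitivity check (assumes open_clip is installed;
# the image, caption, and attack word below are hypothetical examples).
import torch
import open_clip
from PIL import Image, ImageDraw

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"  # swap in a T-MARS-trained checkpoint to compare
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

def clip_score(image: Image.Image, caption: str) -> float:
    """Cosine similarity between the image and caption embeddings."""
    with torch.no_grad():
        img_feat = model.encode_image(preprocess(image).unsqueeze(0))
        txt_feat = model.encode_text(tokenizer([caption]))
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    return (img_feat @ txt_feat.T).item()

def add_typographic_text(image: Image.Image, attack_word: str) -> Image.Image:
    """Overlay an unrelated word on the image (a simple typographic attack)."""
    attacked = image.copy()
    ImageDraw.Draw(attacked).text((10, 10), attack_word, fill="white")
    return attacked

# Hypothetical example: a dog photo captioned "a photo of a dog", attacked with "cat".
image = Image.open("dog.jpg").convert("RGB")
clean = clip_score(image, "a photo of a dog")
attacked = clip_score(add_typographic_text(image, "cat"), "a photo of a dog")
print(f"clean: {clean:.3f}  attacked: {attacked:.3f}  drop: {clean - attacked:.3f}")
```

A smaller drop between the clean and attacked scores would indicate less sensitivity to text overlaid on the image.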

Thanks