Closed vishaal27 closed 1 year ago
Hi @vishaal27, yes, the model trained on the filtered subset showed much less variation in CLIP score depending on whether text was present in the image, compared to standard CLIP models. We haven't formally evaluated robustness to typographic attacks, but intuitively we agree that models trained on the T-MARS-filtered subset should be robust to such attacks.
Thanks
Hello,
Thanks so much for releasing your great paper; I enjoyed reading it, and kudos on the utility analysis in Section 7, which presents a clean picture of the paper's original motivation. Since the models trained with T-MARS (and other intersecting baselines) learn more "visually salient" representations due to the text-masking process, it seems likely that this could be an effective strategy for mitigating typographic attacks: because models trained on this filtered dataset "place less weight" on learning to do OCR, perhaps they abstract away / learn to ignore text imposed onto images? I am curious whether you have done any such robustness checks on typographic attacks, or any inference-time comparison between the original models and the models trained on filtered data, on samples containing text in the images?
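For concreteness, the inference-time check asked about above could be sketched roughly as follows. This is a hypothetical probe, not anything from the paper: the idea is to overlay a distractor word onto an image (the classic typographic attack) and compare a model's predictions on the clean vs. attacked image. The CLIP-scoring step is left as a comment since it depends on which checkpoint is used; only the attack-image construction with Pillow is shown.

```python
# Hypothetical typographic-attack probe (a sketch, not the paper's protocol).
# We paste a distractor word onto an image, then one would compare a model's
# zero-shot predictions on the clean vs. attacked versions.

from PIL import Image, ImageDraw


def add_typographic_attack(img: Image.Image, word: str) -> Image.Image:
    """Return a copy of `img` with `word` written on a white strip at the bottom."""
    attacked = img.copy()
    draw = ImageDraw.Draw(attacked)
    w, h = attacked.size
    # White background strip so the distractor text is legible on any image.
    draw.rectangle([0, h - 20, w, h], fill="white")
    draw.text((4, h - 18), word, fill="black")
    return attacked


# Downstream usage (assumed setup, e.g. with the open_clip library):
#   model, _, preprocess = open_clip.create_model_and_transforms(...)
#   probs_clean    = zero_shot_probs(model, preprocess(img))
#   probs_attacked = zero_shot_probs(model, preprocess(add_typographic_attack(img, "iPod")))
# A model robust to typographic attacks should show a small gap between the two
# distributions; a text-reliant model should flip toward the overlaid word.
```

Comparing this gap between a standard CLIP checkpoint and a T-MARS-trained one would directly test the intuition that text-masked filtering reduces reliance on OCR-like features.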