Hey @betterze, thanks for your interest in our work! Glad to hear you find it useful!
We are currently working on adjusting our method to multi-modal transformer-based models such as VisualBERT, ViLBERT, VL-BERT, and more, and aligning the text tokens with the image regions is definitely a priority there.
Regarding CLIP, I really liked the proposed method too, and I think it would be fascinating to attempt to explain it, but CLIP is quite different in that it does not contain an attention layer involving both the text and the images. Our method relies heavily on the connections that the attention matrix creates between tokens, and here we don't have such a matrix for the text and image combined (we have one for the text and one for the image separately). For that reason, work on CLIP isn't included in our current efforts; we may expand to it in the future, but it isn't currently prioritized.
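To make that concrete, the only place where text and image interact in CLIP is a similarity between the two pooled embeddings; roughly something like this (a simplified sketch, not CLIP's actual code, with illustrative names):

```python
import torch

# Simplified sketch of CLIP's forward pass (illustrative, not the real implementation).
# Each modality goes through its own transformer; there is no layer where
# text tokens attend to image patches or vice versa.
def clip_similarity(image, text_tokens, image_encoder, text_encoder):
    image_features = image_encoder(image)      # ViT over image patches only
    text_features = text_encoder(text_tokens)  # transformer over text tokens only

    # normalize and compare -- this dot product is the *only* text-image interaction
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    return image_features @ text_features.t()  # cosine-similarity logits
```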
Hey, @hila-chefer,
Thank you for your detailed reply. CLIP has two separate transformer encoders, so it is hard to visualize the attention from each word to the image. So maybe it is not possible to visualize the attention of each word after all.
By the way, if I understand correctly, in ./baselines/ViT/ViT_LRP you reimplement the ViT model. CLIP also uses a ViT model as its image encoder. Is it possible to load the CLIP image encoder into the ViT model of your implementation?
I tried model.load_state_dict(torch.load(clip_path)), but it does not work. I believe this is due to the implementation differences between yours and CLIP's, is that correct?
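For reference, this is roughly what I did (the constructor name and the checkpoint path are just placeholders for my setup):

```python
import torch
# assuming the LRP-enabled ViT is built via a constructor like this one
from baselines.ViT.ViT_LRP import vit_base_patch16_224

clip_path = "clip_visual_state_dict.pt"  # saved state dict of CLIP's image encoder

model = vit_base_patch16_224()
clip_state = torch.load(clip_path, map_location="cpu")

# comparing the parameter names shows the two implementations use different
# key names / module structure, so the strict load fails
print(set(clip_state.keys()) - set(model.state_dict().keys()))
model.load_state_dict(clip_state)  # raises "Missing key(s) / Unexpected key(s) in state_dict"
```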
Thank you very much for your help. You are very nice.
Best Wishes,
Alex
@betterze of course, happy to help.
In this setting, it will be possible to visualize each modality on its own with our method (each text token w.r.t. each text token, and the same for image patches). You could also visualize each modality w.r.t. its "classification token" (in CLIP, for the ViT it's the actual classification token, and for the language transformer it's the [EOS] token, if I'm not mistaken). Regarding the implementation for ViT, mine is based on two other great open source repos (https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/vision_transformer.py, https://github.com/lucidrains/vit-pytorch), and other than the fact that the implementations are different, I think they also mention in the CLIP paper that they changed the original ViT architecture a bit, so that's worth noting.
In any case, all the layers with LRP support should already be in our layers implementation, and I think that should help you in case you want to use our method :)
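Just to illustrate the classification-token idea: once you have an attention (or relevance) matrix from the image encoder, visualizing the image w.r.t. the CLS token basically means taking the CLS token's row and reshaping it over the patch grid. A minimal sketch (function name, head-averaging choice, and grid size are illustrative assumptions, not part of our code):

```python
import torch

def cls_relevance_to_map(attn, grid_size=7):
    """attn: [num_heads, tokens, tokens] attention (or relevance) matrix of one layer,
    where token 0 is the CLS token and the remaining tokens are image patches.
    Returns a [grid_size, grid_size] map of CLS-to-patch scores."""
    attn = attn.mean(dim=0)       # average over heads (a simple rollout-style choice)
    cls_to_patches = attn[0, 1:]  # CLS token's row, dropping the CLS->CLS entry
    return cls_to_patches.reshape(grid_size, grid_size)
```

The same idea applies to the text side, just taking the [EOS] token's row over the text tokens instead.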
@hila-chefer Thanks for your reply. Your suggestions are very helpful. I will try them.
Hi @betterze, great news :) we've added support for CLIP with our new method in our new repo, there's also a colab notebook to run examples! hope this helps :)
@hila-chefer Thank you for sharing with me this great work, the results look very impressive.
Dear Hila,
Thank you very much for your great work. I really like it.
Recently, OpenAI released a model that embeds text and images into a common space using transformers (https://github.com/openai/CLIP). I am wondering if I can use your work to visualize the CLIP model, such that for each word in the sentence, we can see which areas of the image the word pays attention to?
An example can be found in VL-BERT (https://github.com/jackroos/VL-BERT). In that work, they only show layer-wise attention, which is not as good as what you show in your paper.
Thank you for your help.
Best Wishes,
Alex