hila-chefer / Transformer-MM-Explainability

[ICCV 2021 Oral] Official PyTorch implementation of "Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers", a novel method to visualize any Transformer-based network. Includes examples for DETR and VQA.

Questions about CLIP visualization. #10

Closed tingxueronghua closed 2 years ago

tingxueronghua commented 2 years ago

I do not understand why the CLIP visualization only aggregates the last two layers, as controlled by num_layers. Could you share some insights with us?

The "num_layers" variable makes the CLIP heat map clearer, but I think the visualizations of CLIP and ViT should be similar, or at least comparable, because they share the same architecture.

hila-chefer commented 2 years ago

Hi @tingxueronghua, thanks for your interest in our work! See this issue regarding ViT-B/16. The number of layers we consider expands the context of the tokens from the last layers. For CLIP with ViT-B/32 the context expansion has little to no negative effect, but for ViT-B/16 people noticed that the expanding context causes artifacts to be further highlighted, so I added support for controlling the level of context expansion. See the discussion on that issue for further details.
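For concreteness, here is a minimal sketch (not the repo's exact code) of how a num_layers-style cutoff limits the aggregation: only the last blocks' gradient-weighted attention maps are rolled into the relevance matrix, so earlier layers cannot expand each token's context. The names `attn_maps`, `attn_grads`, and `start_layer` are illustrative placeholders for whatever the notebook actually stores on each attention block.

```python
import torch

def aggregate_relevance(attn_maps, attn_grads, start_layer):
    """Roll out gradient-weighted attention, starting only from `start_layer`.

    attn_maps / attn_grads: lists of [heads, tokens, tokens] tensors, one per
    Transformer block (illustrative shapes, assumed to be saved during the
    forward/backward pass).
    start_layer: index of the first block to include; len(attn_maps) - num_layers
    keeps only the last num_layers blocks.
    """
    num_tokens = attn_maps[0].shape[-1]
    # Start from the identity: each token is initially relevant only to itself.
    R = torch.eye(num_tokens, dtype=attn_maps[0].dtype, device=attn_maps[0].device)
    for i, (cam, grad) in enumerate(zip(attn_maps, attn_grads)):
        if i < start_layer:
            continue  # skipped layers contribute no context expansion
        # Head-average of the positive gradient-weighted attention.
        cam = (grad * cam).clamp(min=0).mean(dim=0)
        R = R + torch.matmul(cam, R)
    return R
```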

I hope I was able to answer your question, but feel free to ask for clarifications if you need them.

Best, Hila.

tingxueronghua commented 2 years ago

Thanks for your explanation! Sorry to bother you, but I have another question. Given "num_layers", are we assuming that the i-th token of an intermediate representation corresponds to the i-th token of the input? I have seen other visualization libraries consider intermediate layers, but I am not sure whether this is just an assumption.

hila-chefer commented 2 years ago

Hi @tingxueronghua, feel free to ask anything :) If I understand your question correctly, you are indeed right. We assume that the i-th row of the attention matrix corresponds to the i-th token in the original sequence. When we do not propagate relevance all the way back to the first attention layer, we are indeed assuming that the i-th intermediate representation still corresponds to the i-th input token.
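To make that assumption concrete, a hypothetical helper like the one below reads the relevance matrix exactly this way: row/column i is treated as input token i, so the [CLS] row can be reshaped directly into the patch grid. The function name, the `grid_size` default, and the token ordering are illustrative assumptions, not part of the repo's API.

```python
def cls_relevance_to_heatmap(R, grid_size=7):
    # R: [tokens, tokens] relevance matrix from the sketch above.
    # Token 0 is assumed to be [CLS] and tokens 1..N the image patches in
    # raster order (e.g. 7x7 for CLIP ViT-B/32 at 224px input).
    patch_relevance = R[0, 1:]                      # relevance of each patch for [CLS]
    return patch_relevance.reshape(grid_size, grid_size)
```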

tingxueronghua commented 2 years ago

Thank you very much! I'll close this issue since all my questions have been answered.