hila-chefer / Transformer-MM-Explainability

[ICCV 2021 Oral] Official PyTorch implementation for "Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers", a novel method to visualize any Transformer-based network, including examples for DETR and VQA.

Question about the CLIP Demo #24

Closed: Hoyyyaard closed this issue 2 years ago

Hoyyyaard commented 2 years ago

Hello, I have two questions about the CLIP demo, and I hope you don't mind me asking.

(1) The variable "image_relevance" in the function "interpret": the line "image_relevance = R[:, 0, 1:]" is applied after updating the attention relevance map, and I am confused about why the first row of the relevance scores is chosen. Does this correspond to the statement in Section 3.1 of the paper, "to extract relevancies per text token, one should consider the first row of Rtt, and to extract the image token relevancies, consider the first row in Rti which describes the connections between the [CLS] token and each image token"? Or can you give me some other insight?

(2) variable "R_text" in function "show_heatmap_on_text" I also confused about why "CLS_idx = text_encoding.argmax(dim=-1)", The [CLS] token is usually the first one in tokens. Can you give me some insights about this ?

Thanks!

hila-chefer commented 2 years ago

Hi @Hoyyyaard, thanks for your interest!

  1. Indeed, this code extracts the relevance scores for the CLS token, which is the first token of the image sequence. We also discard the first token's relevance, since that token is the CLS itself (see the first sketch below).
  2. Usually the CLS token is the first one (especially for image sequences). For the text sequence, this is how the official CLIP implementation extracts the CLS index, so this code simply finds the CLS location and uses it for the interpretation (see the second sketch below).
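
To make the indexing in (1) concrete, here is a minimal sketch (not the repo's exact code; the tensor shapes are made up for illustration) of what "R[:, 0, 1:]" keeps and discards:

```python
import torch

# Minimal sketch: suppose R is the aggregated relevance matrix over the image
# sequence, where index 0 is the [CLS] token and indices 1..N are patch tokens.
batch, num_patches = 1, 7 * 7                  # hypothetical 7x7 patch grid
num_tokens = 1 + num_patches                   # [CLS] + patches
R = torch.rand(batch, num_tokens, num_tokens)

# Row 0 holds the relevance of every token with respect to [CLS];
# dropping column 0 discards the CLS-to-CLS entry, leaving one score
# per image patch.
image_relevance = R[:, 0, 1:]                  # shape: (batch, num_patches)

# The per-patch scores can then be reshaped to the patch grid and
# upsampled to the input resolution to produce a heatmap.
side = int(num_patches ** 0.5)
heatmap = image_relevance.reshape(batch, 1, side, side)
heatmap = torch.nn.functional.interpolate(heatmap, size=224, mode='bilinear')
```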
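
For (2), here is a minimal sketch of where the argmax-based CLS index comes from, assuming the standard OpenAI CLIP tokenizer (the R_text shape below is hypothetical):

```python
import torch
import clip  # OpenAI CLIP package, assumed here for tokenization

# The end-of-text token has the highest id in CLIP's vocabulary, so the
# argmax over the token ids recovers its position in each sequence. CLIP
# pools the text features from that position, which is why the demo treats
# it as the "CLS" index for text.
text_encoding = clip.tokenize(["a photo of a dog"])   # shape: (1, 77)
CLS_idx = text_encoding.argmax(dim=-1)                # position of the end-of-text token

# Hypothetical text relevance matrix over the 77 token positions: the row
# at CLS_idx gives the relevance of every text token to the pooled feature.
R_text = torch.rand(1, 77, 77)
text_relevance = R_text[0, CLS_idx, :]                # shape: (1, 77)
```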

I hope this helps. Best, Hila.