hila-chefer / Transformer-MM-Explainability

[ICCV 2021 Oral] Official PyTorch implementation for "Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers", a novel method to visualize any Transformer-based network, including examples for DETR and VQA.

Question about the CLIP Demo #24

Closed: Hoyyyaard closed this issue 2 years ago

Hoyyyaard commented 2 years ago

Hello, I have two questions about the CLIP demo, and I hope you don't mind me asking.

(1) The variable "image_relevance" in the function "interpret": the line "image_relevance = R[:, 0, 1:]" is applied after updating the attention relevance map, and I am confused about why the first row of the relevance scores is chosen. Does this correspond to the statement in Section 3.1 of the paper, "to extract relevancies per text token, one should consider the first row of Rtt, and to extract the image token relevancies, consider the first row in Rti which describes the connections between the [CLS] token and each image token"? Or can you give me some other insight?

(2) variable "R_text" in function "show_heatmap_on_text" I also confused about why "CLS_idx = text_encoding.argmax(dim=-1)", The [CLS] token is usually the first one in tokens. Can you give me some insights about this ?

Thanks!

hila-chefer commented 2 years ago

Hi @Hoyyyaard, thanks for your interest!

  1. Indeed, this code extracts the relevance scores for the CLS token, which is the first token of the image sequence. We also discard the first token's relevance, since that token is the CLS itself (see the first sketch below).
  2. Usually the CLS token is the first one (especially for image sequences). For the text sequence, this is how the official CLIP implementation extracts the CLS index, so this code simply finds the CLS location and uses it for the interpretation (see the second sketch below).
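
To make the indexing in (1) concrete, here is a minimal sketch (not the repo's exact code; the tensor shapes are made up for illustration) of what "R[:, 0, 1:]" keeps and discards:

```python
import torch

# Minimal sketch: suppose R is the aggregated relevance matrix over the image
# sequence, where index 0 is the [CLS] token and indices 1..N are patch tokens.
batch, num_patches = 1, 7 * 7                  # hypothetical 7x7 patch grid
num_tokens = 1 + num_patches                   # [CLS] + patches
R = torch.rand(batch, num_tokens, num_tokens)

# Row 0 holds the relevance of every token with respect to [CLS];
# dropping column 0 discards the CLS-to-CLS entry, leaving one score
# per image patch.
image_relevance = R[:, 0, 1:]                  # shape: (batch, num_patches)

# The per-patch scores can then be reshaped to the patch grid and
# upsampled to the input resolution to produce a heatmap.
side = int(num_patches ** 0.5)
heatmap = image_relevance.reshape(batch, 1, side, side)
heatmap = torch.nn.functional.interpolate(heatmap, size=224, mode='bilinear')
```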
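
For (2), here is a minimal sketch of where the argmax-based CLS index comes from, assuming the standard OpenAI CLIP tokenizer (the R_text shape below is hypothetical):

```python
import torch
import clip  # OpenAI CLIP package, assumed here for tokenization

# The end-of-text token has the highest id in CLIP's vocabulary, so the
# argmax over the token ids recovers its position in each sequence. CLIP
# pools the text features from that position, which is why the demo treats
# it as the "CLS" index for text.
text_encoding = clip.tokenize(["a photo of a dog"])   # shape: (1, 77)
CLS_idx = text_encoding.argmax(dim=-1)                # position of the end-of-text token

# Hypothetical text relevance matrix over the 77 token positions: the row
# at CLS_idx gives the relevance of every text token to the pooled feature.
R_text = torch.rand(1, 77, 77)
text_relevance = R_text[0, CLS_idx, :]                # shape: (1, 77)
```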

I hope this helps. Best, Hila.