hila-chefer / Transformer-MM-Explainability

[ICCV 2021 Oral] Official PyTorch implementation of "Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers," a novel method to visualize any Transformer-based network, including examples for DETR and VQA.

Readability of CLIP notebook #28

Open · josh-freeman opened this issue 1 year ago

josh-freeman commented 1 year ago

Oh I forgot to mention:

I also added a bit of documentation to the interpret function.
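
For reference, a rough sketch of the kind of docstring I mean; the signature shown here is illustrative and may not match the interpret function in the CLIP notebook exactly:

```python
# Illustrative only: the actual argument list of interpret in the CLIP notebook may differ.
def interpret(image, texts, model, device):
    """Compute relevancy maps for a CLIP model on an image-text pair.

    Args:
        image: preprocessed image tensor of shape (1, 3, H, W).
        texts: tokenized text tensor of shape (num_texts, context_length).
        model: CLIP model whose attention blocks expose attention maps and their gradients.
        device: torch device on which the model and inputs live.

    Returns:
        Relevancy scores for the image patches and text tokens, one map per text prompt.
    """
    ...
```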

hila-chefer commented 1 year ago

Hi @josh-freeman, thanks for your contribution to this repo! It'll take me some time to review and approve your PR since it contains a significant number of changes; I'll get to it ASAP.

josh-freeman commented 1 year ago

No worries. I'm pretty sure the diff is inflated by something like a CRLF-to-LF line-ending conversion; I'm surprised it says I changed that much.

guanhdrmq commented 1 year ago

Dear all,

I have a question about ViLT. I am trying to reproduce the VisualBERT example and adapt it to ViLT. Could you point me to where the save_visual_results function is defined? I use ViLT as the multimodal transformer, but I cannot use num_tokens = image_attn_blocks[0].attn_probs.shape[-1] to set the number of tokens. For example, for ViLT on the VQA task with a 384×384 image, the mixed vision-and-text sequence has 185 tokens including the cls token: 144 vision tokens and 40 text tokens (max length).
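
For concreteness, here is a minimal sketch of the token accounting above, assuming ViLT's 32×32 patch embedding and a 40-token text budget; the (cls, text, image) ordering and the slicing of image tokens out of the mixed sequence are assumptions for illustration, not code from this repo:

```python
import torch

# Token accounting for ViLT on VQA with a 384x384 input, assuming 32x32 patches.
image_size = 384
patch_size = 32
max_text_len = 40

num_image_tokens = (image_size // patch_size) ** 2    # 12 * 12 = 144 patch tokens
num_text_tokens = max_text_len                        # 40 text tokens (max length)
num_tokens = 1 + num_image_tokens + num_text_tokens   # +1 for the cls token -> 185

# The VisualBERT example reads each block's attention from attn_probs with shape
# (heads, num_tokens, num_tokens). ViLT mixes text and image tokens in a single
# sequence, so a relevancy matrix R of that size has to be sliced to recover the
# image part before it can be reshaped into a patch grid.
R = torch.eye(num_tokens)                              # identity init, as in the paper
image_slice = slice(1 + num_text_tokens, num_tokens)   # assumed: last 144 positions are patches
R_image = R[0, image_slice]                            # relevancy of cls w.r.t. image patches
patch_grid = R_image.reshape(image_size // patch_size, image_size // patch_size)

print(num_tokens)          # 185
print(patch_grid.shape)    # torch.Size([12, 12])
```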

Thanks very much