Hi @FarzanT, thanks for your interest! I’m not very familiar with the exact implementation details of the Reformer, but there are ways to deal with non-square attention matrices. The easiest would be to simply consider the last attention layer only; that way there’s no need to multiply attention matrices across layers.
If you still wish to track the attention through the layers, please consider following the model’s forward pass, as suggested in the paper.
Best, Hila.
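For concreteness, here is a minimal sketch of the "last attention layer only" idea. The function name, tensor shapes, and the optional gradient weighting are illustrative assumptions rather than the repository's actual API, and for an efficient transformer you would first need to materialize (or approximate) the last layer's attention rows in dense form:

```python
import torch

def last_layer_relevance(attn_last, grad_last=None, cls_index=0):
    """Token relevance from the final attention layer only (no cross-layer rollout).

    attn_last: (num_heads, seq_len, seq_len) attention weights of the last layer.
    grad_last: optional gradients w.r.t. attn_last, same shape; if given, the
               attention is gradient-weighted, otherwise raw attention is used.
    cls_index: index of the [CLS]/classification token used as the query row.
    """
    if grad_last is not None:
        # keep only positive, gradient-weighted contributions
        attn_last = (grad_last * attn_last).clamp(min=0)
    # average over heads, then read off the [CLS] query's attention to all tokens
    relevance = attn_last.mean(dim=0)[cls_index]
    # normalize so the scores sum to 1 for visualization
    return relevance / (relevance.sum() + 1e-8)

# toy usage: 12 heads, 197 tokens (e.g. a ViT sequence with a [CLS] token)
attn = torch.rand(12, 197, 197).softmax(dim=-1)
print(last_layer_relevance(attn).shape)  # torch.Size([197])
```

Because only one layer is used, the square n×n matrix never has to be reconstructed at every layer, which is what makes this variant easier to adapt to sparse or low-rank attention schemes.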
Hello, excellent work!
I was wondering whether this explanation method is applicable to efficient transformers (such as those summarized in https://arxiv.org/abs/2009.06732) that use low-rank or sparse attention matrices. In its current form, you need the full, square (n×n) attention matrix to generate explanations. How can one adapt your method to an efficient transformer, such as the Reformer (https://arxiv.org/abs/2001.04451)?