Closed: oliverdutton closed this issue 3 years ago.
Update: I'm wrong, you still need relprop to get ∂y_t/∂A in Eq. 5 of the MM paper.
Update 2: I found what I was describing in DETR/modules/layers.py::MultiheadAttention. Thank you.
I have the same question: is there a colab for Transformer-MM-Explainability on a vanilla Transformer model (maybe ViT)? I believe this is important for comparing the two methods. @hila-chefer
Thank you for your help.
Hi @oliverdutton and @betterze! First, thanks for your interest in our work. I created a colab notebook with ViT examples per your requests. A few things worth noticing:
Hila.
@hila-chefer Thanks a lot. It is very helpful.
Thank you so much @hila-chefer.
I see I was right, then I was wrong: I hadn't quite understood that one_hot.backward(retain_graph=True) was how you were getting ∂y_t/∂A in Eq. 5 of the MM paper, which was what I was worrying about. That is very neat.
I was a little lost, and the 'hand holding' of the notebook and the stripped-down ViT_new.py is very useful, so thank you; I can understand it's not the most exciting thing for you to write.
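In case it helps anyone else who stalls on the same point, here is a toy sketch of the mechanism as I now understand it (a minimal module of my own, not the repo's code): the attention map is retained during the forward pass, a one-hot of the target logit is backpropagated, and the map's .grad then holds ∂y_t/∂A.

```python
import torch
import torch.nn as nn

class ToyAttention(nn.Module):
    """Single-head self-attention that keeps its attention map so the map's
    gradient can be read after backward()."""
    def __init__(self, dim):
        super().__init__()
        self.qkv = nn.Linear(dim, dim * 3)
        self.attn = None  # filled on every forward pass

    def forward(self, x):                              # x: (batch, tokens, dim)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        A = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        A.retain_grad()                                # keep d y_t / d A at backward time
        self.attn = A
        return A @ v

dim, tokens = 8, 5
block = ToyAttention(dim)
head = nn.Linear(dim, 3)                               # toy classifier head with 3 "classes"

x = torch.randn(1, tokens, dim)
logits = head(block(x).mean(dim=1))                    # shape (1, 3)

target_class = 2
one_hot = torch.zeros_like(logits)
one_hot[0, target_class] = 1.0
(one_hot * logits).sum().backward(retain_graph=True)   # backprop only the target logit y_t

grad_A = block.attn.grad                               # d y_t / d A, the term from Eq. 5
print(grad_A.shape)                                    # torch.Size([1, 5, 5])
```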
@oliverdutton it was no bother at all, I'm really glad it helped you to better understand the paper :) please feel free to ask questions :) I'd be very happy to help.
@hila-chefer Thank you for releasing the notebook. I am trying to implement the explanation method for vanilla BERT, RoBERTa, and other text-transformer architectures. If possible, could you share whether there's an easier way to add the relevant modifications to the Hugging Face models? Thank you.
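For context, the rough route I've been trying (which may well differ from your implementation) is to keep the stock Hugging Face model, ask it for its attention tensors, and retain their gradients. This assumes the tensors returned with output_attentions=True stay in the autograd graph:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
model.eval()

inputs = tokenizer("a movie both moving and funny", return_tensors="pt")
outputs = model(**inputs, output_attentions=True)

# one (batch, heads, seq, seq) tensor per layer; retain their grads before backward
for A in outputs.attentions:
    A.retain_grad()

target_class = outputs.logits.argmax(dim=-1).item()
one_hot = torch.zeros_like(outputs.logits)
one_hot[0, target_class] = 1.0
(one_hot * outputs.logits).sum().backward()

# gradient-weighted attention per layer, averaged over heads
cams = [(A.grad * A).clamp(min=0).mean(dim=1) for A in outputs.attentions]
```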
Wonderful paper. To check whether I'm getting something very wrong, here is my understanding of the differences between the two papers.
Transformer-Explainability: you generate the building blocks with relprop as an accompanying function to propagate the relevances backward. This is LRP, working backwards from the CAMs (Class Activation Maps), so you propagate from outputs to inputs. To back-propagate you have to hard-code all the flow, i.e. all the concats and splits of data, e.g. during self-attention, where you diverge into cam1 and cam2 for the matrix multiplication, halve each with /= 2, and then rejoin them in the clone. This is awkward, and is what you're referring to when you say 'LRP requires a custom implementation of all network layers.' in the MM paper.
Transformer-MM-Explainability: you work forwards and simply add the methods and hooks in the forward pass to save the attention maps and their gradients. This makes tracking things far easier, as you don't need to reverse-engineer the flow.
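In pseudocode, my reading of the resulting single-modality rule is roughly the following; the tensors here are placeholders standing in for whatever the forward hooks saved, so this is my own sketch rather than your code:

```python
import torch

def mm_relevance(attn_maps, attn_grads):
    # R starts as the identity; every layer adds a gradient-weighted rollout step:
    #   A_bar = mean over heads of (dA * A), with negative contributions clamped away
    #   R    <- R + A_bar @ R
    tokens = attn_maps[0].shape[-1]
    R = torch.eye(tokens)
    for A, dA in zip(attn_maps, attn_grads):
        A_bar = (dA * A).clamp(min=0).mean(dim=1)[0]   # average heads, drop batch dim
        R = R + A_bar @ R
    return R                                           # row of the [CLS] token = per-token relevance

# usage with placeholder tensors standing in for the hooked maps and gradients
attn_maps  = [torch.rand(1, 12, 197, 197) for _ in range(12)]    # ViT-B/16-like shapes
attn_grads = [torch.randn(1, 12, 197, 197) for _ in range(12)]
relevance = mm_relevance(attn_maps, attn_grads)[0, 1:]           # CLS row over the patch tokens
```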
Request: you have great alterations of the DETR, CLIP, LXMERT and VisualBERT repos that allow all the interaction coupling scores for the baselines and your method to be calculated, and that plug smoothly into the whole repo.
Could you provide an example using the forward-pass formulation of Transformer-MM-Explainability on a vanilla Transformer model (just your interaction scores, not all the other baseline methods) to act as a very simple, single-modality demonstrative example, ideally a Jupyter notebook with comments?