Closed ThisisBillhe closed 2 years ago
Hi @ThisisBillhe, thanks for your interest!
The motivation is to average across the different attention heads. A simple average ignores the different "roles" of each head, so we use gradients to obtain a class-specific signal which determines the "relevance" of each head to the output prediction. Using R(A) alone will result in a class-agnostic and noisy relevance maps.
I hope this helps. Best, Hila.
Thank you for the outstanding work!! However, I wonder why you multiply R(A) by G(A) to get the final result. According to my calculation, R(A) is equal to A * G(A) / C, where C is a constant. What would happen if we use R(A) alone? And what’s the motivation to multiply them together?
Looking forward to your reply!!