Hi @FarzanT, thanks for your interest! I’m not very familiar with the exact implementation details of the Reformer, but there are ways to deal with non-square attention matrices. The easiest would be to simply consider the last attention layer only; that way there’s no need to multiply attention matrices across layers.
If you still wish to track the attention through the layers, please consider following the model’s forward pass, as suggested in the paper.
Best, Hila.
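For concreteness, here is a minimal sketch of the "last attention layer only" idea. The function name, tensor shapes, and the optional gradient weighting are illustrative assumptions rather than the repository's actual API, and for an efficient transformer you would first need to materialize (or approximate) the last layer's attention rows in dense form:

```python
import torch

def last_layer_relevance(attn_last, grad_last=None, cls_index=0):
    """Token relevance from the final attention layer only (no cross-layer rollout).

    attn_last: (num_heads, seq_len, seq_len) attention weights of the last layer.
    grad_last: optional gradients w.r.t. attn_last, same shape; if given, the
               attention is gradient-weighted, otherwise raw attention is used.
    cls_index: index of the [CLS]/classification token used as the query row.
    """
    if grad_last is not None:
        # keep only positive, gradient-weighted contributions
        attn_last = (grad_last * attn_last).clamp(min=0)
    # average over heads, then read off the [CLS] query's attention to all tokens
    relevance = attn_last.mean(dim=0)[cls_index]
    # normalize so the scores sum to 1 for visualization
    return relevance / (relevance.sum() + 1e-8)

# toy usage: 12 heads, 197 tokens (e.g. a ViT sequence with a [CLS] token)
attn = torch.rand(12, 197, 197).softmax(dim=-1)
print(last_layer_relevance(attn).shape)  # torch.Size([197])
```

Because only one layer is used, the square n×n matrix never has to be reconstructed at every layer, which is what makes this variant easier to adapt to sparse or low-rank attention schemes.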
Hello, excellent work!
I was wondering whether this explanation method is applicable to efficient transformers (such as those summarized in https://arxiv.org/abs/2009.06732) that use low-rank or sparse attention matrices. In its current form, you need the full, square (n×n) attention matrix to generate explanations. How can one adapt your method to an efficient transformer, such as the Reformer (https://arxiv.org/abs/2001.04451)?