Closed — benjaminbergner closed this 3 years ago
Hi,
I assume you refer to the implementation here. As you can see, a gradient is returned both for the features and the attention.
Regarding the derivation, the logic is as follows:
dL/dθ = dL/dE[f] · dE[f]/dθ
      = dL/dE[f] · (d(a·f)/dθ) / a
      = dL/dE[f] · ((da/dθ)·(f/a) + (df/dθ)·(a/a))
      = dL/dE[f] · ((da/dθ)·(f/a) + df/dθ)
This means that the gradient with respect to the features is computed as if there were no attention distribution, while the gradient with respect to the attention distribution is scaled inversely proportionally to the attention scores.
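To see why this per-sample rule is consistent, note that averaging it over many indices drawn from `a` recovers the exact gradients of E[f] = Σ a_i·f_i, namely `a` (with respect to `f`) and `f` (with respect to `a`). Below is a small Monte Carlo sanity check in NumPy; the values of `a` and `f` are made up for illustration, and this is a sketch rather than the repository's actual code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy values: an attention distribution `a` over 4 positions
# and one scalar feature `f` per position.
a = np.array([0.1, 0.2, 0.3, 0.4])
f = np.array([1.0, -2.0, 0.5, 3.0])

N = 200_000  # samples drawn with replacement from `a`
idx = rng.choice(len(a), size=N, p=a)

# Per-sample gradient rule from the derivation above:
#   wrt f at the sampled position: 1        (as if there were no attention)
#   wrt a at the sampled position: f_i/a_i  (scaled inversely by the score)
grad_f = np.zeros_like(f)
grad_a = np.zeros_like(a)
np.add.at(grad_f, idx, 1.0)
np.add.at(grad_a, idx, f[idx] / a[idx])
grad_f /= N
grad_a /= N

# Exact gradients of E[f] = sum_i a_i * f_i for comparison.
print(grad_f)  # ≈ a
print(grad_a)  # ≈ f
```

With enough samples, `grad_f` converges to `a` and `grad_a` converges to `f`, so the rule is an unbiased estimator of the true gradient even though each individual sample only touches one position.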
Let me know if this actually helped clarify the situation a bit.
Cheers, Angelos
Hi Angelos,
thanks a lot for your explanation. The concept is clear to me now 👍
Hi, thanks for your paper and the code base.
I have a question about eq. 12. In the paper, the derivative is taken of the features multiplied by the attention scores. However, in the backward pass (in ExpectWithReplacement), only the features appear to be considered.
I probably misunderstand something, so I'd appreciate clarification. Thanks in advance.