Okay, never mind, of course it sums to 1. I didn't call encoder.eval()
before testing, and the dropout layer in the attention caused it to be off.
However, there is one thing I still don't understand: why do I get values larger than one when summing without setting the model to evaluation mode? If the dropout is applied before the softmax, that shouldn't be possible, and it shouldn't be possible when it is applied after the softmax either. If anybody has an idea, I'd love to hear it.
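For context, the effect can be reproduced in isolation with just a softmax followed by an nn.Dropout in training mode (a standalone sketch with made-up shapes, not the model's actual code):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

scores = torch.randn(1, 5, 5)              # fake attention logits
weights = torch.softmax(scores, dim=-1)    # each row sums to 1 here
print(weights.sum(dim=-1))                 # tensor of ones

drop = nn.Dropout(p=0.1)
drop.train()                               # training mode, i.e. without model.eval()
# nn.Dropout zeroes some entries and rescales the kept ones by 1/(1-p) in training mode
dropped = drop(weights)
print(dropped.sum(dim=-1))                 # rows can now be larger (or smaller) than 1
```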
When creating a transformer with
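(the exact snippet is missing here; the following is only a representative sketch using the stock torch.nn modules, with hyperparameters chosen arbitrarily, not necessarily the code I used)

```python
import torch.nn as nn

# hypothetical hyperparameters; the real ones were in the omitted snippet
encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, dim_feedforward=2048, dropout=0.1)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)
```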
and doing the forward pass with
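(again just a sketch with made-up shapes; the real call was in the omitted snippet)

```python
import torch

src = torch.rand(10, 2, 512)   # (seq_len, batch, d_model), arbitrary sizes
# encoder.eval()               # <- the step I had forgotten
output = encoder(src)
```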
and capturing the attention matrix as described in #20, the resulting attention matrix does not sum to 1. Am I missing something? Since the attention layer includes a softmax, it should sum to 1, I think...
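I haven't reproduced the exact capture method from #20 here, but as a sanity check on the sums-to-1 expectation, a bare nn.MultiheadAttention in eval mode does return row-normalized weights (sketch with arbitrary shapes, independent of the model above):

```python
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=512, num_heads=8, dropout=0.1)
mha.eval()                                  # dropout disabled

x = torch.rand(10, 2, 512)                  # (seq_len, batch, embed_dim)
attn_output, attn_weights = mha(x, x, x, need_weights=True)
print(attn_weights.shape)                   # (batch, seq_len, seq_len)
print(attn_weights.sum(dim=-1))             # each row sums to 1 in eval mode
```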