idiap / fast-transformers

Pytorch library for fast transformer implementations

Full Attention does not sum to 1 #131

Closed yourj4m closed 4 months ago

yourj4m commented 4 months ago

When creating a transformer with

from fast_transformers.builders import TransformerEncoderBuilder

encoder = TransformerEncoderBuilder.from_kwargs(
    n_layers=num_layers,
    n_heads=num_heads,
    query_dimensions=d_model,
    value_dimensions=d_model,
    feed_forward_dimensions=ff_dim,
    attention_type='full',
    activation='gelu',
).get()

and doing the forward pass with

encoder(data, data_mask)

and capturing the attention matrix as described in #20, the rows of the resulting attention matrix do not sum to 1. Am I missing something? Since the attention layer applies a softmax, each row should sum to 1, I think...
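For reference, a minimal sketch of that capture setup, reusing data and data_mask from above and assuming the event-dispatcher approach referenced in #20 (EventDispatcher, AttentionEvent and its attention_matrix attribute are taken from the library docs, so treat the exact names as assumptions):

import torch
from fast_transformers.events import EventDispatcher, AttentionEvent

# Collect the attention weights dispatched by the attention layers
attentions = []

def save_attention(event):
    attentions.append(event.attention_matrix.detach().cpu())

EventDispatcher.get().listen(AttentionEvent, save_attention)

with torch.no_grad():
    encoder(data, data_mask)

# Each row is meant to be a distribution over the keys
print(attentions[0].sum(dim=-1))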

yourj4m commented 4 months ago

Okay, never mind, of course it sums to 1. I didn't call encoder.eval() before testing, and the dropout layer in the attention caused it to be off. However, there is one thing I still don't understand, namely that I get values larger than one when summing without setting the model to evaluation mode. If the dropout were applied before the softmax, the softmax would still renormalize the rows, and if it is applied after the softmax, dropping entries should only make the sums smaller, so values above 1 shouldn't be possible either way. If anybody has an idea I'd love to hear it.
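A likely explanation, illustrated with a small standalone check rather than the library itself: PyTorch's nn.Dropout implements inverted dropout, so in training mode the surviving entries are rescaled by 1/(1-p). Applied after the softmax, that scaling can push a row's sum above 1 (or below it, depending on which entries happen to be dropped), while in eval mode dropout is the identity and every row sums to 1 again:

import torch

torch.manual_seed(0)

scores = torch.randn(4, 8)                # dummy attention scores
weights = torch.softmax(scores, dim=-1)   # every row sums to 1
drop = torch.nn.Dropout(p=0.1)

drop.train()
print(drop(weights).sum(dim=-1))  # rows deviate from 1; survivors are scaled by 1/0.9

drop.eval()
print(drop(weights).sum(dim=-1))  # identity in eval mode, rows sum to 1 again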