idiap / fast-transformers

Pytorch library for fast transformer implementations

Full Attention does not sum to 1 #131

Closed yourj4m closed 4 months ago

yourj4m commented 4 months ago

When creating a transformer with

from fast_transformers.builders import TransformerEncoderBuilder

encoder = TransformerEncoderBuilder.from_kwargs(
    n_layers=num_layers,
    n_heads=num_heads,
    query_dimensions=d_model,
    value_dimensions=d_model,
    feed_forward_dimensions=ff_dim,
    attention_type='full',
    activation='gelu',
).get()

and doing the forward pass with

encoder(data, data_mask)

and capturing the attention matrix as described in #20, the rows of the resulting attention matrix do not sum to 1. Am I missing something? Since the attention layer applies a softmax, each row should sum to 1, I think...
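For reference, a minimal sketch of that capture setup, reusing data and data_mask from above and assuming the event-dispatcher approach referenced in #20 (EventDispatcher, AttentionEvent and its attention_matrix attribute are taken from the library docs, so treat the exact names as assumptions):

import torch
from fast_transformers.events import EventDispatcher, AttentionEvent

# Collect the attention weights dispatched by the attention layers
attentions = []

def save_attention(event):
    attentions.append(event.attention_matrix.detach().cpu())

EventDispatcher.get().listen(AttentionEvent, save_attention)

with torch.no_grad():
    encoder(data, data_mask)

# Each row is meant to be a distribution over the keys
print(attentions[0].sum(dim=-1))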

yourj4m commented 4 months ago

Okay, never mind, of course it sums to 1. I didn't call encoder.eval() before testing, and the dropout layer in the attention caused it to be off. However, there is one thing I still don't understand, namely that I get values larger than one when summing without setting the model to evaluation mode. If the dropout were applied before the softmax, the softmax would still renormalize the rows, and if it is applied after the softmax, dropping entries should only make the sums smaller, so values above 1 shouldn't be possible either way. If anybody has an idea I'd love to hear it.
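A likely explanation, illustrated with a small standalone check rather than the library itself: PyTorch's nn.Dropout implements inverted dropout, so in training mode the surviving entries are rescaled by 1/(1-p). Applied after the softmax, that scaling can push a row's sum above 1 (or below it, depending on which entries happen to be dropped), while in eval mode dropout is the identity and every row sums to 1 again:

import torch

torch.manual_seed(0)

scores = torch.randn(4, 8)                # dummy attention scores
weights = torch.softmax(scores, dim=-1)   # every row sums to 1
drop = torch.nn.Dropout(p=0.1)

drop.train()
print(drop(weights).sum(dim=-1))  # rows deviate from 1; survivors are scaled by 1/0.9

drop.eval()
print(drop(weights).sum(dim=-1))  # identity in eval mode, rows sum to 1 again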