Closed: albertz closed this issue 2 years ago.
Well, maybe having those biases is actually standard? E.g. in Fairseq: https://github.com/facebookresearch/fairseq/blob/b4001184f49ed0e20d619b54bb3d43088fabf990/fairseq/modules/multihead_attention.py#L123-L131
They are also used by default in PyTorch `nn.MultiheadAttention` (https://pytorch.org/docs/stable/generated/torch.nn.MultiheadAttention.html#torch.nn.MultiheadAttention).
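For reference, a quick check in PyTorch shows that both projection biases are there by default (the 512/8 sizes are just example values):

```python
import torch

# PyTorch's nn.MultiheadAttention uses bias=True by default, which adds
# biases to both the packed q/k/v input projection and the output projection.
mha = torch.nn.MultiheadAttention(embed_dim=512, num_heads=8)
print(mha.in_proj_bias.shape)   # torch.Size([1536]), i.e. 3 * embed_dim
print(mha.out_proj.bias.shape)  # torch.Size([512])
```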
Note that `SelfAttentionLayer` was designed to be a 1:1 equivalent of the Tensor2Tensor code, and looking at the current T2T code (I think here), there seems to be no bias there. So that was the original Transformer. But things have evolved since then, and I think Fairseq is probably much more widely used nowadays.
So, as they seem to be standard nowadays, I think having them enabled is ok.
@patrick-wilken Are you aware of this?
I added the option `with_bias`, so you can specify it explicitly. The default is still `True` for now.
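Rough usage sketch (the dim values are arbitrary, and the exact parameter names are my best guess at the returnn-common API, so they may differ in detail):

```python
from returnn_common import nn

# Hedged sketch: with_bias=False restores the original T2T / SelfAttentionLayer
# behavior of bias-free q/k/v and output projections.
# Parameter names besides with_bias are assumptions and may differ.
model_dim = nn.FeatureDim("model", 512)
self_att = nn.SelfAttention(
    in_dim=model_dim,
    proj_dim=model_dim,
    key_dim_total=nn.FeatureDim("key-total", 512),
    value_dim_total=nn.FeatureDim("value-total", 512),
    num_heads=8,
    with_bias=False,
)
```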
No, I wasn't. You won't find papers discussing what difference it makes, right? Maybe I should try it out; biases seem like well-spent parameters. 😄
Note that this `with_bias` option was added to returnn-common. It is not available in `SelfAttentionLayer`.
I noticed that `nn.SelfAttention` is a bit different from `SelfAttentionLayer`: `SelfAttentionLayer` does not have biases for the `qkv` and `proj` linear projections, while `nn.SelfAttention` currently has them. This is relevant for the Conformer (e.g. #233) and the Transformer.
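To make the difference concrete, here is a minimal sketch of the two projection setups (in plain PyTorch, just for illustration, not the actual RETURNN code):

```python
import torch

# The combined q/k/v and output projections of self-attention,
# once without biases (SelfAttentionLayer / original Transformer)
# and once with biases (current nn.SelfAttention / Fairseq / PyTorch defaults).
model_dim = 512

# SelfAttentionLayer-style: no biases.
qkv_no_bias = torch.nn.Linear(model_dim, 3 * model_dim, bias=False)
proj_no_bias = torch.nn.Linear(model_dim, model_dim, bias=False)

# nn.SelfAttention-style (currently): with biases.
qkv_with_bias = torch.nn.Linear(model_dim, 3 * model_dim, bias=True)
proj_with_bias = torch.nn.Linear(model_dim, model_dim, bias=True)
```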