rwth-i6 / returnn_common

Common building blocks for RETURNN configs, such as models, training concepts, etc.

GenericSelfAttention, biases are inconsistent to SelfAttentionLayer #234

Closed: albertz closed this issue 1 year ago

albertz commented 1 year ago

I noticed that nn.SelfAttention differs slightly from SelfAttentionLayer: SelfAttentionLayer does not have biases for the qkv and proj linear projections, while nn.SelfAttention currently does.

This is relevant for Conformer (e.g. #233) and Transformer.
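For illustration, a minimal PyTorch-style sketch of the difference (this is not the RETURNN implementation, just the projections and shapes involved; the dimension of 512 is arbitrary):

```python
import torch

embed_dim = 512

# SelfAttentionLayer (RETURNN layer): no biases, like the original Tensor2Tensor code
qkv_no_bias = torch.nn.Linear(embed_dim, 3 * embed_dim, bias=False)
proj_no_bias = torch.nn.Linear(embed_dim, embed_dim, bias=False)

# nn.SelfAttention (returnn-common, current behavior): biases enabled
qkv_with_bias = torch.nn.Linear(embed_dim, 3 * embed_dim, bias=True)
proj_with_bias = torch.nn.Linear(embed_dim, embed_dim, bias=True)

# The parameter difference is just the two bias vectors: 3*embed_dim + embed_dim values.
extra_params = qkv_with_bias.bias.numel() + proj_with_bias.bias.numel()
print(extra_params)  # 2048 for embed_dim=512
```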

albertz commented 1 year ago

Well, maybe having those biases is actually standard? E.g. in Fairseq: https://github.com/facebookresearch/fairseq/blob/b4001184f49ed0e20d619b54bb3d43088fabf990/fairseq/modules/multihead_attention.py#L123-L131

albertz commented 1 year ago

Also used by default in PyTorch nn.MultiheadAttention (https://pytorch.org/docs/stable/generated/torch.nn.MultiheadAttention.html#torch.nn.MultiheadAttention).
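A quick check confirms this: in PyTorch, bias=True is the default and the bias parameters can be inspected directly.

```python
import torch

# bias=True is the default
mha = torch.nn.MultiheadAttention(embed_dim=512, num_heads=8)
print(mha.in_proj_bias is not None)   # True: shared bias for the q/k/v projections
print(mha.out_proj.bias is not None)  # True: bias of the output projection

# Disabling it removes both bias parameters
mha_no_bias = torch.nn.MultiheadAttention(embed_dim=512, num_heads=8, bias=False)
print(mha_no_bias.in_proj_bias is None)   # True
print(mha_no_bias.out_proj.bias is None)  # True
```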

albertz commented 1 year ago

Note that SelfAttentionLayer was designed to be 1:1 equivalent to the Tensor2Tensor code, and looking at the current T2T code (I think here), there seems to be no bias there either. So the original Transformer did not use biases. But things have evolved since then, and I think Fairseq is probably much more widely used now.

albertz commented 1 year ago

In ESPNet, bias is also used: https://github.com/espnet/espnet/blob/a65cc78de7e18c867f4be5fc0b9b695875c78c70/espnet/nets/pytorch_backend/transformer/attention.py#L32-L34

albertz commented 1 year ago

So, as they seem to be standard nowadays, I think having them enabled by default is ok.

albertz commented 1 year ago

@patrick-wilken Are you aware of this?

albertz commented 1 year ago

I added the option with_bias, so you can specify it explicitly. The default remains True for now.
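Usage would look roughly like this (a sketch only; the constructor arguments other than with_bias are assumed here and may differ from the actual nn.SelfAttention signature):

```python
from returnn_common import nn

# Sketch under assumptions: dim names and other arguments are illustrative.
# with_bias=False would reproduce the old SelfAttentionLayer behavior (no qkv/proj biases).
self_att = nn.SelfAttention(
    in_dim=nn.FeatureDim("model", 512),
    proj_dim=nn.FeatureDim("model-out", 512),
    key_dim_total=nn.FeatureDim("key", 512),
    value_dim_total=nn.FeatureDim("value", 512),
    num_heads=8,
    with_bias=False,  # disable biases on the qkv and proj linear projections
)
```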

patrick-wilken commented 1 year ago

No, I wasn't. You won't find papers discussing what difference it makes, right? Maybe I should try it out; biases seem like well-spent parameters. 😄

albertz commented 1 year ago

Note that this with_bias was added to returnn-common. It's not available in SelfAttentionLayer.