Closed: albertz closed this issue 2 years ago.
Well, maybe having those biases is actually standard? E.g. in Fairseq: https://github.com/facebookresearch/fairseq/blob/b4001184f49ed0e20d619b54bb3d43088fabf990/fairseq/modules/multihead_attention.py#L123-L131
They are also used by default in PyTorch `nn.MultiheadAttention` (https://pytorch.org/docs/stable/generated/torch.nn.MultiheadAttention.html#torch.nn.MultiheadAttention).
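For reference, a quick check in PyTorch shows that both projection biases are there by default (the 512/8 sizes are just example values):

```python
import torch

# PyTorch's nn.MultiheadAttention uses bias=True by default, which adds
# biases to both the packed q/k/v input projection and the output projection.
mha = torch.nn.MultiheadAttention(embed_dim=512, num_heads=8)
print(mha.in_proj_bias.shape)   # torch.Size([1536]), i.e. 3 * embed_dim
print(mha.out_proj.bias.shape)  # torch.Size([512])
```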
Note that `SelfAttentionLayer` was designed to be a 1:1 equivalent of the Tensor2Tensor code, and looking at the current T2T code (I think here), there seems to be no bias there. So that was the original Transformer. But things have evolved since then, and I think Fairseq is probably much more widely used nowadays.
So, as they seem to be standard nowadays, I think having them enabled is ok.
@patrick-wilken Are you aware of this?
I added the option `with_bias`, so you can specify it explicitly. The default is still `True` for now.
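Rough usage sketch (the dim values are arbitrary, and the exact parameter names are my best guess at the returnn-common API, so they may differ in detail):

```python
from returnn_common import nn

# Hedged sketch: with_bias=False restores the original T2T / SelfAttentionLayer
# behavior of bias-free q/k/v and output projections.
# Parameter names besides with_bias are assumptions and may differ.
model_dim = nn.FeatureDim("model", 512)
self_att = nn.SelfAttention(
    in_dim=model_dim,
    proj_dim=model_dim,
    key_dim_total=nn.FeatureDim("key-total", 512),
    value_dim_total=nn.FeatureDim("value-total", 512),
    num_heads=8,
    with_bias=False,
)
```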
No, I wasn't. You won't find papers discussing what difference it makes, right? Maybe I should try it out; biases seem like well-spent parameters. 😄
Note that this `with_bias` option was added to returnn-common. It is not available in `SelfAttentionLayer`.
I noticed that `nn.SelfAttention` is a bit different from `SelfAttentionLayer`: `SelfAttentionLayer` does not have biases for the `qkv` and `proj` linear projections, while `nn.SelfAttention` currently has them. This is relevant for the Conformer (e.g. #233) and the Transformer.
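To make the difference concrete, here is a minimal sketch of the two projection setups (in plain PyTorch, just for illustration, not the actual RETURNN code):

```python
import torch

# The combined q/k/v and output projections of self-attention,
# once without biases (SelfAttentionLayer / original Transformer)
# and once with biases (current nn.SelfAttention / Fairseq / PyTorch defaults).
model_dim = 512

# SelfAttentionLayer-style: no biases.
qkv_no_bias = torch.nn.Linear(model_dim, 3 * model_dim, bias=False)
proj_no_bias = torch.nn.Linear(model_dim, model_dim, bias=False)

# nn.SelfAttention-style (currently): with biases.
qkv_with_bias = torch.nn.Linear(model_dim, 3 * model_dim, bias=True)
proj_with_bias = torch.nn.Linear(model_dim, model_dim, bias=True)
```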