nn.LayerNorm is a module with learnable parameters: it not only normalizes the input, it also learns an affine rescaling (a gain and a bias) of the normalized values. I think the different sublayers in an encoder block (e.g. the conv layer, the self-attention layer, and the feed-forward layer) should each have their own nn.LayerNorm instance, so each learns its own parameters rather than sharing one.
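To illustrate the point, here is a minimal pure-Python sketch (no torch; the class and variable names are illustrative, not PyTorch's) of layer normalization with learnable affine parameters. Each instance owns its own gain and bias, which is why giving each sublayer its own norm lets them learn different rescalings:

```python
import math

class LayerNorm:
    """Sketch of layer normalization with a learnable affine transform,
    mirroring what nn.LayerNorm does (pure Python, for illustration only)."""

    def __init__(self, dim, eps=1e-5):
        # Each instance owns its own learnable gain (gamma) and bias (beta).
        self.gamma = [1.0] * dim
        self.beta = [0.0] * dim
        self.eps = eps

    def __call__(self, x):
        # Normalize over the feature dimension, then apply the affine transform.
        mean = sum(x) / len(x)
        var = sum((v - mean) ** 2 for v in x) / len(x)
        return [g * (v - mean) / math.sqrt(var + self.eps) + b
                for v, g, b in zip(x, self.gamma, self.beta)]

# Distinct instances for distinct sublayers hold independent parameters:
norm_attn = LayerNorm(4)   # e.g. the norm paired with self-attention
norm_ffn = LayerNorm(4)    # e.g. the norm paired with the feed-forward layer
norm_ffn.gamma[0] = 2.0    # updating one instance does not affect the other
print(norm_attn.gamma[0], norm_ffn.gamma[0])  # 1.0 2.0
```

If the two sublayers shared a single instance, any gradient update to its gamma/beta would change the normalization applied to both, which is usually not what you want.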