This paper provides new perspectives on the Transformer block, but I have a question about one of the details. As far as I know, the LayerNorm officially provided by PyTorch already implements the same function as the proposed MLN: it computes the mean and variance jointly over the token and channel dimensions. So where is the improvement?
The official example from the PyTorch documentation (imports added for completeness):

import torch
import torch.nn as nn

# Image Example
N, C, H, W = 20, 5, 10, 10
input = torch.randn(N, C, H, W)
# Normalize over the last three dimensions
# (i.e. the channel and spatial dimensions), as shown in the image below
layer_norm = nn.LayerNorm([C, H, W])
output = layer_norm(input)
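To make my question concrete, here is a minimal sketch (the shapes B, T, C and the manual-normalization check are my own illustration, not from the paper) showing that nn.LayerNorm with normalized_shape=[T, C] already normalizes jointly over the token and channel dimensions, which is what I understand MLN to do:

```python
import torch
import torch.nn as nn

# Hypothetical Transformer shapes: batch B, tokens T, channels C
B, T, C = 2, 4, 8
x = torch.randn(B, T, C)

# LayerNorm over BOTH the token and channel dimensions
ln = nn.LayerNorm([T, C])
out = ln(x)

# Manual check: mean/variance computed jointly over the last two dims
mean = x.mean(dim=(-2, -1), keepdim=True)
var = x.var(dim=(-2, -1), unbiased=False, keepdim=True)
manual = (x - mean) / torch.sqrt(var + ln.eps)

# The two should match at initialization, where the affine
# transform (weight=1, bias=0) is the identity
print(torch.allclose(out, manual, atol=1e-5))
```

If MLN is numerically identical to this, the improvement would have to come from somewhere other than the normalization statistics themselves.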