Hi @tinyalpha ,
In transformers, the conventional data format is [B, W*H, C]. Transformers usually use Layer Normalization, which only applies to the channel dimension. If the data format is [B, C, W, H], as in this repo, that means only C is normalized, not [W, H]. Currently, there is no PyTorch LayerNorm API for the [B, C, W, H] data format that normalizes only the channel dimension, so I implemented it here and named it `LayerNormChannel`. The Layer Normalization in the paper refers to this conventional Layer Normalization of transformers.
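For reference, here is a minimal sketch of such a channel-only LayerNorm for (B, C, H, W) input. The class name `ChannelOnlyLayerNorm` is just for illustration; the repo's actual `LayerNormChannel` may differ in minor details such as the default `eps`.

```python
import torch
import torch.nn as nn

class ChannelOnlyLayerNorm(nn.Module):
    """LayerNorm over the channel dim only, for (B, C, H, W) input (sketch)."""
    def __init__(self, num_channels, eps=1e-5):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(num_channels))
        self.bias = nn.Parameter(torch.zeros(num_channels))
        self.eps = eps

    def forward(self, x):
        # mean and (biased) variance over the channel dimension only
        u = x.mean(dim=1, keepdim=True)
        s = (x - u).pow(2).mean(dim=1, keepdim=True)
        x = (x - u) / torch.sqrt(s + self.eps)
        # per-channel affine parameters, broadcast over H and W
        return self.weight[None, :, None, None] * x + self.bias[None, :, None, None]
```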
For data format [B, C, W, H], Group Normalization normalizes dimensions C, W, and H at the same time.
In summary (run in poolformer root dir):
```python
import torch
import torch.nn as nn

from models.poolformer import LayerNormChannel  # only normalizes the channel dim
from models.poolformer import GroupNorm         # group number == 1

C, W, H = 64, 56, 56  # data format [B, C, W, H]

layer_norm_channel = LayerNormChannel(C)  # only normalizes C; conventionally used by transformers and in this paper's ablation study
print(layer_norm_channel.weight.shape)    # learnable affine weight shape: torch.Size([64])

group_norm = GroupNorm(C)                 # normalizes C, H, and W
print(group_norm.weight.shape)            # learnable affine weight shape: torch.Size([64])

layer_norm = nn.LayerNorm([C, H, W])      # normalizes C, H, and W
print(layer_norm.weight.shape)            # learnable affine weight shape: torch.Size([64, 56, 56])
```
You can clearly see their differences from the above code. Therefore, the statement "when the number of groups is 1, Group Normalization is equivalent to Layer Normalization" is NOT right. There is a small mistake in the PyTorch GroupNorm API documentation: the sentence "Put all 6 channels into a single group (equivalent with LayerNorm)" in the docs is NOT right.
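A quick sanity check of this point (a sketch assuming both modules keep their default `eps` and freshly initialized affine parameters): with the default ones/zeros affine parameters the normalized values of `nn.GroupNorm(1, C)` and `nn.LayerNorm([C, H, W])` coincide, but the learnable parameter shapes do not, so the two modules are not interchangeable once training starts.

```python
import torch
import torch.nn as nn

B, C, H, W = 2, 64, 56, 56
x = torch.randn(B, C, H, W)

group_norm = nn.GroupNorm(num_groups=1, num_channels=C)
layer_norm = nn.LayerNorm([C, H, W])

# With default (ones/zeros) affine parameters, the normalized outputs match ...
print(torch.allclose(group_norm(x), layer_norm(x), atol=1e-5))  # expected: True

# ... but the learnable affine parameters have different shapes.
print(group_norm.weight.shape)  # torch.Size([64])
print(layer_norm.weight.shape)  # torch.Size([64, 56, 56])
```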
Thanks for your explanation!
You are welcome~
This really helps. Thanks a lot.
So it means that applying `nn.LayerNorm` on (B, C, H, W) is equal to `GroupNorm` with groups = 1, right? And the `LayerNorm` that decreases the performance in the ablation study is `LayerNormChannel`, not `nn.LayerNorm`, right?
Hi @iumyx2612 , thanks for your attention.

1. Applying `nn.LayerNorm` on (B, C, H, W) is NOT equal to `GroupNorm` with groups = 1. In the example given above, `print(group_norm.weight.shape)` gives `torch.Size([64])`, while `print(layer_norm.weight.shape)` gives `torch.Size([64, 56, 56])`. The difference is the shape of their affine weight and bias.
2. `LayerNormChannel` for input (B, C, H, W) is equal to `nn.LayerNorm` for input (B, N, C). Because this repo uses data format (B, C, H, W), the LayerNorm implementation in this repo is `LayerNormChannel`, corresponding to the results of the ablation study.

In summary, to identify whether two normalizations are equal, first check which dims are used to compute the mean and variance, and second check the shape of their affine weight and bias.

For (B, C, H, W) input:

| Norm | Dims used to compute mean and variance | Shape of affine weight and bias |
|---|---|---|
| `LayerNormChannel(num_channels=C)` | (C, ) | (C, ) |
| `nn.GroupNorm(num_groups=1, num_channels=C)` | (C, H, W) | (C, ) |
| `nn.LayerNorm(normalized_shape=(C, H, W))` | (C, H, W) | (C, H, W) |

For (B, N, C) input (like timm's ViT implementation):

| Norm | Dims used to compute mean and variance | Shape of affine weight and bias |
|---|---|---|
| `nn.LayerNorm(normalized_shape=C)` | (C, ) | (C, ) |

You may also refer to another example where I implement a general LayerNorm for different situations with data format (B, H, W, C).
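To illustrate point 2, here is a small check. It is only a sketch: it assumes it is run from the poolformer root dir (as in the earlier snippet) and that both modules keep their default affine parameters and `eps`.

```python
import torch
import torch.nn as nn
from models.poolformer import LayerNormChannel

B, C, H, W = 2, 64, 56, 56
x = torch.randn(B, C, H, W)

# LayerNormChannel applied directly on the (B, C, H, W) tensor
out_channel = LayerNormChannel(C)(x)

# nn.LayerNorm applied on the (B, N, C) view of the same tensor
x_tokens = x.flatten(2).transpose(1, 2)            # (B, H*W, C)
out_tokens = nn.LayerNorm(C)(x_tokens)
out_tokens = out_tokens.transpose(1, 2).reshape(B, C, H, W)

print(torch.allclose(out_channel, out_tokens, atol=1e-5))  # expected: True
```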
Thanks for the detailed explanation. I can see that `nn.LayerNorm` and `nn.GroupNorm(num_groups=1)` compute the mean and variance over the same dimensions; however, the weights of the norm layers are different. Since both `nn.LayerNorm` and `nn.GroupNorm(num_groups=1)` do the same job in this situation (the dims used to compute mean and var are the same), does using `nn.LayerNorm` perform worse than `nn.GroupNorm`, or better? After all, the `LayerNorm` in the ablation study is `LayerNormChannel` (which only computes mean and var over C), not `nn.LayerNorm` (which computes mean and var over the same dims as `nn.GroupNorm`).
Hi @iumyx2612 , yes, the difference between `nn.LayerNorm(normalized_shape=(C, H, W))` and `nn.GroupNorm(num_groups=1, num_channels=C)` is the shape of their affine weight and bias, while the difference between `nn.GroupNorm(num_groups=1, num_channels=C)` and ViT's LayerNorm is the dims used to compute the mean and variance. As for your question, I did not conduct experiments with `nn.LayerNorm(normalized_shape=(C, H, W))` because it means the model can only accept one input resolution. So I don't have experimental results on hand for the comparison between `nn.LayerNorm(normalized_shape=(C, H, W))` and `nn.GroupNorm(num_groups=1, num_channels=C)`, and I'm sorry that I cannot give a clear conclusion for it.
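To see why `nn.LayerNorm(normalized_shape=(C, H, W))` ties the model to one input resolution, here is a small sketch (the exact error message depends on the PyTorch version):

```python
import torch
import torch.nn as nn

C = 64
layer_norm = nn.LayerNorm([C, 56, 56])  # bakes the 56x56 resolution into the module
group_norm = nn.GroupNorm(num_groups=1, num_channels=C)

x_small = torch.randn(2, C, 28, 28)     # a different input resolution

print(group_norm(x_small).shape)        # works: torch.Size([2, 64, 28, 28])
try:
    layer_norm(x_small)                 # fails: normalized_shape no longer matches
except RuntimeError as e:
    print("nn.LayerNorm rejects the new resolution:", e)
```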
Very clear explanation, thank you for taking the time to answer me! Much appreciated.
Welcome~
Hi, thanks for your excellent work. In your ablation studies (section 4.4), you compared Group Normalization (the group number is set to 1 for simplicity), Layer Normalization, and Batch Normalization. The conclusion is that Group Normalization is 0.7% or 0.8% higher than Layer Normalization or Batch Normalization. But when the number of groups is 1, Group Normalization is equivalent to Layer Normalization, right?