Closed: xesdiny closed this issue 3 years ago.

Hi, I do not understand why LayerNorm is used in the down-sampler FCU and BatchNorm in the up-sampler FCU to normalize the features. Is there any special meaning?
Hi~ As you know, BN and LN perform normalization along different dimensions, and we use BN in the CNN branch and LN in the transformer branch. Fusing them without any normalization would cause feature misalignment and further lead to training collapse. Therefore, when converting the CNN feature to the transformer feature, we use LN to normalize the CNN feature so that it matches the representation of the transformer branch.
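For intuition, here is a minimal PyTorch sketch of the down-sampling direction (CNN feature map to transformer tokens). It is not the repository's exact code; the class name `FCUDownSketch` and its parameters are made up for illustration. The point is only that the flattened CNN feature passes through a LayerNorm so its statistics match the LN-normalized transformer tokens it is fused with.

```python
import torch
import torch.nn as nn

class FCUDownSketch(nn.Module):
    """Hypothetical sketch: CNN feature map -> transformer tokens, aligned with LayerNorm."""
    def __init__(self, in_channels: int, embed_dim: int, down_stride: int = 4):
        super().__init__()
        self.project = nn.Conv2d(in_channels, embed_dim, kernel_size=1)        # channel projection
        self.pool = nn.AvgPool2d(kernel_size=down_stride, stride=down_stride)  # spatial down-sampling
        self.ln = nn.LayerNorm(embed_dim)  # per-token normalization, matching the transformer branch
        self.act = nn.GELU()

    def forward(self, x_cnn: torch.Tensor) -> torch.Tensor:
        # x_cnn: (N, C, H, W) feature map, normalized by BN inside the CNN branch
        x = self.project(x_cnn)           # (N, D, H, W)
        x = self.pool(x)                  # (N, D, H/s, W/s)
        x = x.flatten(2).transpose(1, 2)  # (N, H*W/s^2, D) token sequence
        x = self.ln(x)                    # LN re-aligns statistics with the LN-based transformer tokens
        return self.act(x)

# quick shape check
tokens = FCUDownSketch(in_channels=256, embed_dim=384)(torch.randn(2, 256, 56, 56))
print(tokens.shape)  # torch.Size([2, 196, 384])
```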
Emm, what I mean is: why not use BN for both? I think LN is suitable for accelerating LSTMs (RNNs), but it does not achieve better results than BN when applied to CNNs.
Emm, we are not sure whether using BN in ViT would be effective, so the original structure of ViT is retained. As for the FCU, LN is used only to align the two features; if both used BN, the alignment goal could not be achieved and training would produce NaN.
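The up-sampling direction goes the opposite way. The sketch below is again hypothetical (the name `FCUUpSketch` and its parameters are made up): transformer tokens are reshaped back into a feature map and passed through BatchNorm so their statistics match the BN-normalized CNN branch before fusion.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FCUUpSketch(nn.Module):
    """Hypothetical sketch: transformer tokens -> CNN feature map, aligned with BatchNorm."""
    def __init__(self, embed_dim: int, out_channels: int):
        super().__init__()
        self.project = nn.Conv2d(embed_dim, out_channels, kernel_size=1)  # channel projection
        self.bn = nn.BatchNorm2d(out_channels)  # per-channel normalization, matching the BN-based CNN branch
        self.act = nn.ReLU(inplace=True)

    def forward(self, tokens: torch.Tensor, out_size: tuple) -> torch.Tensor:
        # tokens: (N, L, D) sequence, normalized by LN inside the transformer branch
        n, l, d = tokens.shape
        side = int(l ** 0.5)  # assume a square token grid (class token already removed)
        x = tokens.transpose(1, 2).reshape(n, d, side, side)  # (N, D, h, w)
        x = self.act(self.bn(self.project(x)))  # BN re-aligns statistics with the CNN feature map
        return F.interpolate(x, size=out_size)  # up-sample back to the CNN branch resolution

# quick shape check
feat = FCUUpSketch(embed_dim=384, out_channels=256)(torch.randn(2, 196, 384), out_size=(56, 56))
print(feat.shape)  # torch.Size([2, 256, 56, 56])
```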
LGTM!