pengzhiliang / Conformer

Official code for Conformer: Local Features Coupling Global Representations for Visual Recognition
Apache License 2.0

QA: Norm acts on Up and Down sampler respectively #4

Closed xesdiny closed 3 years ago

xesdiny commented 3 years ago

Hi, guys. I do not understand why LayerNorm is used in the down-sampler FCU and BatchNorm in the up-sampler FCU to normalize the features. Is there any special meaning?

pengzhiliang commented 3 years ago

Hi~ As you know, BN and LN perform normalization along different dimensions: we use BN in the CNN branch and LN in the transformer branch. If the two features are fused without any normalization, the misalignment between them can lead to training collapse. Therefore, when converting the CNN feature into a transformer feature, we use LN to normalize the CNN feature so that it matches the representation of the transformer branch.
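For concreteness, a minimal PyTorch sketch of what such a down-sampler could look like; the class name `FCUDown` and the exact layers (1x1 conv, average pooling, GELU) are illustrative assumptions based on the paper's description, not the repository's actual code:

```python
import torch.nn as nn

class FCUDown(nn.Module):
    """CNN feature map -> transformer tokens (illustrative sketch, not the repo's exact code)."""
    def __init__(self, in_channels, embed_dim, pool_stride=4):
        super().__init__()
        self.project = nn.Conv2d(in_channels, embed_dim, kernel_size=1)        # align channel dim
        self.pool = nn.AvgPool2d(kernel_size=pool_stride, stride=pool_stride)  # align spatial size
        self.norm = nn.LayerNorm(embed_dim)  # LN, to match the transformer branch
        self.act = nn.GELU()

    def forward(self, x):                 # x: [B, C, H, W]
        x = self.pool(self.project(x))    # -> [B, D, H', W']
        x = x.flatten(2).transpose(1, 2)  # -> [B, H'*W', D], token layout
        return self.act(self.norm(x))     # normalize in token space before fusion
```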

xesdiny commented 3 years ago

> As you know, BN and LN perform normalization along different dimensions: we use BN in the CNN branch and LN in the transformer branch. If the two features are fused without any normalization, the misalignment between them can lead to training collapse.

Emm, what I mean is: why not use BN for both norms? I think LN is suitable for accelerating LSTM (RNN) training, but it does not achieve better results than BN when applied to CNNs.

pengzhiliang commented 3 years ago

Emm, we are not sure whether using BN in ViT would be effective, so the original structure of ViT is retained. As for the FCU, LN is used just to align the two features: if both sides used BN, the CNN feature would not match the LN-normalized transformer feature, so the alignment fails and training produces NaNs.
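For the opposite direction, a similarly hedged sketch of an up-sampler, where BN fits because the output re-enters the CNN branch; `FCUUp` and its exact layers are again illustrative assumptions:

```python
import torch.nn as nn
import torch.nn.functional as F

class FCUUp(nn.Module):
    """Transformer tokens -> CNN feature map (illustrative sketch, not the repo's exact code)."""
    def __init__(self, embed_dim, out_channels, up_stride=4):
        super().__init__()
        self.project = nn.Conv2d(embed_dim, out_channels, kernel_size=1)  # align channel dim
        self.bn = nn.BatchNorm2d(out_channels)  # BN, to match the CNN branch
        self.act = nn.ReLU(inplace=True)
        self.up_stride = up_stride

    def forward(self, tokens, h, w):  # tokens: [B, N, D] with N == h * w
        x = tokens.transpose(1, 2).reshape(tokens.size(0), -1, h, w)  # back to a 2D map
        x = self.act(self.bn(self.project(x)))
        return F.interpolate(x, scale_factor=self.up_stride)          # align spatial size
```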

xesdiny commented 3 years ago

> As for the FCU, LN is used just to align the two features: if both sides used BN, the CNN feature would not match the LN-normalized transformer feature, so the alignment fails and training produces NaNs.

LGTM!