Checking the code, I found that the head uses Squared ReLU instead of StarReLU, and after some experiments replacing it, I found that performance actually decreased. Was there a reason to select Squared ReLU specifically for the classifier head?
The reason is that a LayerNorm is placed after the activation and before the final FC (see this line) to stabilize training. Since LayerNorm normalizes the activation's output distribution, using Squared ReLU is sufficient in this case.
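For reference, here is a minimal PyTorch sketch of the head structure described above (FC, then Squared ReLU, then LayerNorm, then the final FC). The class and argument names (`MlpHead`, `mlp_ratio`) are illustrative, not taken verbatim from the repo:

```python
import torch
import torch.nn as nn


class SquaredReLU(nn.Module):
    """Squared ReLU: relu(x) ** 2."""
    def forward(self, x):
        return torch.relu(x) ** 2


class MlpHead(nn.Module):
    """Sketch of a classifier head: FC -> Squared ReLU -> LayerNorm -> FC.
    Hyperparameters here are assumptions for illustration only."""
    def __init__(self, dim, num_classes, mlp_ratio=4):
        super().__init__()
        hidden = int(dim * mlp_ratio)
        self.fc1 = nn.Linear(dim, hidden)
        self.act = SquaredReLU()
        # LayerNorm after the activation, before the final FC,
        # normalizes the activated features and stabilizes training.
        self.norm = nn.LayerNorm(hidden)
        self.fc2 = nn.Linear(hidden, num_classes)

    def forward(self, x):
        x = self.fc1(x)
        x = self.act(x)
        x = self.norm(x)
        return self.fc2(x)
```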