Checking the code, I found that the head uses Squared ReLU instead of StarReLU, and after some experiments replacing it, I found that performance actually decreased. Was there a reason to select Squared ReLU specifically for the classifier head?
The reason is that a LayerNorm is placed after the activation and before the final FC (see this line) to stabilize training. Since LayerNorm normalizes the activation's output distribution, using Squared ReLU is sufficient in this case.
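For reference, here is a minimal PyTorch sketch of the head structure described above (FC, then Squared ReLU, then LayerNorm, then the final FC). The class and argument names (`MlpHead`, `mlp_ratio`) are illustrative, not taken verbatim from the repo:

```python
import torch
import torch.nn as nn


class SquaredReLU(nn.Module):
    """Squared ReLU: relu(x) ** 2."""
    def forward(self, x):
        return torch.relu(x) ** 2


class MlpHead(nn.Module):
    """Sketch of a classifier head: FC -> Squared ReLU -> LayerNorm -> FC.
    Hyperparameters here are assumptions for illustration only."""
    def __init__(self, dim, num_classes, mlp_ratio=4):
        super().__init__()
        hidden = int(dim * mlp_ratio)
        self.fc1 = nn.Linear(dim, hidden)
        self.act = SquaredReLU()
        # LayerNorm after the activation, before the final FC,
        # normalizes the activated features and stabilizes training.
        self.norm = nn.LayerNorm(hidden)
        self.fc2 = nn.Linear(hidden, num_classes)

    def forward(self, x):
        x = self.fc1(x)
        x = self.act(x)
        x = self.norm(x)
        return self.fc2(x)
```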