lessw2020 / res2net-plus

Res2Net architecture with improved stem and Mish activation function
Apache License 2.0

GELU? #5

Closed · hendrycks closed this issue 4 years ago

hendrycks commented 4 years ago

Do you know how this model does with the GELU activation? It's available as nn.GELU / F.gelu in PyTorch. It was proposed a few years before Mish and is the default activation in BERT-based architectures.
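For reference, a minimal sketch of the two PyTorch entry points (plain PyTorch, nothing specific to this repo):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(4, 8)

gelu = nn.GELU()   # module form, drops into nn.Sequential
y1 = gelu(x)
y2 = F.gelu(x)     # functional form, same computation

assert torch.allclose(y1, y2)
```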

lessw2020 commented 4 years ago

Hi @hendrycks - I haven't explicitly tested GELU, but in the papers I've seen, Mish and Swish have been the two dominant activations in broad testing. There is a new one from MSFT called Dynamic ReLU that will probably top all of them (both the upper and lower slopes are learned during training, so it adapts to the data), but I have only tested it a little.
You can certainly plug GELU in here and try it out on your data to see how it works, but in general Mish and Swish are the two that have performed best to date (with Dynamic ReLU likely to outperform both). Hope that helps!
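As a rough sketch of what I mean by plugging it in (the conv block below is illustrative, not the exact stem code in this repo), make the activation a constructor argument and pass in nn.GELU instead of Mish:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Mish(nn.Module):
    # Mish(x) = x * tanh(softplus(x))
    def forward(self, x):
        return x * torch.tanh(F.softplus(x))

def conv_layer(in_ch, out_ch, act_cls=Mish):
    # Illustrative conv -> BN -> activation block; the real stem differs.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        act_cls(),
    )

block = conv_layer(3, 32, act_cls=nn.GELU)  # swap GELU in for Mish
out = block(torch.randn(2, 3, 64, 64))
```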

hendrycks commented 4 years ago

We tested the swish (SiLU) in the GELU paper and chose the GELU over the swish 1.5 years before the swish was independently proposed by Brain. When we proposed both x * Phi(x) and x * sigmoid(x), we chose x * Phi(x) since it was somewhat better. As for the swish and Mish papers, both proposed their nonlinearity before running any comparisons with the GELU, so the comparisons had incentives to preserve the story (version 1 of the swish paper and the Mish paper did not compare to the GELU, then were made aware of it, then ran comparisons). I don't have enough details to reproduce the swish paper's hyperparameter search.
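For concreteness, the two forms written out in plain PyTorch (Phi is the standard Gaussian CDF; this is just the definitions, not any particular training setup):

```python
import math
import torch

def gelu_exact(x):
    # GELU(x) = x * Phi(x), with Phi the standard normal CDF
    return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))

def silu(x):
    # swish / SiLU(x) = x * sigmoid(x)
    return x * torch.sigmoid(x)

x = torch.randn(5)
print(gelu_exact(x))
print(silu(x))
```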

"but in the papers I've seen Mish and Swish have been the two dominant activations in broad testing"

Yes, people ignored the GELU paper, which proposed x * Phi(x) and x * sigmoid(x), since it did not have ImageNet experiments (there simply weren't resources to run such expensive experiments). People started using x * sigmoid(x) after Google re-proposed it. After many years of no adoption, the NLP community now uses the GELU as its main nonlinearity (BERT, RoBERTa, XLNet, etc.). Neither Mish nor swish is in the PyTorch library (while the GELU is), so I'm not sure it's obvious that x * sigmoid(x) is clearly better and that the GELU can be easily dismissed. Hopefully this doesn't come off as snarky; it's just to let you know where I'm coming from.
