Open Akshay1-6180 opened 9 months ago
Do you know why GELU is used here? @Akshay1-6180
Based on experiments, GELU was found to have a significantly smoother gradient transition than ReLU, which is abrupt and sharp around zero; if you look at both functions you will see the difference. Also, the GPT-2 code uses GELU, and so do many other models I have come across, so I went with it.
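To make the "smoother gradient" point concrete, here is a minimal sketch in plain Python (not the code from this repo) that compares the derivative of ReLU with the derivative of exact GELU, `x * Phi(x)`, around zero. The function names are my own, chosen for illustration. The tanh variant at the end is the approximation that GPT-2-style implementations commonly use; it is included only to show that it stays close to the exact form.

```python
import math


def relu_grad(x: float) -> float:
    # ReLU derivative: jumps from 0 to 1 at x = 0 (we take 0 at exactly x = 0).
    return 1.0 if x > 0 else 0.0


def gelu(x: float) -> float:
    # Exact GELU: x * Phi(x), where Phi is the standard normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_grad(x: float) -> float:
    # d/dx [x * Phi(x)] = Phi(x) + x * phi(x), with phi the standard normal PDF.
    cdf = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    return cdf + x * pdf


def gelu_tanh(x: float) -> float:
    # Tanh approximation of GELU.
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


# Compare the two gradients at a few points on either side of zero.
for x in (-1.0, -0.1, 0.0, 0.1, 1.0):
    print(f"x={x:+.1f}  relu'={relu_grad(x):.3f}  gelu'={gelu_grad(x):.3f}")
```

Between `x = -0.1` and `x = 0.1` the ReLU gradient jumps by a full 1.0, while the GELU gradient moves only slightly and passes through 0.5 at zero.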
Going through these papers 1) https://arxiv.org/pdf/1603.05027.pdf 2) https://arxiv.org/pdf/2302.06112.pdf
I feel the order should be this