Open breuderink opened 3 years ago
@breuderink ohh this is actually a trick from a Shazeer paper https://arxiv.org/pdf/2002.05202.pdf that should give an extra performance boost, but i should probably make it optional to stay faithful to the original paper
Reading the code I found the following implementation for the feed-forward MLP of the Perceiver IO:
I could not find references to a gated GELU in the PerceiverIO paper, nor in the code.
Is there a particular reason to use GEGLU instead of GELU?
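For anyone comparing the two activations, here is a minimal sketch of the difference in plain Python. It is not the code from this repository: the names `gelu`, `gelu_layer` and `geglu_layer` are made up for illustration, and the choice of which half of the vector acts as the gate is an assumed convention. It follows the GEGLU definition in the Shazeer paper, where the first linear layer produces twice as many features and one half gates the other.

```python
import math


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_layer(hidden: list[float]) -> list[float]:
    """Plain GELU: applied element-wise, output width equals input width."""
    return [gelu(h) for h in hidden]


def geglu_layer(hidden: list[float]) -> list[float]:
    """GEGLU: split the vector in two halves and gate one with the other.

    Assumption: the first half holds the values and the second half the
    gates, so the output has half the width of the input.
    """
    if len(hidden) % 2 != 0:
        raise ValueError("GEGLU needs an even number of features")
    half = len(hidden) // 2
    values, gates = hidden[:half], hidden[half:]
    return [v * gelu(g) for v, g in zip(values, gates)]
```

Because GEGLU halves the width, the preceding linear layer has to project to twice the hidden size to keep the output width the same as with plain GELU, which is where the extra parameters come from.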