Closed: yhl48 closed this issue 6 months ago.
Standard "-former" structure is: Skip(Norm->Attn/S5)->Skip(Norm->FFN/GLU)
, in this case FFN is the GLU variant. The paper doesn't specify any structure for the final model, but the reference code has the similar structure:
https://github.com/lindermanlab/S5/blob/3c18fdb6b06414da35e77b94b9cd855f6a95ef17/s5/layers.py#L63-L90
Although there they use Skip(Norm->S5->FFN/GLU), i.e. a single skip wrapping both the S5 and the FFN/GLU.
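To make the two layouts concrete, here is a minimal PyTorch sketch. The class names and the `mixer`/`ffn` constructor arguments are illustrative placeholders, not the repo's actual API:

```python
import torch.nn as nn

class StandardFormerBlock(nn.Module):
    """Standard layout: Skip(Norm->Attn/S5) -> Skip(Norm->FFN/GLU)."""
    def __init__(self, dim, mixer, ffn):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mixer = mixer  # attention or an S5 layer
        self.ffn = ffn      # GLU-variant feed-forward

    def forward(self, x):
        x = x + self.mixer(self.norm1(x))  # first skip: Norm -> Attn/S5
        x = x + self.ffn(self.norm2(x))    # second skip: Norm -> FFN/GLU
        return x

class ReferenceS5Block(nn.Module):
    """Reference-code layout: Skip(Norm->S5->FFN/GLU), one residual branch."""
    def __init__(self, dim, s5, ffn):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.s5 = s5
        self.ffn = ffn

    def forward(self, x):
        return x + self.ffn(self.s5(self.norm(x)))  # single skip around everything
```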
In case you are confused about the ff_enc naming: it's actually a 2-layer MLP that fans in and fans out around the GLU, so you encode into the FF dim, apply the GLU, and then decode back to the common hidden dim.
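A minimal sketch of that fan-in/fan-out pattern, assuming the encoder produces `2 * ff_dim` features because `F.glu` halves the last dimension (the actual repo code may split the gate differently):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GLUFeedForward(nn.Module):
    """2-layer MLP around a GLU: fan in to the FF dim, gate, fan back out."""
    def __init__(self, dim, ff_dim):
        super().__init__()
        # fan-in; width is 2*ff_dim because F.glu halves the last dimension
        self.ff_enc = nn.Linear(dim, 2 * ff_dim)
        # fan-out back to the common hidden dim
        self.ff_dec = nn.Linear(ff_dim, dim)

    def forward(self, x):
        return self.ff_dec(F.glu(self.ff_enc(x), dim=-1))

# maps (batch, seq, dim) -> (batch, seq, dim)
ff = GLUFeedForward(dim=128, ff_dim=256)
y = ff(torch.randn(2, 10, 128))
```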
Thanks for the good work. Doesn't the encoder come before the S5 layer, or am I missing something here?
https://github.com/i404788/s5-pytorch/blob/5aad8a13c8c7c76e34b8786258e59a2f88e52936/s5/s5_model.py#L352