Closed · lhallee closed this 1 month ago
Hello,

Great work! Is it okay to say this is just a standard vanilla MLP block? According to the Hugging Face implementation, there is an additional third linear layer and an added element-wise multiplication.

I think this has been confusing to some readers, but perhaps it has been used before and I am unaware. Is there any insight you can offer about why this layer was added? It seems to add more expressiveness to the experts, but I didn't know whether you had experimented with and without it.

a normal SwiGLU here (mlp)

This is showing up more often, but using the w3 is definitely not the norm?

I mean, it is a normal, i.e., vanilla SwiGLU here, not a norm.
I meant "normal" not norm, sorry. Where is a swiglue mentioned in papers? Most transformers do not have three Linear layers in the MLP, including the original / vanilla transformer.
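For contrast, a sketch of the vanilla transformer MLP from "Attention Is All You Need": two linear layers and an activation, with no gate and no third projection (reusing the imports from the sketch above).

```python
class VanillaMLP(nn.Module):
    """Original transformer feed-forward block: two linear layers, no gating."""

    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Project up, apply the nonlinearity, project back down
        return self.fc2(F.relu(self.fc1(x)))
```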