ZhiyuanChen closed this issue 2 weeks ago
Hello, thank you for your kind words!
The placement of the layer norm you mentioned is indeed a slight oversight. However, I would advise against moving the layer norm if you decide to use our pretrained weights, as we trained the model with the current implementation. I cannot guarantee how well (or how poorly) the weights would perform with a slightly different architecture.
Thank you for your clarification.
Yes, I kept your original implementation for consistency.
Hey,
Thank you for your great work, it looks awesome.
When I tried to reproduce your work, I noticed a tiny issue:
In lines 122-128 of the module, the layer-norm output of the hidden states is used as the residual path of the attention block, whereas in the standard transformer implementation, the hidden states before layer norm are used as the residual path.
This shouldn't be a big problem, but if anyone is unable to reproduce the claimed results, this may be the cause.
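To make the difference concrete, here is a minimal sketch of the two wirings. All names here are hypothetical and the attention sub-layer is replaced by a stub; this is not the repo's actual code, just an illustration of where the residual branch is taken from:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the last (feature) dimension, no learned scale/shift.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def attention_stub(x):
    # Stand-in for the attention sub-layer; its internals don't matter here.
    return 0.5 * x

def block_norm_residual(hidden):
    # Wiring described in the issue: the residual is the LAYER-NORMED states.
    normed = layer_norm(hidden)
    return normed + attention_stub(normed)

def block_pre_ln_residual(hidden):
    # Standard pre-LN transformer: the residual is the RAW hidden states.
    normed = layer_norm(hidden)
    return hidden + attention_stub(normed)

x = np.random.default_rng(0).normal(size=(2, 4, 8))
# The two variants generally produce different outputs, which is why
# pretrained weights are tied to whichever wiring they were trained with.
print(np.allclose(block_norm_residual(x), block_pre_ln_residual(x)))
```

Since the outputs differ whenever the hidden states are not already normalized, weights trained under one wiring are not interchangeable with the other, which matches the author's advice above.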