Adding sqrt in the recurrent_forward of retnet to make it consistent with parallel_forward

microsoft / torchscale

Closed: wangmengzhi closed this 1 year ago

wangmengzhi commented 1 year ago

Adding sqrt to the scaling in RetNet's recurrent_forward can avoid numerical underflow and makes its output consistent with parallel_forward, which improves performance. Related issue: https://github.com/microsoft/torchscale/issues/47
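To illustrate the consistency claim, here is a minimal scalar sketch in plain Python. It is not the torchscale code: the function names `parallel_retention` and `recurrent_retention`, the scalar q/k/v values, and the decay `gamma` are all assumptions made for illustration, and any further normalization a real implementation applies is omitted. It assumes the parallel form divides each row of the decay mask by the square root of its row sum, and shows that a recurrent update carrying a running `scale` with matching sqrt factors gives the same outputs.

```python
import math


def parallel_retention(q, k, v, gamma):
    """Hypothetical parallel form: decay mask row n is gamma**(n - m) for m <= n,
    divided by the square root of the row sum."""
    out = []
    for n in range(len(q)):
        weights = [gamma ** (n - m) for m in range(n + 1)]
        norm = math.sqrt(sum(weights))
        retained = sum(w * k[m] * v[m] for m, w in enumerate(weights))
        out.append(q[n] * retained / norm)
    return out


def recurrent_retention(q, k, v, gamma):
    """Hypothetical recurrent form: one step at a time, with sqrt scaling
    chosen so the state equals the sqrt-normalized parallel sum."""
    out = []
    state = 0.0  # normalized running sum of k*v
    scale = 0.0  # running sum of decay weights (the mask row sum)
    for qn, kn, vn in zip(q, k, v):
        new_scale = scale * gamma + 1.0
        # Undo the previous sqrt normalization, decay, add the new term,
        # then renormalize by the new sqrt(scale).
        state = (
            state * (math.sqrt(scale) * gamma / math.sqrt(new_scale))
            + kn * vn / math.sqrt(new_scale)
        )
        scale = new_scale
        out.append(qn * state)
    return out


# Arbitrary example values (assumptions, not taken from the PR).
q = [0.5, -1.0, 2.0, 0.25]
k = [1.0, 0.5, -0.5, 2.0]
v = [2.0, 1.0, 3.0, -1.0]
gamma = 0.9

par = parallel_retention(q, k, v, gamma)
rec = recurrent_retention(q, k, v, gamma)
assert all(math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12) for a, b in zip(par, rec))
```

In this sketch the two forms agree because `state * sqrt(scale)` is always the unnormalized decayed sum, so dividing by `sqrt(new_scale)` at each step reproduces the parallel normalization. If the recurrent step divided by `scale` without the sqrt, the two forms would diverge after the first position.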