microsoft / torchscale

Foundation Architecture for (M)LLMs
https://aka.ms/GeneralAI
MIT License

Adding sqrt in the recurrent_forward of retnet to make it consistent with parallel_forward #50

Closed (wangmengzhi closed 1 year ago)

wangmengzhi commented 1 year ago

Adding sqrt to the recurrent_forward of RetNet makes its scaling consistent with parallel_forward and can avoid numerical underflow, which improves both consistency and performance. Related issue: https://github.com/microsoft/torchscale/issues/47
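
To illustrate the consistency claim, here is a minimal pure-Python sketch, not the torchscale code itself. It assumes a single head, a scalar decay `gamma`, scalar values, and a decay mask whose rows are divided by the square root of their sum; the function names `parallel_retention` and `recurrent_retention` are hypothetical. Any further normalization done in the real parallel_forward is left out.

```python
import math

def parallel_retention(qs, ks, vs, gamma):
    """Parallel form: row n of the decay mask is divided by sqrt(row sum)."""
    outputs = []
    for n in range(len(qs)):
        weights = [gamma ** (n - m) for m in range(n + 1)]
        norm = math.sqrt(sum(weights))
        out = 0.0
        for m in range(n + 1):
            qk = sum(a * b for a, b in zip(qs[n], ks[m]))
            out += qk * (weights[m] / norm) * vs[m]
        outputs.append(out)
    return outputs

def recurrent_retention(qs, ks, vs, gamma):
    """Recurrent form: the state is rescaled with sqrt factors at every step."""
    state = [0.0] * len(ks[0])  # running sum of k * v, already normalized
    scale = 0.0                 # running sum of decay weights (assumed init)
    outputs = []
    for q, k, v in zip(qs, ks, vs):
        prev_scale = scale
        scale = prev_scale * gamma + 1.0
        carry = math.sqrt(prev_scale) * gamma / math.sqrt(scale)
        state = [s * carry + ki * v / math.sqrt(scale)
                 for s, ki in zip(state, k)]
        outputs.append(sum(qi * si for qi, si in zip(q, state)))
    return outputs

# Toy inputs (made up for illustration only).
qs = [[1.0, 0.5], [0.2, -0.3], [0.7, 0.1], [-0.4, 0.9]]
ks = [[0.3, 0.8], [-0.5, 0.4], [0.6, 0.2], [0.1, -0.7]]
vs = [1.0, -2.0, 0.5, 3.0]
gamma = 0.9

par = parallel_retention(qs, ks, vs, gamma)
rec = recurrent_retention(qs, ks, vs, gamma)
max_diff = max(abs(a - b) for a, b in zip(par, rec))
```

Under these assumptions the two forms agree up to floating-point rounding, because the recurrent state times `sqrt(scale)` is exactly the unnormalized decayed sum that the parallel form computes row by row.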