microsoft / torchscale

Foundation Architecture for (M)LLMs
https://aka.ms/GeneralAI
MIT License

don't need attn weight in decoder #8

Closed buaahsh closed 1 year ago

buaahsh commented 1 year ago

Attention weights were originally designed for alignment models, so it is not necessary to include them in torchscale's decoder.
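A minimal sketch of what this looks like in practice (hypothetical, not torchscale's actual module): when attention weights are only needed for alignment, the caller can skip materializing them. PyTorch's `nn.MultiheadAttention` exposes this directly via `need_weights=False`, which returns `None` in place of the weight tensor and avoids averaging the per-head weights.

```python
import torch
import torch.nn as nn

# Hypothetical decoder self-attention call that does not request
# attention weights; module sizes here are illustrative only.
attn = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)

x = torch.randn(2, 16, 512)  # (batch, seq_len, embed_dim)

# need_weights=False: the second return value is None, and no
# attention-weight tensor is materialized for the caller.
out, attn_weights = attn(x, x, x, need_weights=False)
assert attn_weights is None
```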