microsoft / torchscale

Foundation Architecture for (M)LLMs
https://aka.ms/GeneralAI
MIT License

don't need attn weight in decoder #8

Closed buaahsh closed 1 year ago

buaahsh commented 1 year ago

Attention weights were originally designed for alignment models, so it is not necessary to include them in torchscale's decoder.
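A minimal sketch of what this looks like in practice (hypothetical, not torchscale's actual module): when attention weights are only needed for alignment, the caller can skip materializing them. PyTorch's `nn.MultiheadAttention` exposes this directly via `need_weights=False`, which returns `None` in place of the weight tensor and avoids averaging the per-head weights.

```python
import torch
import torch.nn as nn

# Hypothetical decoder self-attention call that does not request
# attention weights; module sizes here are illustrative only.
attn = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)

x = torch.randn(2, 16, 512)  # (batch, seq_len, embed_dim)

# need_weights=False: the second return value is None, and no
# attention-weight tensor is materialized for the caller.
out, attn_weights = attn(x, x, x, need_weights=False)
assert attn_weights is None
```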