microsoft / torchscale

Foundation Architecture for (M)LLMs
https://aka.ms/GeneralAI

About training memory #75

Closed HoraceXIaoyiBao closed 9 months ago

HoraceXIaoyiBao commented 9 months ago

Hi, in your RetNet paper Table 4, the naive Transformer 1.3B model costs more GPU memory than the 2.7B model. Could you please explain why?

HoraceXIaoyiBao commented 9 months ago

Also, could you please tell me which position encoding method is used for the naive Transformer in the paper "Retentive Network: A Successor to Transformer for Large Language Models": learnable embeddings, rotary encoding, or xPos?

I really appreciate the help!

sunyt32 commented 9 months ago
  1. Our vocab size is more than 100,000, so the total parameter count is $2048^2 \times 12 \times 24 + 2048 \times 100{,}000 > 1{,}400{,}000{,}000$ (see the sketch below).
  2. We use xPos as the baseline position embedding for Transformers.
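
A minimal Python sketch of that back-of-the-envelope count, using the same numbers as the formula (hidden size 2048, 24 layers, the usual ~12·d² parameters per layer for attention + FFN, and the 100,000-token lower bound on vocab size):

```python
# Back-of-the-envelope parameter count for the "1.3B" Transformer baseline.
# Assumes the standard decoder sizing: ~12*d^2 parameters per layer
# (4*d^2 for the attention projections + 8*d^2 for the FFN), plus the
# token-embedding table. The vocab size here is only the 100k lower bound.
hidden_dim = 2048
num_layers = 24
vocab_size = 100_000

per_layer_params = 12 * hidden_dim ** 2      # attention + FFN weights per layer
embedding_params = vocab_size * hidden_dim   # token embedding table

total = num_layers * per_layer_params + embedding_params
print(f"{total:,}")  # 1,412,759,552 -> already over 1.4B parameters
```

So even before counting layer norms and biases, the "1.3B" configuration already exceeds 1.4B parameters once the large vocabulary is included, which is why its memory footprint is larger than the nominal size suggests.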