Closed HoraceXIaoyiBao closed 9 months ago
Hi, in Table 4 of your RetNet paper, the naive Transformer 1.3B model costs more GPU memory than the 2.7B model. Could you please explain why?

Besides, could you please tell me which positional encoding method is used for the naive Transformer in the paper "Retentive Network: A Successor to Transformer for Large Language Models": learnable embeddings, rotary encoding, or xPos?

I really appreciate the help you offer!
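For context on the three candidates named in the question: a learnable embedding is an additive per-position vector, rotary encoding (RoPE) rotates query/key channel pairs by position-dependent angles, and xPos further applies an exponential length-decay on top of the rotation. Below is a minimal NumPy sketch of the rotary variant only, assuming the common half-split channel pairing; it is illustrative and not the paper's actual implementation.

```python
import numpy as np

def rotary_embed(x, base=10000.0):
    """Apply rotary position embedding (RoPE) to x of shape (seq_len, dim).

    Channel pair (i, i + dim//2) at position p is rotated by the angle
    p * theta_i, where theta_i = base**(-2*i/dim). Illustrative sketch only.
    """
    seq_len, dim = x.shape
    half = dim // 2
    # Per-pair rotation frequencies (higher channels rotate more slowly).
    inv_freq = base ** (-np.arange(half) * 2.0 / dim)
    # Angle for every (position, channel-pair) combination.
    angles = np.outer(np.arange(seq_len), inv_freq)  # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # Standard 2D rotation applied pair-wise.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

# Unlike an additive learned embedding, the rotation preserves vector norms,
# and position 0 is left unchanged (all angles are zero there).
x = np.random.randn(8, 16)
rotated = rotary_embed(x)
print(np.allclose(np.linalg.norm(x, axis=-1), np.linalg.norm(rotated, axis=-1)))
```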