Hi,
May I ask a simple question: you claim 24K tokens/s with a 1.1B model, which works out to 56% efficiency.
But my CUDA code with pure cuBLAS GEMM calls on 2048x2048 matrices fails to reach 56% efficiency.
Note that a full model also has other operations, such as layernorm.
So my own 1B GPT with CUDA/bfloat16 only reaches 10K tokens/s.
It seems there is a chance to double my speed...
I would appreciate hearing your thoughts on this!
Yi
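For context, the efficiency figures above can be sanity-checked with a back-of-the-envelope MFU (model FLOPs utilization) calculation. This is a sketch under assumptions not stated in the post: the common ~6*N training FLOPs-per-token rule for an N-parameter transformer, and an A100's dense bf16 peak of 312 TFLOP/s (swap in your own GPU's peak if it differs).

```python
def mfu(params: float, tokens_per_s: float, peak_flops: float) -> float:
    """Fraction of peak FLOP/s achieved, using the ~6*N FLOPs/token rule.

    params       -- model parameter count
    tokens_per_s -- measured training throughput
    peak_flops   -- the GPU's peak FLOP/s for the dtype in use
    """
    achieved_flops = 6.0 * params * tokens_per_s
    return achieved_flops / peak_flops


# Assumption: A100 dense bf16 peak (no 2:4 sparsity), in FLOP/s.
A100_BF16_PEAK = 312e12

# Claimed: 24K tokens/s on a 1.1B-parameter model.
print(f"claimed:  {mfu(1.1e9, 24_000, A100_BF16_PEAK):.0%}")

# Observed: 10K tokens/s on a 1B-parameter model.
print(f"observed: {mfu(1.0e9, 10_000, A100_BF16_PEAK):.0%}")
```

Note that under these particular assumptions the claimed throughput lands closer to ~51% than 56%, so the exact figure depends on which GPU, peak spec, and FLOPs-per-token convention are used.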