Hi,
May I ask a simple question: you claim 24K tokens/s with a 1.1B model, which works out to 56% efficiency.
But my CUDA code with pure cuBLAS GEMM calls on 2048x2048 matrices fails to reach 56% efficiency.
Note that a full model also has other operations, such as layernorm.
So my own 1B GPT with CUDA/bfloat16 only reaches 10K tokens/s.
It seems there is a chance to double my speed...
I would appreciate hearing your thoughts on this!
Yi
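For context, the efficiency figures above can be sanity-checked with a back-of-the-envelope MFU (model FLOPs utilization) calculation. This is a sketch under assumptions not stated in the post: the common ~6*N training FLOPs-per-token rule for an N-parameter transformer, and an A100's dense bf16 peak of 312 TFLOP/s (swap in your own GPU's peak if it differs).

```python
def mfu(params: float, tokens_per_s: float, peak_flops: float) -> float:
    """Fraction of peak FLOP/s achieved, using the ~6*N FLOPs/token rule.

    params       -- model parameter count
    tokens_per_s -- measured training throughput
    peak_flops   -- the GPU's peak FLOP/s for the dtype in use
    """
    achieved_flops = 6.0 * params * tokens_per_s
    return achieved_flops / peak_flops


# Assumption: A100 dense bf16 peak (no 2:4 sparsity), in FLOP/s.
A100_BF16_PEAK = 312e12

# Claimed: 24K tokens/s on a 1.1B-parameter model.
print(f"claimed:  {mfu(1.1e9, 24_000, A100_BF16_PEAK):.0%}")

# Observed: 10K tokens/s on a 1B-parameter model.
print(f"observed: {mfu(1.0e9, 10_000, A100_BF16_PEAK):.0%}")
```

Note that under these particular assumptions the claimed throughput lands closer to ~51% than 56%, so the exact figure depends on which GPU, peak spec, and FLOPs-per-token convention are used.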