microsoft / T-MAC

Low-bit LLM inference on CPU with lookup table
MIT License

Why is there no difference in the E2E performance of T-MAC and llama.cpp on an ARM machine? #61

Open ppp-max opened 1 month ago

ppp-max commented 1 month ago

I used an ARM machine to test the end-to-end output, but the performance does not match the results reported in the paper. The measured numbers for llama.cpp and T-MAC are nearly the same; I've posted the measured data below. [screenshots attached] The frequency of this machine is 2.5 GHz, and the bandwidth is 680 G/s per core.

kaleid-liner commented 1 month ago

Is 680 G/s the memory bandwidth? That seems invalid. You also didn't post llama.cpp's data. It would help if you provided the model architecture, whether it is 4-bit or 2-bit, and the device name.

ppp-max commented 1 month ago

Sorry, the data was pasted wrong. Here's llama.cpp's data, using the bitnet_b1_58-3B model with 4 threads. [screenshots attached] I then tested Llama-2-7b-EfficientQAT-w2g128-GPTQ and Llama-2-7b-EfficientQAT-w4g128-GPTQ, which show the same result: there is no difference in E2E performance between T-MAC and llama.cpp. I also recomputed the bandwidth of this machine, which is 340 G/s. Sorry about that. Looking forward to your reply. Thanks.
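In case it helps, one way to rule out a units mix-up between Gb/s and GB/s is to measure sustained bandwidth directly with a STREAM-style triad. A minimal sketch, not part of T-MAC or llama.cpp; the array size, iteration count, and build flags are arbitrary choices:

```cpp
// Minimal STREAM-style triad to estimate sustained memory bandwidth.
// Standalone sketch, not part of T-MAC or llama.cpp.
// Build: g++ -O2 -fopenmp triad.cpp -o triad   (OpenMP is optional)
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const size_t n = size_t(1) << 24;   // 16M doubles = 128 MiB per array, well past any LLC
    std::vector<double> a(n, 1.0), b(n, 2.0), c(n, 0.0);
    const int iters = 10;

    auto t0 = std::chrono::steady_clock::now();
    for (int it = 0; it < iters; ++it) {
        #pragma omp parallel for
        for (size_t i = 0; i < n; ++i)
            c[i] = a[i] + 3.0 * b[i];   // triad: 2 streamed reads + 1 streamed write
    }
    auto t1 = std::chrono::steady_clock::now();

    double secs  = std::chrono::duration<double>(t1 - t0).count();
    double bytes = 3.0 * double(n) * sizeof(double) * iters;  // total bytes moved
    // Print c[n/2] so the compiler cannot discard the loop as dead code.
    std::printf("~%.1f GB/s (check %.1f)\n", bytes / secs / 1e9, c[n / 2]);
    return 0;
}
```

A result far below 340 would suggest the earlier figure was in gigabits, or a theoretical peak rather than a sustained rate.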

QingtaoLi1 commented 1 week ago

@ppp-max Your speed is quite low while the memory bandwidth is strangely high. May I double-check whether 340 is gigabits or gigabytes? The speed you report is close to our Raspberry Pi, whose memory bandwidth is only about 48 GB/s. Also, do you see an obvious speed gap between T-MAC and llama.cpp with a single thread? If so, we tend to conclude that the 4-thread case hits the memory bound, as shown by the roofline model on our main page.
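For intuition: in the memory-bound regime, generating each token has to stream all the weights once, so the ceiling is roughly tokens/s ≤ bandwidth / weight bytes. A back-of-envelope sketch with illustrative numbers, not measurements from this issue:

```cpp
// Back-of-envelope roofline ceiling for token generation (memory-bound case).
// Assumption: each generated token streams all model weights once, so
//   tokens/s <= bandwidth_bytes_per_s / weight_bytes.
// All numbers below are illustrative placeholders.
#include <cstdio>

int main() {
    const double params     = 7e9;   // e.g. a Llama-2-7B-class model
    const double bits_per_w = 2.0;   // w2 quantization
    const double bw_GBps    = 8.0;   // hypothetical sustained bandwidth, GB/s

    const double weight_GB = params * bits_per_w / 8.0 / 1e9;  // 1.75 GB
    std::printf("weights: %.2f GB, ceiling: %.1f tokens/s\n",
                weight_GB, bw_GBps / weight_GB);
    return 0;
}
```

Once the measured speed sits at that ceiling, adding threads or a faster kernel cannot help, which is why the single-thread comparison is the informative one.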