microsoft / T-MAC

Low-bit LLM inference on CPU with lookup table
MIT License

How to Fully Utilize the Optimized Performance of T-MAC? #30

ma-hang opened this issue 2 weeks ago

ma-hang commented 2 weeks ago

I followed the documentation to run the llama-2-7b model (4-bit quantized) and also ran it on llama.cpp for comparison. I noticed that, except for nt=1, where there was a slight performance improvement, performance with nt=4/8 was actually worse than with llama.cpp. The command and parameters used were: `python tools/run_pipeline.py -o $HOME/tmactest/T-MAC/model2 -m llama-2-7b-4bit -nt 1`. It's also worth mentioning that while there was a significant performance improvement during the prefill phase, there was no such improvement during the decode phase.

Output sample:

```
python tools/run_pipeline.py -o $HOME/tmactest/T-MAC/model2 -m llama-2-7b-4bit -nt 4
Microsoft Corporation is an American multinational corporation and technology company headquartered in Redmond, Washington. Home to some of the m>
Microsoft Office 365 (MSO365) is the world’s most popular office suite, used by more than 180 million users. Microsoft Office 365 is a cloud-based su>
Microsoft Office 365 is a cloud-based suite of productivity applications that includes Microsoft Office, Exchange, SharePoint, and Skype for Business>
llama_print_timings:        load time =   600.02 ms
llama_print_timings:      sample time =     3.13 ms /   128 runs   (   0.02 ms per token, 40920.72 tokens per second)
llama_print_timings: prompt eval time =  1213.28 ms /    24 tokens (  50.55 ms per token,    19.78 tokens per second)
llama_print_timings:        eval time = 19007.30 ms /   127 runs   ( 149.66 ms per token,     6.68 tokens per second)
llama_print_timings:       total time = 20237.61 ms /   151 tokens
Log end
```

kaleid-liner commented 2 weeks ago

Are you using one of the devices included in our profiling? If not, could you share the specifics of your platform? Based on our observations on some older-generation devices (particularly AVX2 CPUs), there are several potential causes:

  1. Restricted memory bandwidth: if the memory bandwidth of the tested platform is very low (for instance, 10–30 GB/s), inference will be completely memory-bound; see the roofline sketch after this list. This scenario can occur on older PCs equipped with 1–2 channels of DDR4/DDR5 memory.

  2. On Intel CPUs prior to Ice Lake, the CPI of the `pshufb` instruction is doubled, i.e., the instruction is twice as slow (see here), which can hurt the performance of T-MAC.

However, modern edge devices come with increasingly high memory bandwidth. For example, up to 74 GB/s for mobile phones equipped with a Snapdragon 8 Gen 3, 135 GB/s for laptops equipped with a Snapdragon X Elite, and even 800 GB/s for the M2 Ultra. Moreover, even on the older-generation devices mentioned above, T-MAC should still offer a significant speedup for 2-bit models.

kaleid-liner commented 2 weeks ago

> It's also worth mentioning that while there was a significant performance improvement during the prefill phase, there was no such improvement during the decode phase

This is a clear clue that decoding is bottlenecked by memory bandwidth.