microsoft / T-MAC

Low-bit LLM inference on CPU with lookup table
MIT License

8Gen3 T-MAC CPU performance issue #32

Open AndreaChiChengdu opened 2 weeks ago

AndreaChiChengdu commented 2 weeks ago

Hi there, I am using an 8Gen3 device (Xiaomi 14 Pro, 68 GB/s bandwidth) and following the Android Cross Compilation Guidance, Option 1: Use Prebuilt Kernels, to test llama-2-7b-4bit token generation performance. It looks like T-MAC CPU performance is worse than the NPU. Where can I optimize? Thanks.

P.S.
1. The phone battery is above 80% and high performance mode is enabled; the phone's Geekbench/Ludashi benchmark scores are in the normal 8Gen3 range.
2. Command: python tools/run_pipeline.py -o ~/andreaji/condatmac/T-MAC/3rdparty/llama.cpp/Llama-2-7b-EfficientQAT-w4g128-GPTQ -m llama-2-7b-4bit -d android -ndk $NDK_HOME -u
3. My change in run_pipeline.py is extending the prompt from 24 tokens to 256 tokens.

| Framework   | Model           | NUM_THREADS | Throughput (tokens/sec)                         |
|-------------|-----------------|-------------|-------------------------------------------------|
| T-MAC (CPU) | llama-2-7b (W4) | 2           | 4.46 (my data, -n 128)                          |
| T-MAC (CPU) | llama-2-7b (W4) | 4           | 6.61~8.2 (my data, -n 128)                      |
| NPE (NPU)   | llama-2-7b (W4) | -           | 11.3 (Qualcomm AI Hub; close to X Elite's 10.3) |

kaleid-liner commented 2 weeks ago

This is the data we profiled on a OnePlus 12 (Snapdragon 8 GEN 3) in high performance mode; throughput is in tokens/sec.

| Model                  | T-MAC | llama.cpp | NPU (claimed) |
|------------------------|-------|-----------|---------------|
| llama-2-7b-2bit (NT=1) | 8.05  | 3.16      |               |
| llama-2-7b-2bit (NT=2) | 10.00 | 3.76      |               |
| llama-2-7b-2bit (NT=3) | 13.76 | 5.43      |               |
| llama-2-7b-2bit (NT=4) | 16.62 | 6.95      |               |
| llama-2-7b-4bit (NT=1) | 4.43  | 3.44      | 11.3          |
| llama-2-7b-4bit (NT=2) | 5.82  | 4.67      | 11.3          |
| llama-2-7b-4bit (NT=3) | 8.20  | 6.66      | 11.3          |
| llama-2-7b-4bit (NT=4) | 10.19 | 8.24      | 11.3          |

The 8 GEN 3 is more complex than the X Elite due to its big.LITTLE architecture. The CPU frequency and achieved memory bandwidth differ between the big cores and the little cores, and most importantly, the CPI (clocks per instruction) of LUT or FMA instructions varies between big cores and little cores.

Meanwhile, the task scheduling of the llama.cpp threadpool is suboptimal: it assigns the same amount of computation to each core, so it fails to fully utilize the big core under multi-threading. We are currently conducting some low-level profiling and will resolve this issue.
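
To make this concrete, here is a minimal sketch (illustrative only, not the actual llama.cpp or T-MAC threadpool code; the weights are made-up numbers) contrasting the current equal split with a split weighted by per-core throughput:

```cpp
// Illustrative only -- not the actual llama.cpp threadpool code.
// llama.cpp currently gives every thread the same share of rows, roughly:
//   rows_per_thread = n_rows / n_threads;
// On a big.LITTLE SoC, a split weighted by per-core throughput would keep
// the big core busy for the same wall-clock time as the smaller cores.

#include <cstddef>
#include <numeric>
#include <vector>

std::vector<int> split_rows(int n_rows, const std::vector<double>& core_weight) {
    // core_weight[i] ~ relative throughput of the core running thread i,
    // e.g. {2.0, 1.0, 1.0, 1.0} if the big core is ~2x faster (assumed numbers).
    const double total = std::accumulate(core_weight.begin(), core_weight.end(), 0.0);
    std::vector<int> share(core_weight.size(), 0);
    int assigned = 0;
    for (std::size_t i = 0; i + 1 < core_weight.size(); ++i) {
        share[i] = static_cast<int>(n_rows * core_weight[i] / total);
        assigned += share[i];
    }
    share.back() = n_rows - assigned;  // remainder goes to the last thread
    return share;
}
```

With an equal split, the big core finishes its rows early and then spins in the threadpool barrier while the medium/little cores are still working.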

AndreaChiChengdu commented 2 weeks ago


Thank you very much, the data is very useful to me. I found this repo is based on llama.cpp b2854; the latest version of llama.cpp uses OpenMP to accelerate multi-threaded parallelism, which looks useful.

kaleid-liner commented 2 weeks ago

> the latest version of llama.cpp uses OpenMP to accelerate multi-threaded parallelism

Thanks for the info. We are working on merging the latest llama.cpp.

kaleid-liner commented 1 week ago

@AndreaChiChengdu I've added the updated 2-bit T-MAC data to the table above (there were some profiling issues last time, including overheating and interference from tvmrpc_release.apk). All other results have been successfully reproduced and are as expected. The speedup of 2-bit T-MAC over 4-bit T-MAC is now as anticipated (i.e., a 2x speedup). The remaining issue is thread scheduling on Android. I'll address this by merging the latest llama.cpp OpenMP support.

kaleid-liner commented 1 week ago

To provide more detail, here is some low-level profiling on Android (8 GEN 3). We output the elapsed time of each T-MAC mpGEMM kernel on each thread, in microseconds (us):

ith elapsed 0: 95
ith elapsed 3: 160
ith elapsed 2: 161
ith elapsed 1: 160
ith elapsed 0: 84
ith elapsed 2: 161
ith elapsed 3: 162
ith elapsed 1: 162
ith elapsed 0: 207
ith elapsed 3: 430
ith elapsed 1: 431
ith elapsed 2: 431

We can clearly observe that the main thread (on the big core) needs only ~1/2 the latency of the medium cores; thread ith=0 then busy-waits for the other cores to complete. Hopefully this issue will be resolved with the latest llama.cpp.
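
For reference, OpenMP's dynamic scheduling is one way such imbalance gets absorbed: threads pull small chunks from a shared queue, so the faster big core simply processes more chunks instead of busy-waiting. A hypothetical sketch (chunk size and function names are illustrative, not T-MAC code):

```cpp
// Hypothetical sketch, not T-MAC/llama.cpp code: with schedule(dynamic),
// each thread grabs the next chunk of tiles when it finishes the previous
// one, so the big core ends up processing more tiles and no thread
// busy-waits for long at the end of the loop.

#include <omp.h>

void run_mpgemm_tiles(int n_tiles) {
    #pragma omp parallel for schedule(dynamic, 4)  // chunk size 4 is arbitrary
    for (int t = 0; t < n_tiles; ++t) {
        // process_tile(t);  // placeholder for the per-tile mpGEMM kernel
    }
}
```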

AndreaChiChengdu commented 1 week ago


@kaleid-liner Yes, I found the same problem. In addition, the paper mentions 3-bit, but current engineering practice is mainly 2-bit and 4-bit. What is the situation with 3-bit, and which Q3 variant of gguf is the comparison benchmark? Thanks!

kaleid-liner commented 1 week ago

@AndreaChiChengdu The lack of 3-bit in common practice, from our insights, is due to the technical difficulty of packing 3-bit values into bytes and decoding them efficiently. Most model developers also assume 3-bit is not good for inference, so they won't try 3-bit at all. However, T-MAC has solved this problem with bit-wise lookup and can achieve a linear speedup for 3-bit. EfficientQAT already provides 3-bit models, and the tradeoff between accuracy and model size is pretty good.
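
For intuition, here is a rough sketch of the bit-plane idea (my own illustration, not T-MAC's actual kernel; it assumes plain unsigned weights with no zero-point): an n-bit weight is split into n 1-bit planes, each plane reuses the same 1-bit LUT kernel, and the partial results are combined with shifts, so 3-bit costs roughly 3x the 1-bit kernel instead of needing a special 3-bit decode path.

```cpp
// Illustration of bit-plane decomposition (not T-MAC's actual kernel).
// dot(w, x) = sum_b (1 << b) * dot(plane_b, x), where plane_b holds bit b
// of every weight, so each plane can go through the same 1-bit LUT kernel.

#include <cstddef>
#include <cstdint>
#include <vector>

std::vector<std::vector<std::uint8_t>>
to_bit_planes(const std::vector<std::uint8_t>& w, int bits) {
    std::vector<std::vector<std::uint8_t>> planes(
        bits, std::vector<std::uint8_t>(w.size()));
    for (int b = 0; b < bits; ++b)
        for (std::size_t i = 0; i < w.size(); ++i)
            planes[b][i] = (w[i] >> b) & 1;   // extract bit b of weight i
    return planes;
}
```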

> What is the situation with 3-bit, and which Q3 variant of gguf is the comparison benchmark?

The baseline is llama.cpp Q3_K, which is the fastest 3-bit implementation (at least in the version we use, from about 2 months ago).

AndreaChiChengdu commented 1 week ago


Thanks, but in this project as it stands, it looks like -m does not support llama-2-7b-3bit.

kaleid-liner commented 1 week ago

@AndreaChiChengdu Yes, because our integration supports most models through the GPTQ format, which currently doesn't provide a 3-bit format. We just need a standardized 3-bit packing format. Maybe I can try the EQAT 3-bit format.
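
For illustration, a "standardized" 3-bit packing could be as simple as packing eight 3-bit values (24 bits) into three bytes. This is just a hypothetical layout, not an existing GPTQ or EfficientQAT format:

```cpp
// Hypothetical 3-bit packing layout, not an existing GPTQ/EfficientQAT format:
// eight 3-bit values (24 bits) packed little-endian into three bytes.

#include <cstdint>

void pack8_3bit(const std::uint8_t vals[8], std::uint8_t out[3]) {
    std::uint32_t acc = 0;
    for (int i = 0; i < 8; ++i)
        acc |= static_cast<std::uint32_t>(vals[i] & 0x7) << (3 * i);
    out[0] =  acc        & 0xFF;
    out[1] = (acc >> 8)  & 0xFF;
    out[2] = (acc >> 16) & 0xFF;
}

std::uint8_t unpack_3bit(const std::uint8_t in[3], int i) {
    const std::uint32_t acc = in[0] | (in[1] << 8) | (in[2] << 16);
    return (acc >> (3 * i)) & 0x7;   // value i, with 0 <= i < 8
}
```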