AndreaChiChengdu opened 2 months ago
This is the data we profiled on a OnePlus 12 (Snapdragon 8 Gen 3) in high-performance mode. Throughput is in tokens/s.
Model (threads) | T-MAC | llama.cpp | NPU (claimed)
---|---|---|---
llama-2-7b-2bit (NT=1) | 8.05 | 3.16 |
llama-2-7b-2bit (NT=2) | 10.00 | 3.76 |
llama-2-7b-2bit (NT=3) | 13.76 | 5.43 |
llama-2-7b-2bit (NT=4) | 16.62 | 6.95 |
llama-2-7b-4bit (NT=1) | 4.43 | 3.44 | 11.3
llama-2-7b-4bit (NT=2) | 5.82 | 4.67 | 11.3
llama-2-7b-4bit (NT=3) | 8.20 | 6.66 | 11.3
llama-2-7b-4bit (NT=4) | 10.19 | 8.24 | 11.3
The 8 Gen 3 is more complex than the X Elite due to its big.LITTLE architecture. The CPU frequency and achieved memory bandwidth differ between the big cores and the little cores, and most importantly, the CPI (cycles per instruction) of LUT or FMA instructions varies between big cores and little cores.
Meanwhile, the task scheduling of the llama.cpp threadpool is suboptimal. It assigns the same amount of computation to each core, so it fails to fully utilize the big core under multi-threading. We are currently conducting some low-level profiling and will resolve this issue.
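To illustrate the scheduling problem, here is a minimal sketch (not llama.cpp's or T-MAC's actual threadpool code; `split_rows` and the per-core weights are hypothetical) of a big.LITTLE-aware partitioner that splits GEMM rows in proportion to each core's measured throughput instead of evenly, so all threads finish at roughly the same time:

```cpp
// Sketch only: a hypothetical big.LITTLE-aware row partitioner, not llama.cpp's
// actual threadpool. It splits `n_rows` of a GEMM in proportion to a per-thread
// throughput weight instead of evenly, so the big core gets a larger share.
#include <cstdio>
#include <numeric>
#include <utility>
#include <vector>

// weights[i] ~ relative throughput of the core running thread i
// (e.g. measured from a short calibration run of the mpGEMM kernel).
std::vector<std::pair<int, int>> split_rows(int n_rows, const std::vector<double>& weights) {
    double total = std::accumulate(weights.begin(), weights.end(), 0.0);
    std::vector<std::pair<int, int>> ranges;  // [begin, end) row range per thread
    int begin = 0;
    for (size_t i = 0; i < weights.size(); ++i) {
        int take = (i + 1 == weights.size())
                       ? n_rows - begin                                  // last thread takes the remainder
                       : static_cast<int>(n_rows * weights[i] / total);  // proportional share
        ranges.emplace_back(begin, begin + take);
        begin += take;
    }
    return ranges;
}

int main() {
    // Assumed example: one big core roughly 2x as fast as three medium cores.
    std::vector<double> weights = {2.0, 1.0, 1.0, 1.0};
    for (auto [b, e] : split_rows(4096, weights)) {
        std::printf("rows [%d, %d)\n", b, e);
    }
    return 0;
}
```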
Thank you very much, the data is very useful to me. I found this repo is based on llama.cpp b2854, and the latest version of llama.cpp uses OpenMP to accelerate multi-threaded parallelism, which looks useful.
the latest version of llama.cpp uses OpenMP to accelerate multi-threaded parallelism
Thanks for the info. We are working on merging the latest llama.cpp.
@AndreaChiChengdu I've added the updated 2-bit T-MAC data to the table above (due to some profiling issues last time, including overheating and interference from tvmrpc_release.apk). All other results have been successfully reproduced and are as expected. The speedup of 2-bit T-MAC over 4-bit T-MAC is now as anticipated (i.e., a 2x speedup). The remaining issue is thread scheduling on Android. I'll address this by merging the latest llama.cpp OpenMP support.
To provide more details, here is some low-level profiling on the 8 Gen 3 Android device. We output the elapsed time of each T-MAC mpGEMM kernel on each thread, in microseconds (us):
ith elapsed 0: 95
ith elapsed 3: 160
ith elapsed 2: 161
ith elapsed 1: 160
ith elapsed 0: 84
ith elapsed 2: 161
ith elapsed 3: 162
ith elapsed 1: 162
ith elapsed 0: 207
ith elapsed 3: 430
ith elapsed 1: 431
ith elapsed 2: 431
We can clearly observe that the main thread (on the big core) needs only ~1/2 the latency of the medium cores. Thread ith=0 then busy-waits for the other cores to complete. Hopefully this issue will be resolved in the latest llama.cpp.
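For context, below is a minimal sketch of why the OpenMP path should help (assuming an OpenMP build compiled with -fopenmp; this is not the actual llama.cpp integration). With `schedule(dynamic)`, threads pull fine-grained tiles on demand, so a faster big core ends up processing more tiles instead of busy-waiting:

```cpp
// Sketch only: dynamic scheduling over fine-grained tiles on big.LITTLE.
// With schedule(static), every thread would get exactly n_tiles/n_threads
// tiles (the current equal split); schedule(dynamic, 1) hands out tiles
// on demand, so faster cores naturally take more of them.
#include <cstdio>
#include <omp.h>
#include <vector>

int main() {
    const int n_tiles = 256;
    std::vector<int> tiles_done(omp_get_max_threads(), 0);

    #pragma omp parallel for schedule(dynamic, 1)
    for (int t = 0; t < n_tiles; ++t) {
        // ... the per-tile mpGEMM kernel would run here ...
        tiles_done[omp_get_thread_num()]++;  // each thread touches only its own slot
    }

    for (size_t i = 0; i < tiles_done.size(); ++i) {
        std::printf("thread %zu processed %d tiles\n", i, tiles_done[i]);
    }
    return 0;
}
```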
@kaleid-liner Yes, I found the same problem. In addition, the paper mentions 3-bit, but current engineering practice is mainly 2-bit and 4-bit. What is the situation with 3-bit, and which Q3 variant of GGUF is the comparison benchmark? Thanks!
@AndreaChiChengdu The lack of 3-bit in common practice, from our insights, is due to the technical difficulty of packing 3-bit values into bytes and decoding them efficiently. Most model developers also assume 3-bit is not good for inference, so they won't try 3-bit at all. However, T-MAC has solved this problem through bit-wise lookup and can achieve linear speedup for 3-bit. EfficientQAT already provides 3-bit models, and the tradeoff between accuracy and model size is pretty good.
What is the situation with 3-bit, and which Q3 variant of GGUF is the comparison benchmark?
The baseline is llama.cpp Q3_K, which is the fastest 3-bit implementation (at least in the version we use, from about 2 months ago).
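To make the linear-speedup point concrete, here is a scalar reference sketch of the bit-wise lookup idea (not the optimized T-MAC LUT kernel; the per-plane partial sums are computed directly here rather than fetched from a precomputed table, and scales/zero-points are omitted): a 3-bit weight is decomposed into three 1-bit planes, each plane contributes a partial dot product scaled by 2^b, so the cost grows linearly with the bit width.

```cpp
// Sketch only: bit-serial decomposition of a low-bit dot product.
#include <cstdint>
#include <cstdio>
#include <vector>

// Dot product of activations with unsigned w-bit weights, one bit plane at a time.
float bitserial_dot(const std::vector<float>& act, const std::vector<uint8_t>& w, int bits) {
    float acc = 0.0f;
    for (int b = 0; b < bits; ++b) {
        float plane = 0.0f;
        for (size_t i = 0; i < act.size(); ++i) {
            plane += ((w[i] >> b) & 1) ? act[i] : 0.0f;  // contribution of 1-bit plane b
        }
        acc += plane * float(1 << b);  // scale the plane by 2^b
    }
    return acc;
}

int main() {
    std::vector<float> act = {0.5f, -1.0f, 2.0f, 0.25f};
    std::vector<uint8_t> w = {5, 3, 7, 1};  // 3-bit weights in [0, 7]
    // Prints 13.75, matching 0.5*5 - 1*3 + 2*7 + 0.25*1.
    std::printf("dot = %f\n", bitserial_dot(act, w, /*bits=*/3));
    return 0;
}
```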
Thanks, but in this project as it currently stands, it looks like -m does not support llama-2-7b-3bit.
@AndreaChiChengdu Yes, because our integration supports most models through the GPTQ format, which currently doesn't provide a 3-bit format. We just need a standardized 3-bit packing format. Maybe I can try the EfficientQAT 3-bit format.
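As an illustration of what a standardized packing could look like (purely hypothetical; this is not a format defined by GPTQ, EfficientQAT, or T-MAC), eight 3-bit values fit exactly into three bytes, and the awkward bit boundaries are the decoding difficulty mentioned above:

```cpp
// Sketch only: a hypothetical dense 3-bit packing, eight values per 3 bytes.
#include <array>
#include <cstdint>
#include <cstdio>

std::array<uint8_t, 3> pack8x3(const std::array<uint8_t, 8>& v) {
    uint32_t bits = 0;
    for (int i = 0; i < 8; ++i) {
        bits |= uint32_t(v[i] & 0x7) << (3 * i);  // place value i at bit offset 3*i
    }
    return {uint8_t(bits), uint8_t(bits >> 8), uint8_t(bits >> 16)};
}

std::array<uint8_t, 8> unpack8x3(const std::array<uint8_t, 3>& p) {
    uint32_t bits = uint32_t(p[0]) | (uint32_t(p[1]) << 8) | (uint32_t(p[2]) << 16);
    std::array<uint8_t, 8> v{};
    for (int i = 0; i < 8; ++i) {
        v[i] = (bits >> (3 * i)) & 0x7;  // recover value i
    }
    return v;
}

int main() {
    std::array<uint8_t, 8> w = {7, 0, 5, 2, 3, 6, 1, 4};
    auto back = unpack8x3(pack8x3(w));
    for (int i = 0; i < 8; ++i) {
        std::printf("%d -> %d\n", w[i], back[i]);  // round-trips unchanged
    }
    return 0;
}
```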
Hi there, I am using an 8 Gen 3 (Xiaomi 14 Pro, 68 GB/s bandwidth) and following the Android Cross Compilation Guidance, Option 1: Use Prebuilt Kernels, to test llama-2-7b-4bit token generation performance. It looks like the T-MAC CPU performance is worse than the NPU. Where can I optimize? Thanks.
P.S.
1. The phone battery is above 80% with high performance mode on. The phone's Geekbench/Ludashi benchmark scores are right in the 8 Gen 3 range.
2. cmd: `python tools/run_pipeline.py -o ~/andreaji/condatmac/T-MAC/3rdparty/llama.cpp/Llama-2-7b-EfficientQAT-w4g128-GPTQ -m llama-2-7b-4bit -d android -ndk $NDK_HOME -u`
3. My change in run_pipeline.py is extending the prompt from 24 tokens to 256 tokens.
Framework | Model | NUM_THREADS | Throughput (tokens/sec)
---|---|---|---
T-MAC (CPU) | llama-2-7b (W4) | 2 | 4.46 (my data, at -n 128)
T-MAC (CPU) | llama-2-7b (W4) | 4 | 6.61~8.2 (my data, at -n 128)
NPE (NPU) | llama-2-7b (W4) | - | 11.3 (Qualcomm AI Hub, near the X Elite's 10.3)