I tested the end-to-end output, but the performance does not match the results reported in the paper. I would like to test the kernel performance mentioned in the paper on my own machine. Is there any script that reproduces the results shown in Figures 6 and 7 of the paper? Thank you :) Nice work!
You should use llama-bench for e2e benchmarking. We have also provided tools/bench_e2e.py as a helper script. To profile kernel-level results, you can use tools/profile.py, which is what we used for Figures 6 and 7.
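For example, an e2e run looks roughly like the following; the model path, thread count, and token counts are only illustrative, and the exact arguments of the two Python helpers may differ, so check their --help first:

```sh
# e2e throughput with llama-bench (in a CMake build the binary is under llama.cpp/build/bin)
./build/bin/llama-bench -m models/llama-2-7b.Q4_0.gguf -t 4 -p 512 -n 128

# helper script for e2e runs and kernel-level profiling from the T-MAC repo;
# check each script's actual arguments before use
python tools/bench_e2e.py --help
python tools/profile.py --help
```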
Thank you for your prompt reply and guidance. I will check how the kernel performs to see where the problem lies.
How can I test the kernel performance in llama.cpp? Thanks.
Use `llama.cpp/tests/test-backend-ops`. The binary will be generated after `make` or `make tests` under `llama.cpp/build`. Please specify `cmake -DLLAMA_TMAC=OFF` when building the baseline.
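Concretely, the baseline build could look something like this (the directory layout follows the steps above; the exact location of the test binary may vary between versions):

```sh
cd llama.cpp
mkdir -p build && cd build
cmake .. -DLLAMA_TMAC=OFF   # drop the flag (or set it to ON) for the T-MAC build
make tests                  # builds the test binaries, including test-backend-ops
./bin/test-backend-ops      # binary location may differ (e.g. ./tests/) depending on the version
```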
When I tried running profile.py, I saw the following message: "Cannot find config for target=llvm -keys=cpu -mcpu=core-avx2 -mtriple=x86_64-unknown-linux-gnu, workload=('qgemm_lut_t1_int8_m12288_k4096_n1_b3', 12288, 1, 4096). A fallback configuration is used, which may bring great performance regression." However, I still achieved kernel performance quite close to the results in the paper, yet my e2e performance is still much worse than the paper's results. What could be the cause of this? Thanks. Besides, to test the kernel performance in llama.cpp, can I simply run llama.cpp/tests/test-backend-ops directly?
@orange-juice1 It indicates that the performance is bottlenecked by memory bandwidth. New-gen edge devices have increasingly high memory bandwidth and can thus fully leverage the computational efficiency of T-MAC. As stated in README.md:
Note: We have noticed many users attempting to evaluate T-MAC on old-gen x86 platforms. However, x86 CPUs vary dramatically, and due to unawareness of AI workloads, most of these platforms have extremely low memory bandwidth (even lower than Raspberry Pi 5). Our current tests do not encompass all x86 platforms, particularly older generations. As a result, we cannot guarantee significant speedup (especially for 4-bit token generation) on all x86 platforms. We recommend Surface Book 3 or ARM devices to evaluate T-MAC.
Thank you very much for your prompt response. However, I am still confused about how to test the kernel performance of llama.cpp. It would be better if I could measure llama.cpp's kernel performance directly for comparison.
@orange-juice1 Sure, you can run tests/test-backend-ops. `test-backend-ops perf -o mul_mat` should work.
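That is, from the llama.cpp/build directory, roughly:

```sh
# time only the MUL_MAT op; run this once for the T-MAC build and once for the
# -DLLAMA_TMAC=OFF baseline, with the same thread count, and compare the numbers
./bin/test-backend-ops perf -o mul_mat
```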
In fact, I ran `test-backend-ops perf -o mul_mat`, but I found that the results contained no matrices with the same shapes as in T-MAC, such as [4096, 4096, 1]. :(
@orange-juice1 You can edit the code to add shapes.
Thanks. From the results I obtained:
MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=1,k=4096,bs=[1,1],nr=[1,1]): 449 runs - 352.27 us/run - 74768 kB/run - 202.41 GB/s
MUL_MAT(type_a=q4_0,type_b=f32,m=11008,n=1,k=4096,bs=[1,1],nr=[1,1]): 167 runs - 980.84 us/run - 200939 kB/run - 195.37 GB/s
MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=1,k=11008,bs=[1,1],nr=[1,1]): 168 runs - 959.02 us/run - 200912 kB/run - 199.79 GB/s
I noticed that the speed is very fast, even faster than T-MAC, and I'm not sure if I did something wrong. Can these results be directly compared with the 4-bit case in T-MAC for matrices of the same size? For m=4096, n=1, k=4096 at 4 bits, I got 0.69 ms in T-MAC.
test-backend-ops will use all CPU cores by default. Have you made the num_threads consistent?
Yes, I also realized this issue, and I am currently looking into how to set the threads for test-backend-ops. Sorry about that. Thank you so much :) I set nt=1 in run_pipeline and thought that would work.
@orange-juice1 Great, you can check our modifications to test-backend-ops. We have provided an environment variable to set the num_threads: https://github.com/kaleid-liner/llama.cpp/commit/b995bfdac2a8000f9bcb08ea1b7a15bf77a8089a
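Usage is along these lines; note that N_THREADS_VAR below is only a placeholder, and the actual variable name is the one introduced in the linked commit:

```sh
# placeholder: replace N_THREADS_VAR with the environment variable added in the
# commit above to pin the number of threads used by test-backend-ops
N_THREADS_VAR=1 ./bin/test-backend-ops perf -o mul_mat
```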
Yes, I found it right when you replied to me. If I modify it, do I need to recompile in the pipeline and then run llama.cpp again? I think that's the case. Thank you so much.
You only need to recompile llama.cpp in llama.cpp/build by running `make tests`.
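That is, after editing the test shapes:

```sh
cd llama.cpp/build && make tests   # rebuild test-backend-ops; no need to rerun the full pipeline
```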
Thank you for your guidance; I have conducted multiple tests. However, I noticed that when testing llama.cpp, the operation was MUL_MAT(type_a=q2_K, type_b=f32, m=4096, n=1, k=4096, bs=[1,1], nr=[1,1]). When I changed type_b to FP16, it showed "not supported." But based on my understanding, T-MAC uses FP16. So I would like to ask: in the comparison results, if llama.cpp also used FP16, how was that achieved?
llama.cpp quantizes the fp32 activations to int8 for further computation, and T-MAC applies the same quantization. The FP16 in T-MAC is only used for scale multiplication; the inputs and outputs of the T-MAC kernel in llama.cpp are still fp32.
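Schematically, for the quantized mat-vec path in llama.cpp (a rough sketch of the q4_0/q8_0-style dot product, not an exact transcription of the kernels), each block of weights and activations carries an fp16 scale $d$, while the quantized values $q$ are low-bit integers:

$$
y \;\approx\; \sum_{b} d_w^{(b)}\, d_x^{(b)} \sum_{i \in b} q_{w,i}\, q_{x,i}
$$

so fp16 only enters through the per-block scale multiplications, and the bulk of the arithmetic stays in the integer domain.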
So this result is the same as the results in Figures 6 and 7. Thank you very much.
@orange-juice1 Great! I will close this issue as completed then.