I tested the end-to-end output, but the performance does not match the results reported in the paper. I would like to test the kernel performance mentioned in the paper on my own machine. Is there any script that reproduces the results shown in Figures 6 and 7 of the paper? Thank you :) Nice work!
You should use llama-bench for e2e benchmarking. We have also provided tools/bench_e2e.py as a helper script. To profile kernel-level results, you can use tools/profile.py, which is what we used for Figures 6 and 7.
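For example, an e2e run looks roughly like the following; the model path, thread count, and token counts are only illustrative, and the exact arguments of the two Python helpers may differ, so check their --help first:

```sh
# e2e throughput with llama-bench (in a CMake build the binary is under llama.cpp/build/bin)
./build/bin/llama-bench -m models/llama-2-7b.Q4_0.gguf -t 4 -p 512 -n 128

# helper script for e2e runs and kernel-level profiling from the T-MAC repo;
# check each script's actual arguments before use
python tools/bench_e2e.py --help
python tools/profile.py --help
```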
Thank you for your prompt reply and guidance. I will check how the kernel performs to see where the problem lies.
How can I test the kernel performance in llama.cpp? Thanks.
Use `llama.cpp/tests/test-backend-ops`. The binary will be generated after `make` or `make tests` under `llama.cpp/build`. Please specify `cmake -DLLAMA_TMAC=OFF` when building the baseline.
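Concretely, the baseline build could look something like this (the directory layout follows the steps above; the exact location of the test binary may vary between versions):

```sh
cd llama.cpp
mkdir -p build && cd build
cmake .. -DLLAMA_TMAC=OFF   # drop the flag (or set it to ON) for the T-MAC build
make tests                  # builds the test binaries, including test-backend-ops
./bin/test-backend-ops      # binary location may differ (e.g. ./tests/) depending on the version
```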
When I tried running profile.py, I saw the following message: "Cannot find config for target=llvm -keys=cpu -mcpu=core-avx2 -mtriple=x86_64-unknown-linux-gnu, workload=('qgemm_lut_t1_int8_m12288_k4096_n1_b3', 12288, 1, 4096). A fallback configuration is used, which may bring great performance regression." However, I still achieved kernel performance quite close to the results in the paper, yet my e2e performance is still much worse than the paper's results. What could be the cause of this? Thanks. Besides, to test the kernel performance in llama.cpp, can I simply run llama.cpp/tests/test-backend-ops directly?
@orange-juice1 It indicates that the performance is bottlenecked by memory bandwidth. New-gen edge devices have increasingly high memory bandwidth and can thus fully leverage the computational efficiency of T-MAC. As stated in README.md:
Note: We have noticed many users attempting to evaluate T-MAC on old-gen x86 platforms. However, x86 CPUs vary dramatically, and due to unawareness of AI workloads, most of these platforms have extremely low memory bandwidth (even lower than Raspberry Pi 5). Our current tests do not encompass all x86 platforms, particularly older generations. As a result, we cannot guarantee significant speedup (especially for 4-bit token generation) on all x86 platforms. We recommend Surface Book 3 or ARM devices to evaluate T-MAC.
Thank you very much for your prompt response. However, I am still confused about how to test the kernel performance of llama.cpp. It would be better if I could measure llama.cpp's kernel performance directly for comparison.
@orange-juice1 Sure, you can run tests/test-backend-ops. `test-backend-ops perf -o mul_mat` should work.
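That is, from the llama.cpp/build directory, roughly:

```sh
# time only the MUL_MAT op; run this once for the T-MAC build and once for the
# -DLLAMA_TMAC=OFF baseline, with the same thread count, and compare the numbers
./bin/test-backend-ops perf -o mul_mat
```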
In fact, I ran `test-backend-ops perf -o mul_mat`, but I found that the results contained no matrices with the same shapes as in T-MAC, such as [4096, 4096, 1]. :(
@orange-juice1 You can edit the code to add shapes.
Thanks. From the results I obtained:
MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=1,k=4096,bs=[1,1],nr=[1,1]): 449 runs - 352.27 us/run - 74768 kB/run - 202.41 GB/s
MUL_MAT(type_a=q4_0,type_b=f32,m=11008,n=1,k=4096,bs=[1,1],nr=[1,1]): 167 runs - 980.84 us/run - 200939 kB/run - 195.37 GB/s
MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=1,k=11008,bs=[1,1],nr=[1,1]): 168 runs - 959.02 us/run - 200912 kB/run - 199.79 GB/s
I noticed that the speed is very fast, even faster than T-MAC, and I'm not sure if I did something wrong. Can these results be directly compared with the 4-bit case in T-MAC for matrices of the same size? For m=4096, n=1, k=4096 at 4 bits, I got 0.69 ms in T-MAC.
test-backend-ops will use all CPU cores by default. Have you made the num_threads consistent?
Yes, I also realized this issue, and I am currently looking into how to set the threads for test-backend-ops. Sorry about that. Thank you so much :) I set nt=1 in run_pipeline and thought that would work.
@orange-juice1 Great, you can check our modifications to test-backend-ops. We have provided an environment variable to set the num_threads: https://github.com/kaleid-liner/llama.cpp/commit/b995bfdac2a8000f9bcb08ea1b7a15bf77a8089a
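Usage is along these lines; note that N_THREADS_VAR below is only a placeholder, and the actual variable name is the one introduced in the linked commit:

```sh
# placeholder: replace N_THREADS_VAR with the environment variable added in the
# commit above to pin the number of threads used by test-backend-ops
N_THREADS_VAR=1 ./bin/test-backend-ops perf -o mul_mat
```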
Yes, I found it right when you replied to me. If I modify it, do I need to recompile in the pipeline and then run llama.cpp again? I think that's the case. Thank you so much.
You only need to recompile llama.cpp in llama.cpp/build by running `make tests`.
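That is, after editing the test shapes:

```sh
cd llama.cpp/build && make tests   # rebuild test-backend-ops; no need to rerun the full pipeline
```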
Thank you for your guidance; I have conducted multiple tests. However, I noticed that when testing llama.cpp, the operation was MUL_MAT(type_a=q2_K, type_b=f32, m=4096, n=1, k=4096, bs=[1,1], nr=[1,1]). When I changed type_b to FP16, it showed "not supported." But based on my understanding, T-MAC uses FP16. So I would like to ask: in the comparison results, if llama.cpp also used FP16, how was that achieved?
llama.cpp quantizes the fp32 activations to int8 for further computation, and T-MAC applies the same quantization. The FP16 in T-MAC is only used for scale multiplication; the inputs and outputs of the T-MAC kernel in llama.cpp are still fp32.
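Schematically, for the quantized mat-vec path in llama.cpp (a rough sketch of the q4_0/q8_0-style dot product, not an exact transcription of the kernels), each block of weights and activations carries an fp16 scale $d$, while the quantized values $q$ are low-bit integers:

$$
y \;\approx\; \sum_{b} d_w^{(b)}\, d_x^{(b)} \sum_{i \in b} q_{w,i}\, q_{x,i}
$$

so fp16 only enters through the per-block scale multiplications, and the bulk of the arithmetic stays in the integer domain.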
So this result is the same as the results in Figures 6 and 7. Thank you very much.
@orange-juice1 Great! I will close this issue as completed then.