microsoft / T-MAC

Low-bit LLM inference on CPU with lookup table

How to profile the perf of 2bit T-MAC GEMM in llama.cpp #50

Closed: zhewang1-intc closed this issue 1 month ago

zhewang1-intc commented 2 months ago

Hi, thank you for your outstanding work.

I am currently trying to profile the kernel-level performance of 2bit T-MAC GEMM in llama.cpp.

From this issue, I learned that I can use the test-backend-ops tool provided by llama.cpp for benchmarking.

In my experiments, however, when running the llama-3-8b-2bit model, the weight's ggml_type in llama.cpp is GGML_TYPE_I2, and test-backend-ops has no test case for this type. When I added GGML_TYPE_I2 to the list of ggml_types to be tested, the program triggered an assertion error: GGML_ASSERT: /home/gta/T-MAC/3rdparty/llama.cpp/ggml.c:3153: view_src == NULL || data_size == 0 || data_size + view_offs <= ggml_nbytes(view_src)

How can I quickly complete the kernel-level performance testing of 2bit T-MAC GEMM in llama.cpp?

kaleid-liner commented 2 months ago

You can use tools/profile.py as described in https://github.com/microsoft/T-MAC/issues/44#issuecomment-2349601221

If you still want to do this in llama.cpp, you need a more hacky approach: add GGML_TYPE_Q2_K to https://github.com/kaleid-liner/llama.cpp/blob/70c312d654539860b4839e7851432b75813edaa1/ggml-tmac.cpp#L379 and https://github.com/kaleid-liner/llama.cpp/blob/70c312d654539860b4839e7851432b75813edaa1/ggml-tmac.cpp#L72
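
For anyone following along, here is a rough sketch of the kind of change being suggested. The helper name and surrounding code are assumptions for illustration only, not the actual contents of ggml-tmac.cpp at those line numbers:

```cpp
// ggml-tmac.cpp -- illustrative sketch only; the real code at L72/L379 may differ.
#include "ggml.h"  // ggml_type enums (GGML_TYPE_I2 exists only in the T-MAC fork)

// Around L72: the check that decides which weight types T-MAC handles.
// Also accepting GGML_TYPE_Q2_K lets test-backend-ops (which knows how to
// build Q2_K test tensors) route its mul_mat cases through T-MAC.
static bool is_type_supported(enum ggml_type type) {  // assumed helper name
    return type == GGML_TYPE_I2
        || type == GGML_TYPE_Q2_K;  // added only for kernel-level benchmarking
}

// Around L379: where the weight type is mapped to a T-MAC kernel configuration;
// treat GGML_TYPE_Q2_K like the 2-bit GGML_TYPE_I2 path so the 2-bit LUT GEMM
// kernel is the one being exercised.
```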

zhewang1-intc commented 2 months ago

Hi @kaleid-liner, thank you for your response.

> If you still want to do this in llama.cpp, you need a more hacky approach: add GGML_TYPE_Q2_K to https://github.com/kaleid-liner/llama.cpp/blob/70c312d654539860b4839e7851432b75813edaa1/ggml-tmac.cpp#L379 and https://github.com/kaleid-liner/llama.cpp/blob/70c312d654539860b4839e7851432b75813edaa1/ggml-tmac.cpp#L72

I tried the hacky way with llama.cpp, but unfortunately, the program threw a segmentation fault in the ggml_compute_forward_mul_mat function, and I haven’t had the chance to look into it closely yet.

> You can use tools/profile.py as described in https://github.com/microsoft/T-MAC/issues/44#issuecomment-2349601221

As for running the profile.py file, I have the following questions:

  1. It seems that we need to specify the GEMM problem size in profile.py, and TVM then compiles an optimal kernel for that problem size. However, the compiled kernel may not be exactly the same as the one used in llama.cpp. If I only care about kernel-level performance in llama.cpp, is the performance data obtained from profile.py still a meaningful guide?
  2. Does the qgemm_lut column in the CSV exported by profile.py include the time spent generating the LUT (preprocess_LUT)? If not, wouldn't this make the performance measurements overly optimistic? After all, whenever the activation matrix changes, the LUT has to be rebuilt.
kaleid-liner commented 1 month ago

@zhewang1-intc

I'm not sure about the cause of the segmentation fault, as I used to profile the kernel in llama.cpp this way. Another option is to add I2 support to test-backend-ops, though that does require some effort.
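
If it helps, the registration side of that would look roughly like the sketch below. The test_mul_mat constructor arguments follow the usual llama.cpp test-backend-ops pattern and the GEMV shape is an assumption, so both may need adjusting for the pinned revision; an I2 tensor would also need a proper initializer (packed 2-bit data plus scales) instead of the generic random fill, which is where most of the effort goes:

```cpp
// tests/test-backend-ops.cpp -- illustrative sketch only.
// Register a mul_mat case that uses 2-bit T-MAC weights against f32 activations.
test_cases.emplace_back(new test_mul_mat(
    /* type_a */ GGML_TYPE_I2,   // 2-bit T-MAC weight type
    /* type_b */ GGML_TYPE_F32,  // activations
    /* m, n, k */ 4096, 1, 4096, // assumed llama-3-8b-like GEMV shape
    /* bs */ {1, 1},             // no batch dims
    /* nr */ {1, 1}));           // no broadcast repeats
```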

  1. According to our profiling, they are consistent.
  2. You can pass -k preprocessor to profile.py and add its time in. However, the preprocessor only accounts for ~1% of the total latency (see the rough accounting note below).
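
As a rough accounting example (the numbers here are hypothetical, not measurements): if qgemm_lut reports, say, 500 µs for a given shape and the preprocessor contributes ~1% of the total latency, folding it in adds only on the order of 5 µs, so the qgemm_lut column by itself is already a close estimate of the per-call kernel cost.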
zhewang1-intc commented 1 month ago

Thanks, I got a reasonable result; I will close this issue.