zhewang1-intc opened this issue:

Hi, thank you for your outstanding work.

I am currently trying to profile the kernel-level performance of 2-bit T-MAC GEMM in llama.cpp. From this issue, I learned that I can use the test-backend-ops tool provided by llama.cpp for benchmarking. However, based on my experiments, when running the llama-3-8b-2bit model, the weight's ggml_type in llama.cpp is GGML_TYPE_I2, and there is no test case for this type in test-backend-ops. When I added GGML_TYPE_I2 to the list of ggml_types to be tested, the program triggered an assertion error:

GGML_ASSERT: /home/gta/T-MAC/3rdparty/llama.cpp/ggml.c:3153: view_src == NULL || data_size == 0 || data_size + view_offs <= ggml_nbytes(view_src)

How can I quickly complete the kernel-level performance testing of 2-bit T-MAC GEMM in llama.cpp?
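For reference, the change described above amounts to a one-line addition to the type list that test-backend-ops sweeps over. A minimal sketch, assuming the list is the `all_types` array as in upstream llama.cpp's tests/test-backend-ops.cpp; the exact array name and contents in the T-MAC fork may differ:

```cpp
// tests/test-backend-ops.cpp (sketch, not the exact upstream contents):
// the list of tensor types that the MUL_MAT tests iterate over.
static const ggml_type all_types[] = {
    GGML_TYPE_F32, GGML_TYPE_F16,
    GGML_TYPE_Q4_0, GGML_TYPE_Q4_1,
    GGML_TYPE_Q8_0,
    GGML_TYPE_Q2_K, GGML_TYPE_Q4_K,
    GGML_TYPE_I2,  // added: T-MAC 2-bit weight type; this is what triggers the GGML_ASSERT above
};
```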
kaleid-liner: You can use tools/profile.py, as in https://github.com/microsoft/T-MAC/issues/44#issuecomment-2349601221.

If you still want to conduct it in llama.cpp, you need to achieve it in a hackier way, by adding GGML_TYPE_Q2_K to https://github.com/kaleid-liner/llama.cpp/blob/70c312d654539860b4839e7851432b75813edaa1/ggml-tmac.cpp#L379 and https://github.com/kaleid-liner/llama.cpp/blob/70c312d654539860b4839e7851432b75813edaa1/ggml-tmac.cpp#L72.
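A minimal sketch of what that hack could look like. The two linked lines are type checks in ggml-tmac.cpp that gate dispatch to the T-MAC kernels; the function name `is_type_supported` below is an assumption for illustration, so check the actual code at those lines in the pinned commit:

```cpp
// ggml-tmac.cpp (sketch): let Q2_K tensors take the T-MAC mul_mat path.
// Both linked locations would need the same extra case.
static bool is_type_supported(enum ggml_type type) {
    switch (type) {
        case GGML_TYPE_I2:    // T-MAC 2-bit weights (already supported)
        // ... other types already handled here ...
        case GGML_TYPE_Q2_K:  // hacky addition: route Q2_K through T-MAC so
                              // test-backend-ops can exercise the kernel
            return true;
        default:
            return false;
    }
}
```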
zhewang1-intc: Hi @kaleid-liner, thank you for your response.

> If you still want to conduct it in llama.cpp, you need to achieve it in a hackier way, by adding GGML_TYPE_Q2_K to ggml-tmac.cpp#L379 and ggml-tmac.cpp#L72.

I tried the hacky way with llama.cpp, but unfortunately the program threw a segmentation fault in the ggml_compute_forward_mul_mat function, and I haven't had the chance to look into it closely yet.

> You can use tools/profile.py, as in https://github.com/microsoft/T-MAC/issues/44#issuecomment-2349601221.

As for running profile.py, I have the following question: does the qgemm_lut column in the CSV data exported by profile.py include the time consumed by generating the preprocess LUT? If not, wouldn't this result in overly optimistic performance measurements? After all, whenever the activation matrix changes, the LUT has to be rebuilt.
kaleid-liner: @zhewang1-intc I'm not sure about the cause of the segmentation fault, as I have profiled the kernel in llama.cpp this way before. Another option is to add GGML_TYPE_I2 support to test-backend-ops, though it does require some effort.
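If you go the test-backend-ops route, the addition would presumably look something like the following. This is a sketch against the upstream test_mul_mat helper in tests/test-backend-ops.cpp; its exact signature varies across llama.cpp revisions, and the shape values here are assumptions, so check both against the T-MAC fork:

```cpp
// tests/test-backend-ops.cpp (sketch): benchmark MUL_MAT with I2 weights.
// m and k should match a real layer shape from llama-3-8b-2bit (4096x4096 is
// an assumed example); n = 1 corresponds to single-token decoding.
test_cases.emplace_back(new test_mul_mat(
    GGML_TYPE_I2,   // type_a: T-MAC 2-bit weight tensor
    GGML_TYPE_F32,  // type_b: activations
    4096, 1, 4096,  // m, n, k
    {1, 1},         // batch dims
    {1, 1}));       // repeat factors
```

Note that the earlier GGML_ASSERT hints that ggml_nbytes for I2 may not match how test-backend-ops allocates and views tensors, so the I2 path would likely need its own tensor initialization as well.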
You can measure the LUT preprocessor separately by specifying -k preprocessor when running profile.py, and add its latency to qgemm_lut. However, the preprocessor only occupies ~1% of the total latency.

zhewang1-intc: Thanks, I got a reasonable result. I will close this issue.