Closed: Eutenacity closed this issue 1 month ago
This approach will inevitably result in some loss of precision, but I believe it won’t have a significant impact, especially since we’re already using Q4 quantization.
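To make this concrete, a rough round-trip check looks like the sketch below. The quantizer is an illustrative stand-in (plain symmetric group-wise 4-bit), not the actual Q4_K or Marlin packing, and the group sizes of 32 and 128 are assumptions meant to mimic the mismatch between the two formats; the point is only that the extra re-quantization error is on the same order as the quantization error already present.

```python
import torch

def quant_dequant_4bit(w: torch.Tensor, group_size: int) -> torch.Tensor:
    """Symmetric group-wise 4-bit quantize + dequantize.
    Illustrative stand-in, not the real Q4_K or Marlin packing."""
    rows, cols = w.shape
    wg = w.reshape(rows, cols // group_size, group_size)
    scale = wg.abs().amax(dim=-1, keepdim=True) / 7.0   # int4 symmetric range [-7, 7]
    q = torch.clamp(torch.round(wg / scale), -7, 7)
    return (q * scale).reshape(rows, cols)

w_fp32 = torch.randn(4096, 4096)

# Stand-in for the original Q4_K weights (Q4_K uses 32-element sub-blocks).
w_q4 = quant_dequant_4bit(w_fp32, group_size=32)

# Stand-in for dequantizing to FP32 and repacking for the Marlin kernel
# (Marlin is commonly run with a group size of 128).
w_repacked = quant_dequant_4bit(w_q4, group_size=128)

print("original quantization error:", (w_q4 - w_fp32).abs().mean().item())
print("extra round-trip error:     ", (w_repacked - w_q4).abs().mean().item())
```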
GGML also provides fast matmul CUDA kernels. Why not use them, since they fit q4_k, q4_m, and so on exactly?
The challenge is that GGML’s matrix multiplication is deeply intertwined with its memory allocation system, making it difficult to extract and reuse separately.
I tried to use the GGML matmul in PyTorch. I put my code at https://github.com/Eutenacity/python-ggml-matmul; I hope this can help. But I cannot use PyTorch CUDA Graphs to capture the matmul.
Not being able to use CUDA Graphs significantly slows down generation. Let's see if we can figure this out.
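For reference, the capture pattern PyTorch expects looks roughly like the sketch below. `matmul_op` is a plain `torch.matmul` standing in for the GGML-backed op; capture typically fails at that point if the op allocates memory outside PyTorch's capture-aware caching allocator, synchronizes the device, or launches kernels on a stream PyTorch does not manage.

```python
import torch

def matmul_op(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    # Placeholder for the GGML-backed matmul; plain torch.matmul is used here
    # so the sketch actually runs. The custom op would be dropped in at this point.
    return x @ w

x = torch.randn(1, 4096, device="cuda", dtype=torch.float16)
w = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)

# Warm up on a side stream so lazy initialization (handles, workspaces,
# first-touch allocations) happens before capture begins.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        y = matmul_op(x, w)
torch.cuda.current_stream().wait_stream(s)

# Capture the op into a CUDA graph. A GGML-backed op usually breaks here if it
# allocates with its own allocator, synchronizes, or uses unmanaged streams.
g = torch.cuda.CUDAGraph()
static_x = x.clone()
with torch.cuda.graph(g):
    static_y = matmul_op(static_x, w)

# Replay: refresh the static input in place, then rerun the recorded kernels.
static_x.copy_(torch.randn_like(static_x))
g.replay()
print(static_y.shape)
```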
Sorry, I am not familiar with quantization. I found that you dequantize Q4_K to FP32 and then re-quantize to 4-bit with Marlin, so I am curious: are Marlin and Q4_K fully equivalent? Does this kind of conversion affect accuracy?