Closed: Eutenacity closed this issue 1 month ago
This approach will inevitably result in some loss of precision, but I believe it won’t have a significant impact, especially since we’re already using Q4 quantization.
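To make this concrete, a rough round-trip check looks like the sketch below. The quantizer is an illustrative stand-in (plain symmetric group-wise 4-bit), not the actual Q4_K or Marlin packing, and the group sizes of 32 and 128 are assumptions meant to mimic the mismatch between the two formats; the point is only that the extra re-quantization error is on the same order as the quantization error already present.

```python
import torch

def quant_dequant_4bit(w: torch.Tensor, group_size: int) -> torch.Tensor:
    """Symmetric group-wise 4-bit quantize + dequantize.
    Illustrative stand-in, not the real Q4_K or Marlin packing."""
    rows, cols = w.shape
    wg = w.reshape(rows, cols // group_size, group_size)
    scale = wg.abs().amax(dim=-1, keepdim=True) / 7.0   # int4 symmetric range [-7, 7]
    q = torch.clamp(torch.round(wg / scale), -7, 7)
    return (q * scale).reshape(rows, cols)

w_fp32 = torch.randn(4096, 4096)

# Stand-in for the original Q4_K weights (Q4_K uses 32-element sub-blocks).
w_q4 = quant_dequant_4bit(w_fp32, group_size=32)

# Stand-in for dequantizing to FP32 and repacking for the Marlin kernel
# (Marlin is commonly run with a group size of 128).
w_repacked = quant_dequant_4bit(w_q4, group_size=128)

print("original quantization error:", (w_q4 - w_fp32).abs().mean().item())
print("extra round-trip error:     ", (w_repacked - w_q4).abs().mean().item())
```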
GGML also provides fast matmul CUDA kernels. Why not use them, since they fit q4_k, q4_m, and so on exactly?
The challenge is that GGML’s matrix multiplication is deeply intertwined with its memory allocation system, making it difficult to extract and reuse separately.
I tried to use the GGML matmul in PyTorch. I put my code at https://github.com/Eutenacity/python-ggml-matmul; I hope this can help. But I cannot use PyTorch CUDA Graphs to capture the matmul.
Not being able to use CUDA Graphs significantly slows down generation. Let's see if we can figure this out.
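For reference, the capture pattern PyTorch expects looks roughly like the sketch below. `matmul_op` is a plain `torch.matmul` standing in for the GGML-backed op; capture typically fails at that point if the op allocates memory outside PyTorch's capture-aware caching allocator, synchronizes the device, or launches kernels on a stream PyTorch does not manage.

```python
import torch

def matmul_op(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    # Placeholder for the GGML-backed matmul; plain torch.matmul is used here
    # so the sketch actually runs. The custom op would be dropped in at this point.
    return x @ w

x = torch.randn(1, 4096, device="cuda", dtype=torch.float16)
w = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)

# Warm up on a side stream so lazy initialization (handles, workspaces,
# first-touch allocations) happens before capture begins.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        y = matmul_op(x, w)
torch.cuda.current_stream().wait_stream(s)

# Capture the op into a CUDA graph. A GGML-backed op usually breaks here if it
# allocates with its own allocator, synchronizes, or uses unmanaged streams.
g = torch.cuda.CUDAGraph()
static_x = x.clone()
with torch.cuda.graph(g):
    static_y = matmul_op(static_x, w)

# Replay: refresh the static input in place, then rerun the recorded kernels.
static_x.copy_(torch.randn_like(static_x))
g.replay()
print(static_y.shape)
```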
Sorry, I am not familiar with quantization. I found that you dequantize Q4_K to FP32 and then re-quantize to 4-bit with Marlin, so I am curious: are Marlin and Q4_K fully equivalent? Does this kind of conversion affect accuracy?