Closed deltaguo closed 9 months ago
PyTorch should automatically default to HIPifying the extension code when building on ROCm, and the `USE_ROCM` flag should be defined in the C++/CUDA sources.
As for launching the kernel, it's hard to speculate as to why it isn't running, but keep in mind it takes three template parameters:

```cpp
template<bool use_half2, bool use_groupsize, bool use_x_map>
__global__ void q4_matmul_kernel
```
`use_x_map` is normally set to false, unless you're running with the `matmul_fused_remap` flag set in the model's config. So the `x_map` pointer would normally be NULL, and the buffer wouldn't be used, because the columns in the input matrix have already been shuffled into place by the call to `column_remap_cuda`.
Thank you for your answer. What I am doing is replacing `q4_matmul_kernel` with an operator I implemented myself. My operator may have calculation errors: the erroneous output of each linear layer is passed on as the input `x` of the next one.
Hello, I am studying related work. My device is an AMD MI210. I found that the inference speed of LLaMA-13B on exllama is only about 24 t/s, and `q4_matmul_kernel` seems to have a lot of room for improvement, so I tried to use my own `my_q4_matmul_kernel` in exllama, but encountered a problem.
I simply replaced the API call and only changed the (grid, block) launch parameters outside the kernel, but found that the operator I implemented was not called during execution. If I remove `alt_stream` from the configuration, my operator is called, but when reading `x_mapped` I find that all of its elements are NaN. I'm not sure what I overlooked.
My question:
I'm really looking forward to your answer, and I would be very grateful!