turboderp / exllama

A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.
MIT License

exception about replacing the op q4_matmul_kernel #285

Closed · deltaguo closed this issue 9 months ago

deltaguo commented 9 months ago

Hello, I am studying related work. My device is an AMD MI210. I found that the inference speed of LLaMA-13B on exllama is only about 24 t/s, and q4_matmul_kernel seems to have a lot of room for improvement, so I tried to use my own my_q4_matmul_kernel in exllama, but I ran into a problem.

I simply replaced the API call and only changed the (grid, block) launch parameters outside the kernel function, but found that the operator I implemented was never called during execution. If I remove alt_stream from the configuration, my operator is called, but then an error occurs when reading x_mapped: all of the elements in x_mapped are NaN. I'm not sure what I have overlooked.

My questions:

  1. How is the CUDA code converted to HIP when the framework runs? I didn't find any settings related to hipify in the source code.
  2. If I want to hook my own operator into exllama on ROCm, what do I need to do to make it work?

I'm really looking forward to your answer; I would be very grateful!

turboderp commented 9 months ago

Torch should automatically default to HIPifying the extension code when running on ROCm. The USE_ROCM flag should be set in the C++/CUDA sources.
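For illustration only (a generic sketch, not the actual exllama source): torch's extension build hipifies the .cu/.cpp sources and defines USE_ROCM, so anything the automatic translation can't handle is guarded in the sources, along these lines:

#if defined(USE_ROCM)
#include <hip/hip_runtime.h>
// The MI210 (gfx90a) wavefront size is 64, not 32, which matters for any
// warp-level shuffles or reductions in a custom kernel.
constexpr int kWarpSize = 64;
#else
#include <cuda_runtime.h>
constexpr int kWarpSize = 32;
#endif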

As for launching the kernel, it's hard to speculate as to why it isn't running, but keep in mind it takes three template parameters:

template<bool use_half2, bool use_groupsize, bool use_x_map>
__global__ void q4_matmul_kernel

use_x_map is set to false normally, unless you're running with the matmul_fused_remap flag set in the model's config. So the x_map pointer would normally be NULL, and the buffer wouldn't be used because the columns in the input matrix have already been shuffled into place by the call to column_remap_cuda.
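For illustration (a simplified sketch rather than the exact exllama host code; the real kernel takes many more arguments), the usual pattern is that the host side picks one of the eight template instantiations per launch, and an x_map variant is only selected when a non-NULL x_map is passed:

#include <cuda_fp16.h>
#include <cstdint>

// Illustrative, simplified signature; parameter names are placeholders.
template <bool use_half2, bool use_groupsize, bool use_x_map>
__global__ void q4_matmul_kernel(const half* x, const uint32_t* w, half* out, const uint32_t* x_map)
{
    // ... body elided; each instantiation compiles a specialized variant ...
}

typedef void (*q4_matmul_kernel_fn)(const half*, const uint32_t*, half*, const uint32_t*);

// Host-side selection: one of eight instantiations per launch.
q4_matmul_kernel_fn pick_q4_matmul_kernel(bool use_half2, bool use_groupsize, const uint32_t* x_map)
{
    bool use_x_map = (x_map != nullptr);  // true only with matmul_fused_remap
    if (use_half2)
    {
        if (use_groupsize) return use_x_map ? q4_matmul_kernel<true, true, true>   : q4_matmul_kernel<true, true, false>;
        else               return use_x_map ? q4_matmul_kernel<true, false, true>  : q4_matmul_kernel<true, false, false>;
    }
    else
    {
        if (use_groupsize) return use_x_map ? q4_matmul_kernel<false, true, true>  : q4_matmul_kernel<false, true, false>;
        else               return use_x_map ? q4_matmul_kernel<false, false, true> : q4_matmul_kernel<false, false, false>;
    }
}

The returned pointer is launched like the kernel itself, e.g. fn<<<grid, block, 0, stream>>>(x, w, out, x_map).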

deltaguo commented 9 months ago

Thank you for your answer. What I am doing is replacing q4_matmul_kernel with the operator I implemented. My operator may have calculation errors; the erroneous output of each linear layer then gets passed on as the input x to the subsequent layer.