turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

[Question] `h_gemm` Implementation #180

Closed: jeromeku closed this issue 8 months ago

jeromeku commented 9 months ago

Looking at the `h_gemm` CUDA kernel here, I'm a bit confused about how the final entries of the C matrix are calculated.

At the end of the kernel, after the intermediate results have been written to the shared buffer `accum`, a final reduction step is done across dim k, which corresponds to thread block dim y. The result is stored in `c_ptr` at (m, n), which correspond to thread block dims z and x respectively.

However, within the final reduction step, won't all threads with different `threadIdx.z` but the same `threadIdx.x` store the same accumulated result (i.e., c[0][n] == c[1][n] == ... == c[M-1][n]), since the accumulated result is only a function of `threadIdx.x`?
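For concreteness, here is a minimal sketch of the pattern I mean (a simplified reconstruction with illustrative names and block shape, not the actual kernel):

```cuda
#include <cuda_fp16.h>

// Simplified reconstruction of the pattern in question; the names, block
// shape, and accum layout are illustrative, not the actual exllamav2 kernel.
#define BLOCK_X 32   // output columns (n) per block
#define BLOCK_Y 8    // k-slices per block

__global__ void h_gemm_tail_sketch(half* c_ptr, int size_n)
{
    // Partial sums, indexed only by (threadIdx.y, threadIdx.x):
    __shared__ half accum[BLOCK_Y][BLOCK_X];

    int n = blockIdx.x * BLOCK_X + threadIdx.x;      // output column
    int m = blockIdx.y * blockDim.z + threadIdx.z;   // output row

    // ... main loop (elided) writes each thread's partial dot product over
    // its k-slice into accum[threadIdx.y][threadIdx.x] ...
    __syncthreads();

    // Final reduction across the k dimension (thread block dim y):
    if (threadIdx.y == 0 && n < size_n)
    {
        half sum = accum[0][threadIdx.x];
        for (int y = 1; y < BLOCK_Y; ++y)
            sum = __hadd(sum, accum[y][threadIdx.x]);

        // The concern: sum depends only on threadIdx.x (and the block
        // indices), so threads that differ only in threadIdx.z would store
        // the same value to different rows m.
        c_ptr[m * size_n + n] = sum;
    }
}
```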

turboderp commented 9 months ago

It's not very neatly written, and I do mean to revisit it and write something a little faster. This version was just what I ended up with after some tinkering to get something at least a little faster than cuBLAS, and I just never cleaned up the code.

So, the deal is that it's always run with `blockDim.z` = 1, so `threadIdx.z` = 0 for every thread, which means the m index is constant within each thread block.

If it weren't, the shared buffer would indeed need an extra dimension.
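For example, a hypothetical host-side launch matching that description (reusing the illustrative names from the sketch above) would look like:

```cuda
// Host-side launch sketch, following the explanation above: blockDim.z is
// fixed at 1, so threadIdx.z == 0 for every thread, m is constant within a
// block, and the 2D accum buffer needs no extra z dimension.
void launch_h_gemm_tail_sketch(half* c_ptr, int size_m, int size_n)
{
    dim3 block(BLOCK_X, BLOCK_Y, 1);                 // z extent fixed at 1
    dim3 grid((size_n + BLOCK_X - 1) / BLOCK_X,      // column tiles
              size_m,                                 // one block per row of C
              1);
    h_gemm_tail_sketch<<<grid, block>>>(c_ptr, size_n);
}
```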

I've added some bindings for the function and a test script in the latest commit, if you want to play around with it. Also found a little bug in the process, although it's in the other kernel. Anyway, feedback is appreciated.

jeromeku commented 9 months ago

@turboderp

Thanks for the quick response. So each block essentially calculates one row of the matrix C, and `gridDim.y` maps to M (i.e., `blockIdx.y` == 0 -> row 0, ..., `blockIdx.y` == M-1 -> row M-1).
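Concretely, with that launch shape the row index collapses to a function of `blockIdx.y` alone (using the hypothetical names from the sketch above):

```cuda
// With blockDim.z == 1 and threadIdx.z == 0 for every thread:
int m = blockIdx.y * blockDim.z + threadIdx.z;   // == blockIdx.y
// so blockIdx.y == 0 computes row 0, ..., blockIdx.y == M-1 computes
// row M-1: exactly one row of C per thread block.
```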

In any case, I'm happy to help add a benchmarking suite for the kernels, as well as additional testing tools.