It's not very neatly written, and I do mean to revisit it and write something a little faster. This version was just what I ended up with after some tinkering to get something at least a little faster than cuBLAS, and I just never cleaned up the code.
So, the deal is that it's always run with blockDim.z = 1, so threadIdx.z = 0 for every thread, which means the m index is constant within each thread block. If it weren't, the shared buffer would indeed need an extra dimension.
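To make that concrete, here is a minimal sketch of that layout. It is not the actual h_gemm code: the tile sizes THREADS_X / THREADS_Y and the exact indexing are assumptions, chosen only to show why, with blockDim.z == 1, the m index never varies inside a block and the shared accumulator only needs a [y][x] shape.

```cuda
// Minimal sketch (not the actual h_gemm): THREADS_X / THREADS_Y and the exact
// indexing are assumptions, used only to illustrate the layout described above.
#include <cuda_fp16.h>

#define THREADS_X 32   // columns of C handled per block (assumed tile width)
#define THREADS_Y 8    // threads cooperating over the k dimension (assumed)

__global__ void h_gemm_sketch(const half* __restrict__ a,   // [M x K]
                              const half* __restrict__ b,   // [K x N]
                              half* __restrict__ c,         // [M x N]
                              int M, int N, int K)
{
    // blockDim.z == 1, so threadIdx.z == 0 for every thread in the block.
    // The row index m comes from blockIdx alone, so it is uniform per block.
    int m = blockIdx.y;
    int n = blockIdx.x * THREADS_X + threadIdx.x;

    // Each thread accumulates a partial dot product over a strided slice of k.
    float acc = 0.0f;
    if (n < N)
        for (int k = threadIdx.y; k < K; k += THREADS_Y)
            acc += __half2float(a[m * K + k]) * __half2float(b[k * N + n]);

    // Because m is constant within the block, the shared accumulator needs
    // no extra m dimension: [THREADS_Y][THREADS_X] is enough.
    __shared__ float accum[THREADS_Y][THREADS_X];
    accum[threadIdx.y][threadIdx.x] = acc;
    __syncthreads();

    // Final reduction across the k slices (the y dimension of the block).
    if (threadIdx.y == 0 && n < N)
    {
        float sum = 0.0f;
        for (int ty = 0; ty < THREADS_Y; ty++) sum += accum[ty][threadIdx.x];
        c[m * N + n] = __float2half(sum);
    }
}
```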
I've added some bindings for the function and a test script in the latest commit, if you want to play around with it. Also found a little bug in the process, although it's in the other kernel. Anyway, feedback is appreciated.
@turboderp
Thanks for the quick response. So each block is essentially calculating a row of the matrix C, i.e. gridDim.y maps to M (blockIdx.y == 0 -> row 0, ..., blockIdx.y == M - 1 -> row M - 1).
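If that mapping is right, the launch shape would look roughly like the sketch below. Again this is only a guess against the hypothetical h_gemm_sketch kernel and tile sizes from the earlier sketch, not the repo's actual launch code.

```cuda
// Hypothetical launch matching the mapping described above: one grid row per
// row of C (gridDim.y == M), blockDim.z == 1. Uses the h_gemm_sketch kernel
// and the THREADS_X / THREADS_Y tile sizes assumed in the earlier sketch.
void launch_h_gemm_sketch(const half* a, const half* b, half* c, int M, int N, int K)
{
    dim3 threads(THREADS_X, THREADS_Y, 1);                 // blockDim.z == 1
    dim3 blocks((N + THREADS_X - 1) / THREADS_X, M, 1);    // blockIdx.y selects row m
    h_gemm_sketch<<<blocks, threads>>>(a, b, c, M, N, K);
}
```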
In any case, happy to help add a benchmarking suite for the kernels as well as additional testing tools.
Looking at the h_gemm cuda kernel here, a bit confused as to how the final entry of the C matrix is calculated. At the end of the kernel, after the intermediate results have been written to the shared buffer accum, a final reduction step is done across dim k, which corresponds to thread block dim y, and the result is stored in c_ptr at m, n, which correspond to the thread block dims z and x respectively.

However, within the final reduction step, won't all threads with different threadIdx.z but the same threadIdx.x store the same accumulated result (i.e., c[0][n] == c[1][n] == ... == c[M][n]), since the accumulated result is only a function of threadIdx.x?
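For context, the value each final entry of C should end up holding is just the full dot product over k. A naive host-side fp32 reference, a hypothetical helper rather than anything from the repo's test script, makes that explicit and is enough to check the kernel's output element by element:

```cuda
// Naive fp32 reference for C = A * B, useful for checking a kernel's output
// element by element (hypothetical helper, not part of the repo's test script).
void gemm_reference(const float* a, const float* b, float* c, int M, int N, int K)
{
    for (int m = 0; m < M; m++)
        for (int n = 0; n < N; n++)
        {
            float sum = 0.0f;
            for (int k = 0; k < K; k++)
                sum += a[m * K + k] * b[k * N + n];   // full reduction over k
            c[m * N + n] = sum;                       // one distinct entry per (m, n)
        }
}
```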