siboehm / SGEMM_CUDA

Fast CUDA matrix multiplication from scratch
https://siboehm.com/articles/22/CUDA-MMM
MIT License
412 stars 53 forks

Solve bank conflict #8

Open yofufufufu opened 3 months ago

yofufufufu commented 3 months ago

In my opinion, when loading data from global memory into shared memory (i.e. writing shared memory) with vectorized access, the transposition means that threads within a warp may write to the same column of shared memory. For example, thread 0 reads A[0][0] to A[0][3] and thread 1 reads A[0][4] to A[0][7]. So thread 0 writes As[0][0] to As[3][0], and thread 1 writes As[4][0] to As[7][0]. For an As tile of size BM (=128) * BK (=8), As[0][0] and As[4][0] are on the same bank (assuming a row stride of BM floats, their offsets differ by 4 * 128 = 512 floats, a multiple of the 32 banks), causing a bank conflict. So I think bank conflicts will only occur when writing As, not Bs. But in kernels v7 and v8, it seems like you are trying to optimize the writes to Bs: https://github.com/siboehm/SGEMM_CUDA/blob/60cba6f9b20a198116c76f18de8047f44df8c8b8/src/kernels/8_kernel_bank_extra_col.cuh#L56-L60 Did I misunderstand something?
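To make the As argument concrete, here is a small host-side C++ sketch that counts how many threads of one warp hit the same bank during the transposed write. It is a model under stated assumptions, not code from the repository: 32 banks of 4 bytes, a warp of 32 threads, As stored as BK rows of BM floats, one float4 load from A per thread, and the four transposed stores issued as separate scalar writes. The names `maxAsWriteConflict`, `kWarpSize` and `kNumBanks` are hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cassert>

// Hypothetical host-side model; it only mirrors the indexing described above.
constexpr int kWarpSize = 32;  // threads per warp (assumed)
constexpr int kNumBanks = 32;  // 4-byte shared-memory banks (assumed)

// Returns the largest number of threads in the first warp whose i-th scalar
// store of the transposed As write falls into the same bank.
// Assumed layout: As holds BK rows of BM floats, i.e. As[k][m] sits at k * BM + m.
int maxAsWriteConflict(int BM, int BK) {
    assert(BM > 0 && BK > 0 && BK % 4 == 0);
    const int threadsPerRow = BK / 4;  // float4 loads needed per row of the A tile
    int worst = 0;
    for (int i = 0; i < 4; ++i) {  // the four scalar stores that unpack one float4
        std::array<int, kNumBanks> hits{};  // threads per bank, zero-initialized
        for (int t = 0; t < kWarpSize; ++t) {
            const int innerRowA = t / threadsPerRow;  // row of A read by thread t
            const int innerColA = t % threadsPerRow;  // which float4 within that row
            // Transposed destination As[innerColA * 4 + i][innerRowA], as a word offset.
            const int offset = (innerColA * 4 + i) * BM + innerRowA;
            ++hits[offset % kNumBanks];
        }
        worst = std::max(worst, *std::max_element(hits.begin(), hits.end()));
    }
    return worst;
}
```

Under these assumptions, `maxAsWriteConflict(128, 8)` evaluates to 2: threads 0 and 1 both land in bank 0, threads 2 and 3 in bank 1, and so on, which matches the 2-way conflict described above. Every thread writes a distinct address in this model, so threads sharing a bank do conflict.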

yofufufufu commented 3 months ago

Still looking forward to your reply 😿