Open FdyCN opened 9 months ago
Bank access is granted per warp. If all threads access the same location in threadgroup memory, the data transfer is fused into a single access. This should be the case across all GPU architectures. There are so many places where source code reads the same memory location from every thread that I can't imagine it not being optimized in hardware.
Any other pattern, for example different addresses in the same bank, should cause an extra stall.
As the image shows, Apple 7/8 has 16 banks and each bank is 4 B wide, while the warp size (or simdgroup size) is 32. So when each thread in a warp loads one float, the two threads [lane_id] and [lane_id + 16] always map to the same bank. In other words, is this a 2-way bank conflict?
Am I right? If not, why?
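
To make the arithmetic behind the question concrete, here is a minimal sketch of a simple bank model. It assumes the figures quoted above (16 banks, 4 B per bank, 32 lanes) and a plain modulo mapping from word address to bank. That mapping, and the names `bank_of` and `conflict_degree`, are illustrative assumptions, not documented Apple GPU behavior. Real hardware may, for instance, service a 32-wide access in two halves, which this model does not capture.

```python
from collections import defaultdict

# Figures taken from the post; the modulo bank mapping is an assumption.
NUM_BANKS = 16    # banks in threadgroup memory
BANK_WIDTH = 4    # bytes per bank
SIMD_WIDTH = 32   # lanes per simdgroup (warp)

def bank_of(byte_address):
    """Bank index under a simple word-interleaved mapping (hypothetical)."""
    return (byte_address // BANK_WIDTH) % NUM_BANKS

def conflict_degree(byte_addresses):
    """Largest number of distinct words that land in a single bank.

    Lanes reading the same word count once, modelling the fused
    (broadcast) access described above.
    """
    words_per_bank = defaultdict(set)
    for addr in byte_addresses:
        words_per_bank[bank_of(addr)].add(addr // BANK_WIDTH)
    return max(len(words) for words in words_per_bank.values())

# Each lane loads one float from consecutive addresses.
contiguous = [lane * 4 for lane in range(SIMD_WIDTH)]
# Every lane loads the same float.
broadcast = [0] * SIMD_WIDTH

print(conflict_degree(contiguous))  # 2
print(conflict_degree(broadcast))   # 1
```

Under this model, lanes `lane_id` and `lane_id + 16` share a bank for contiguous float loads, giving a degree of 2, while the all-lanes-same-address case stays at 1. Whether the degree of 2 actually costs an extra stall depends on how the hardware issues the access, which the model does not settle.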