philipturner / metal-benchmarks

Apple GPU microarchitecture
MIT License

[Question]: 100% 2-way bank-conflict on apple 7\8 when loading float data? #2

Open FdyCN opened 9 months ago

FdyCN commented 9 months ago

image

As the image shows, Apple 7/8 has 16 banks, each 4 B wide, while the warp size (or simdgroup size) is 32. So when every thread in a warp loads one float, the two threads [lane_id] and [lane_id + 16] always map to the same bank. In other words, is this always a 2-way bank conflict?

Am I right? If not, why?
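The mapping behind this question can be sketched in a few lines of Python. This is only an illustration: the 16 banks and 4 B bank width come from the figure, while the simple modulo interleaving and the helper name `bank_of` are assumptions, not documented hardware behavior.

```python
# Assumed from the figure in the question: 16 banks, each 4 bytes wide.
NUM_BANKS = 16
BANK_WIDTH_BYTES = 4
SIMD_WIDTH = 32  # threads per simdgroup (warp)


def bank_of(byte_address: int) -> int:
    """Bank index under a simple modulo interleaving (an assumption)."""
    return (byte_address // BANK_WIDTH_BYTES) % NUM_BANKS


# Each lane loads one 4-byte float from consecutive addresses.
lane_banks = [bank_of(lane * 4) for lane in range(SIMD_WIDTH)]

for lane in range(NUM_BANKS):
    # Lanes `lane` and `lane + 16` share a bank but read different addresses.
    print(lane, lane + 16, lane_banks[lane], lane_banks[lane + 16])
```

Under this assumed mapping, every pair of lanes 16 apart lands in the same bank, which is the situation the question describes.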

philipturner commented 9 months ago

Bank access is granted per warp. If all threads access the same location in threadgroup memory, the data transfer is fused. This should be the case across all GPU architectures. There are so many places where source code reads the same memory location from every thread that I can't imagine it not being optimized in hardware.

Any other pattern, for example different addresses in the same bank, should cause an extra stall.
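A rough way to reason about the rule above is to count, per bank, how many distinct words a warp requests. The sketch below is a simplified model, not measured hardware behavior: the bank count and width are taken from the question, `conflict_degree` is a hypothetical helper, and it assumes all 32 lanes are serviced in a single pass. If the hardware instead services a request in two 16-lane halves, the one-float-per-lane case would not conflict, and this thread does not settle which applies.

```python
from collections import defaultdict

NUM_BANKS = 16        # from the question's figure
BANK_WIDTH_BYTES = 4  # from the question's figure


def conflict_degree(byte_addresses):
    """Largest number of distinct 4-byte words requested from any one bank.

    1 means no conflict; N means an N-way conflict in this simplified model.
    """
    words_per_bank = defaultdict(set)
    for address in byte_addresses:
        word = address // BANK_WIDTH_BYTES
        # A set keeps one entry per word, so identical addresses are fused.
        words_per_bank[word % NUM_BANKS].add(word)
    return max((len(words) for words in words_per_bank.values()), default=0)


same_location = [0] * 32                               # every lane reads one float
one_float_per_lane = [4 * lane for lane in range(32)]  # consecutive floats
stride_16_floats = [64 * lane for lane in range(32)]   # every lane hits bank 0

print(conflict_degree(same_location))       # 1
print(conflict_degree(one_float_per_lane))  # 2
print(conflict_degree(stride_16_floats))    # 32
```

In this model, the fused same-address read costs one access, while distinct addresses in the same bank serialize.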