Closed qedawkins closed 11 months ago
I don't see a way to assign reviewers, so @antiagainst @kuhar I am posting progress here as discussed offline.
> Currently, the best configurations for each of the three strategies are in the same performance ballpark (~20us for a 4096 * 4096x4096 matvec on an AMD 7900 XTX).
That's (4096 LHS + 4096*4096 RHS + 4096*4 OUTPUT) bytes / (20 * 10^-6) s ~= 0.84 TB/s? The theoretical peak is 3.5 TB/s, so that's still quite far off; memory access is not yet optimal. We may need to dump the ISA to see if there is anything suspicious, and grab RGP traces to check.
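For reference, the bandwidth arithmetic above can be double-checked with a small script. The byte counts assume 1-byte LHS/RHS elements and 4-byte output elements, which is what the numbers in the comment imply:

```python
# Rough check of the achieved-bandwidth estimate above.
# Assumes a 4096 * 4096x4096 matvec with 1-byte LHS/RHS elements
# and 4-byte outputs, matching the byte counts in the comment.
N = 4096
lhs_bytes = N            # vector, 1 byte per element
rhs_bytes = N * N        # matrix, 1 byte per element
out_bytes = N * 4        # output, 4 bytes per element
seconds = 20e-6          # ~20us measured runtime

tb_per_s = (lhs_bytes + rhs_bytes + out_bytes) / seconds / 1e12
print(f"{tb_per_s:.2f} TB/s")  # ~0.84 TB/s vs. the quoted 3.5 TB/s peak
```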
Closing this in favor of #40. If any of the other strategies tried here seem relevant at a later point I will open a new PR on top.
This adds benchmarks for `vmt`, with very similar supporting structure to the existing `mmt` benchmark, but with different strategies tuned for matvec. This adds three strategies:

1) Treat it like a reduction with one workgroup per row, relying on the cache to get reuse of the vector.
2) Copy the vector to shared memory using all threads in the workgroup and then process N0 rows per workgroup, with WG_Y | N0 threadgroups.
3) Use a fixed number of workgroups, where each workgroup strides the full problem space. This should limit the overhead of setting up the vector in shared memory, as well as reduce scheduling overhead.
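To illustrate the partitioning in strategy 3, here is a minimal Python sketch of the grid-stride scheme. The workgroup count and row count are hypothetical values for illustration, not the tuned configuration from the benchmarks:

```python
# Sketch of strategy 3: a fixed number of workgroups, each striding
# over the full set of rows. NUM_WORKGROUPS and NUM_ROWS are
# hypothetical illustration values, not the tuned configuration.
NUM_WORKGROUPS = 8
NUM_ROWS = 4096

def rows_for_workgroup(wg_id):
    """Rows handled by one workgroup: start at wg_id, stride by grid size."""
    return list(range(wg_id, NUM_ROWS, NUM_WORKGROUPS))

# Every row is covered exactly once across all workgroups, so the vector
# only needs to be staged in shared memory once per workgroup, amortizing
# the setup cost over many rows.
covered = sorted(r for wg in range(NUM_WORKGROUPS) for r in rows_for_workgroup(wg))
assert covered == list(range(NUM_ROWS))
```

The design point is that the shared-memory copy of the vector happens once per workgroup rather than once per row, and a small fixed grid keeps dispatch/scheduling overhead down.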
Currently, the best configurations for each of the above three strategies are in the same performance ballpark (~20us for a 4096 * 4096x4096 matvec on an AMD 7900 XTX).