Closed qedawkins closed 11 months ago
I don't see a way to assign reviewers, so @antiagainst @kuhar I am posting progress here as discussed offline.
> Currently, the best configurations for each of the three strategies are in the same performance ballpark (~20us for a 4096 * 4096x4096 matvec on an AMD 7900 XTX).
That's (4096 LHS + 4096*4096 RHS + 4096*4 OUTPUT) bytes / (20 * 10^-6) s ~= 0.84 TB/s? The theoretical peak is 3.5 TB/s, so that's still quite far off; memory access is not yet optimal. We may need to dump the ISA to see if there is anything suspicious, and grab RGP traces to check.
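For reference, the bandwidth arithmetic above can be double-checked with a small script. The byte counts assume 1-byte LHS/RHS elements and 4-byte output elements, which is what the numbers in the comment imply:

```python
# Rough check of the achieved-bandwidth estimate above.
# Assumes a 4096 * 4096x4096 matvec with 1-byte LHS/RHS elements
# and 4-byte outputs, matching the byte counts in the comment.
N = 4096
lhs_bytes = N            # vector, 1 byte per element
rhs_bytes = N * N        # matrix, 1 byte per element
out_bytes = N * 4        # output, 4 bytes per element
seconds = 20e-6          # ~20us measured runtime

tb_per_s = (lhs_bytes + rhs_bytes + out_bytes) / seconds / 1e12
print(f"{tb_per_s:.2f} TB/s")  # ~0.84 TB/s vs. the quoted 3.5 TB/s peak
```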
Closing this in favor of #40. If any of the other strategies tried here seem relevant at a later point I will open a new PR on top.
This adds benchmarks for `vmt`, with very similar supporting structure to the existing `mmt` benchmark, but with different strategies tuned for matvec. This adds three strategies:

1) Treat it like a reduction with one workgroup per row, relying on the cache to get reuse of the vector.
2) Copy the vector to shared memory using all threads in the workgroup and then process N0 rows per workgroup, with WG_Y | N0 threadgroups.
3) Use a fixed number of workgroups, where each workgroup strides the full problem space. This should limit the overhead of setting up the vector in shared memory, as well as reduce scheduling overhead.
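To illustrate the partitioning in strategy 3, here is a minimal Python sketch of the grid-stride scheme. The workgroup count and row count are hypothetical values for illustration, not the tuned configuration from the benchmarks:

```python
# Sketch of strategy 3: a fixed number of workgroups, each striding
# over the full set of rows. NUM_WORKGROUPS and NUM_ROWS are
# hypothetical illustration values, not the tuned configuration.
NUM_WORKGROUPS = 8
NUM_ROWS = 4096

def rows_for_workgroup(wg_id):
    """Rows handled by one workgroup: start at wg_id, stride by grid size."""
    return list(range(wg_id, NUM_ROWS, NUM_WORKGROUPS))

# Every row is covered exactly once across all workgroups, so the vector
# only needs to be staged in shared memory once per workgroup, amortizing
# the setup cost over many rows.
covered = sorted(r for wg in range(NUM_WORKGROUPS) for r in rows_for_workgroup(wg))
assert covered == list(range(NUM_ROWS))
```

The design point is that the shared-memory copy of the vector happens once per workgroup rather than once per row, and a small fixed grid keeps dispatch/scheduling overhead down.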
Currently, the best configurations for each of the above three strategies are in the same performance ballpark (~20us for a 4096 * 4096x4096 matvec on an AMD 7900 XTX).