In short, when upcasting operand A in registers prior to WGMMA and when pipelining is enabled, `AsyncCopyGlobalToLocal`'s src gmem blocked encoding will have `sizePerThread` > the smem view's `vec` (along the contiguous dimension). This causes multiple `cp.async` instructions to be generated for a single contiguous global data segment, resulting in uncoalesced loads. This was previously confirmed in ncu. See above comment for an example.
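To illustrate the splitting (a sketch with hypothetical sizes, not the exact values from this kernel): when `sizePerThread` along the contiguous dimension exceeds the smem view's `vec`, the lowering has to break each thread's segment into several `cp.async` copies instead of one wide access.

```python
# Illustrative sketch (hypothetical sizes): each cp.async moves `vec`
# contiguous elements, so one thread's contiguous segment of
# `size_per_thread` elements is split into size_per_thread // vec copies.

def num_cp_async_per_segment(size_per_thread: int, vec: int) -> int:
    """Number of cp.async instructions emitted for one thread's
    contiguous segment, assuming each copy moves `vec` elements."""
    assert size_per_thread % vec == 0
    return size_per_thread // vec

# With the upcast-in-registers layout, the blocked encoding can keep a
# wide sizePerThread while the smem view's vec stays small:
print(num_cp_async_per_segment(8, 2))  # 4 copies for one segment -> uncoalesced
print(num_cp_async_per_segment(8, 8))  # 1 copy per segment -> coalesced
```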
I've added a generalized fix in a new pass that runs after the pipeliner. It reuses the logic from the LLVM lowering of `AsyncCopyGlobalToLocal` to calculate the max contiguous copy size, and compares that to the blocked encoding's `sizePerThread` along the inner (contiguous) dimension. If the former is less than the latter, the pass sets the latter to the former.
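The clamping step can be sketched as follows (a minimal illustration of the rule described above, not the actual C++/MLIR implementation; the real max-copy-size computation lives in the LLVM lowering):

```python
# Sketch of the pass's core rule: shrink the blocked encoding's
# sizePerThread along the inner (contiguous) dim so it never exceeds
# the max contiguous copy size supported by the smem view.

def clamp_inner_size_per_thread(size_per_thread: int, max_copy_size: int) -> int:
    """If the max contiguous copy size is smaller than sizePerThread,
    set sizePerThread to the copy size; otherwise leave it unchanged."""
    if max_copy_size < size_per_thread:
        return max_copy_size
    return size_per_thread

print(clamp_inner_size_per_thread(8, 2))    # clamped: 2
print(clamp_inner_size_per_thread(16, 16))  # no-op (the m-major case): 16
```

With matching sizes the pass is a no-op, which is why only the k-major layout sees a change.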
When A is k-major, I can verify a small perf improvement and that ncu no longer reports uncoalesced loads.
When A is m-major, this pass is a no-op because `copy size == sizePerThread == 16`.
This is a follow-up to the dotOp hoisting optimization for WGMMA (MMAv3). See https://github.com/triton-lang/triton/pull/5003#discussion_r1841730886
ptal, thanks @ThomasRaoux