[NVIDIA][Backend] Add CoalesceAsyncCopy Pass for in-DotOpEnc Upcasting

This is a follow-up to the dotOp hoisting optimization for WGMMA (MMAv3). See https://github.com/triton-lang/triton/pull/5003#discussion_r1841730886

In short, when upcasting operand A in registers prior to WGMMA and when pipelining is enabled, AsyncCopyGLobalToLocal's src gmem blocked encoding will have sizePerThread > smem view's vec (along the contiguous dimension). This will resulting in multiple cp.async instructions being generated for a contiguous global data segment, resulting in uncoalesced loads. This was previously confirmed in ncu. See above comment for an example.

I've added a generalized fix in a new pass after the pipeliner. I've reused the logic in the LLVM lowering for AsyncCopyGlobalToLocal to calculate the max contiguous copy size. I compare that to the blockEnc's sizePerThread along the inner (contiguous) dimension. If the former is less than latter, I set the latter to former.

When A is k-major, can verify a small perf improvement and that ncu no longer reports uncoalesced loads. When A is m-major, this pass is a no-op because copy size == sizePerThread == 16

ptal, thanks @ThomasRaoux

triton-lang / triton

[NVIDIA][Backend] Add CoalesceAsyncCopy Pass for in-DotOpEnc Upcasting #5222