Request description

Our current codegen generates sub-optimal indexing code sequence for matmul on GPU as the a part of the code sequence could be hoisted out of the loop by doing operations in a different order.

for instance after optimizations we end up with an IR like:

%138:43 = scf.for %arg0 = %c0 to %c2560 step %c32 iter_args(...) {
    %266 = arith.cmpi slt, %arg0, %c2496 : index
    %267 = nvgpu.ldmatrix %alloc_0[%arg33, %79, %82] {numTiles = 4 : i32, transpose = false} : memref<3x128x32xf16, #gpu.address_space<workgroup>> -> vector<4x2xf16>

Here only %arg33 is loop variant most of the indexing could be loop invariant but we end up with this code in ptx:

        shl.b64         %rd179, %rd321, 12;
        add.s64         %rd180, %rd179, %rd32;
        add.s64         %rd181, %rd180, %rd8;
        shl.b64         %rd182, %rd181, 1;
        add.s64         %rd185, %rd110, %rd182;
        ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r117, %r118, %r119, %r120}, [%rd185+2048];

While we could be have only 1 add and 1 shl since only the most outer index is different.

To reproduce you can compile:

func.func @matmul() -> tensor<2560x2560xf16> {
  %lhs = util.unfoldable_constant dense<1.0> : tensor<2560x2560xf16>
  %rhs = util.unfoldable_constant dense<0.4> : tensor<2560x2560xf16>
  %c0 = arith.constant 0.0 : f16
  %init = tensor.empty() : tensor<2560x2560xf16>
  %CC = linalg.fill ins(%c0 : f16) outs(%init : tensor<2560x2560xf16>) -> tensor<2560x2560xf16>
  %D = linalg.matmul ins(%lhs, %rhs: tensor<2560x2560xf16>, tensor<2560x2560xf16>)
                    outs(%CC: tensor<2560x2560xf16>) -> tensor<2560x2560xf16>
  return %D : tensor<2560x2560xf16>
}

With this command line:

 ../iree-build/tools/iree-compile ~/matmul_f16.mlir  -iree-hal-target-backends=cuda --iree-hal-cuda-llvm-target-arch=sm_80 -o matmul.vmfb -iree-codegen-llvmgpu-use-mma-sync

What component(s) does this issue relate to?

No response

Additional context

No response

High level comment: The complexity of the IR that we generate is scary. Most (if not all) is simplified away by the LLVM optimizers, but this makes understanding what is going on difficult.

For instance, we have a few maps in affine.apply with (sub-)expressions looking like a floordiv cst. In these maps, we know that a is >= 0 (a is a GPU thread id, laneid, etc.). However this information is only discovered later within LLVM when looking at the intrinsics (or value ranges that are provided as metadata.)

The result is these a floordiv cst get lowered into a complicated sequence that looks like:

a floordiv b =
  let negative = a < 0 in
  let absolute = negative ? -a - 1 : a in
    let quotient = absolute / b in
      negative ? -quotient - 1 : quotient

I.e., bunch of math plus two selects.

Whereas we could lower that/optimize that directly to:

a floordiv b = a / b

That said, there's always a tradeoff of how much we clean-up the IR before the backend vs. rely on the backend to do the clean-ups.

Anyhow, that doesn't help the problem in hands :).

Strongly agree that what we generate is scary. "LLVM will save us" only works so long as we are going through LLVM and is currently killing us on emitc/VMVX/SPIR-V/MSL. The approach also slows down our compilation as we're pushing a lot more IR through MLIR and giving LLVM a lot more garbage to handle - our outputs are also bloated as I haven't seen LLVM do a good job of handling things and much of our LLVM CPU code is spent indexing. We really need some solutions earlier on before we reach LLVM and where we can guarantee they happen instead of hoping LLVM can fix stuff.

I am confused why this and #12480 and anything around lowering memrefs is considered such a hard problem (at least it seems to be given that it's been a problem in IREE for quite some time)?

The solutions seem pretty mundane to me:

add an interface / refactor to make the upstream lowering of memref to some configurable ABI more pluggable,
add canonicalization/folding patterns where needed to eagerly canonicalize/avoid producing the IR we don't want.

Did usage within IREE find some particularly hard problems or is it just a case of prioritization?

What am I missing?

Adding a quick data point: The actual shl, add, etc. that are surviving codegen come from iree-convert-to-nvvm. %351 is loop variant, other stuff (%367, %371) comes from the affine map lowering:

%374 = llvm.mul %351, %c1024 : i64
%376 = llvm.mul %367, %c32 : i64
%377 = llvm.add %374, %376 : i64
%378 = llvm.add %377, %371 : i64
%379 = llvm.getelementptr %aligned_base_ptr[%378] : (!llvm.ptr<f16, 3>, i64) -> !llvm.ptr<f16, 3>
%380 = nvvm.ldmatrix %379 {layout = #nvvm.mma_layout<row>, num = 4 : i32} : (!llvm.ptr<f16, 3>) -> !llvm.struct<(i32, i32, i32, i32)>

@nicolasvasilache to answer your question, I don't know yet. I'm figuring out what we do and why it goes south :).

Maybe the solution is simple, but I haven't root-caused the issue yet, so I don't know.

Here are my findings so far.

The TL;DR is we should lower the address computations (including the gep) so that the loop invariant part is isolated in its expression subtree instead of being spread around. E.g., do loopVariant + (inv1 + inv2) instead of (loopVariant + inv1) + inv2). Moreover we probably want to make the indices produced by the GPU pipeliner easier to understand with respect to loop strength reduction.

Details

Context

Again the surviving math comes from iree-convert-to-nvvm. The input to that pass is:

    %91 = nvgpu.ldmatrix %alloc_0[%85, %37, %41] {numTiles = 4 : i32, transpose = false} : memref<4x32x32xf16, #gpu.address_space<workgroup>> -> vector<4x2xf16>

Where %85 is loop variant and %37 and %41 are not.

This gets lowered to:

Loop dependent input: %384 // Offset for dim 0
          %391 = llvm.mlir.constant(1024 : index) : i64 // Constant for stride dim 0: 32x32
          %392 = llvm.mul %384, %391  : i64 // Compute offset for dim 0
          %395 = llvm.add %392, %394  : i64 // Add offset from dim 1 (computed outside the loop)
          %396 = llvm.add %395, %212  : i64 // Add offset from dim 2
          %397 = llvm.getelementptr %390[%396] : (!llvm.ptr<f16, 3>, i64) -> !llvm.ptr<f16, 3> // Base ptr + fully resolved offset
                      // %390 gets resolved to `ptr addrspace(3) getelementptr ([0 x i8], ptr addrspace(3) @__dynamic_shared_memory__, i32

And then produces the following codegen:

A:        shl.b64         %rd99, %rd147, 10; // Compute offset for dim 0
B:        add.s64         %rd100, %rd99, %rd11; // Add offset from dim 1
C:        or.b64          %rd101, %rd100, %rd10; // Add offset from dim 2
D:        shl.b64         %rd102, %rd101, 1; // Apply gep scaling factor (f16: 2Bytes)
E:        add.s64         %rd104, %rd53, 2048; // Offset from the base; loop invariant
F:        add.s64         %rd105, %rd104, %rd102; // Do base + offset
G:        ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r16, %r17, %r18, %r19}, [%rd105];

Forget about << 10 for a minute (a.k.a. mul ..., 1024, a.k.a. %392). The whole codegen is a direct translation of the mlir code: B == %395, C == %396, D-F == %397. I.e., the backend was not able to extract the loop invariant part from the computation.

Loop invariant computation not hoisted

Generally speaking llvm is able to expose the loop invariant part of an expression and hoist them out, so why is it not happening here?

The issue is the expression (loopVariant + inv1) + inv2 doesn't get reassociated into loopVariant + (inv1 + inv2) because the subexpression loopVariant + inv1 is used more than once (this is the "base" address for both the first ldmatrixs). The rationale is if you reassociate, you may duplicate the computation, so it is not guaranteed to be beneficial and the optimization just bails on these cases. I've hacked into the optimization to still do the optimization in these cases (see https://github.com/iree-org/iree-llvm-fork/commit/9dda6f18e6429769b1d07b42e85bc8aa7d6d7b1b) (you'll need to run with -reassociate-use-score=false and -reassociate-make-single-use=true to enable the hack). This allows us to replace B and C with just one add in the loop.

If you're paying attention you may wonder why that fix is not taking care of E and F too. When we reach reassociate, D-F is still a monolithic gep %base, %offset, hence the base offset 2048 and %base don't appear in the reassociatable expression. If we were to expand the gep beforehand, the reassociation for base + 2048 (E and F) and the hoisting would happen (but that may screw up addressing mode matching.) Ideally we would like all the loop invariant stuff to be added with just one operation and even with that we still won't be there. Indeed the scaling factor applied on the offset of feeding the gep would not be reassociatable. I.e, if we were to expand the gep beforehand we would end up with:

((loopVarOffsetDim0 + invOffsetDim1) + invOffsetDim2) * 2 + invBase

The * 2 (or shl ..., 1) needs to be distributed to be able to move all the invariant expressions in the same subtree:

loopVarOffsetDim0 * 2 + ((invOffsetDim1) + invOffsetDim2) * 2 + invBase)

Generally speaking that's not beneficial to distribute:

From reassociate alone, we trade two additions and 1 shift with two additions and 2 shifts (i.e., we add one shift)
Now if you take LICM into account, we trade 1 inside loop addition, 1 outside loop addition and 1 inside loop shift, with 1 inside loop addition, 1 outside loop addition, 1 outside loop shift and 1 inside loop shift. (So we add 1 outside loop shift for apparently no reason) The only reason this is beneficial is because this exposes reassociating and hoisting one more add.

Long story short, what we want to do is:

Reorder the index computation so that all the loop invariant values are on the same side of the expression tree
Expand the gep early so that we see it as part of the index computation

Loop strength reduction not happening

Going back to the << 10 that we do for each loop iteration. The issue is that our loop variant part is not what the loop strength reduce optimization call a recursive expression. Without going into too much details the issue is these loop variants are not expressed from the induction variable. This happens because of the GPU pipelining that we are doing. Instead of having something like i-th access is base + i-th * 1024, we rotate through different variable. In a nutshell, we have 4 "iterators" and we rotate through them. We do that by describing the accesses like so:

A = phi (0, B)
B = phi (1, C)
C = phi (2, D)
D = phi (3, some_math(IV)  mod 4)

Because of this representation, LLVM just doesn't recognize that A, B, C, and D accesses can be expressed through the IV.

We probably want to do something like:

A = phi (0, some_math(IV) + 1  mod 4)
B = phi (1, some_math(IV) + 2  mod 4)
C = phi (2, some_math(IV) + 3 mod 4)
D = phi (3, some_math(IV)  mod 4)

I.e., have IV in the expression.

@ThomasRaoux for the record if I disable the pipeliner, we get nicer patterns, hence I believe we need to do something on that front:

        add.s64         %rd77, %rd10, %rd68;
        shl.b64         %rd78, %rd77, 1; 
        add.s64         %rd79, %rd73, %rd78; // <- we're missing the expand of gep early to hoist that
        ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r13, %r14, %r15, %r16}, [%rd79];

To reproduce, on top of https://github.com/openxla/iree/pull/12539 (which adds -iree-codegen-llvmgpu-use-gpu-pipeliner=false), run:

iree-compile <input>.mlir  -iree-hal-target-backends=cuda --iree-hal-cuda-llvm-target-arch=sm_80 -o matmul.vmfb -iree-codegen-llvmgpu-use-mma-sync -reassociate-make-single-use -reassociate-use-score=false -iree-codegen-llvmgpu-use-gpu-pipeliner=false

Thanks Quentin, nice analysis!

I am looking at your description for the lack of hoisting and I believe there are some relatively easy potential wins by ordering the expressions to avoid creating a dependent chain. Similarly the gep could always go through a bitcast to i8* without much difficulty.

The pipelining front is separable and you seem to already have a potential solution

One thing I am not clear about is whether your last experiment demonstrates that we only need to fix the pipelining modulus dependence chain and then the rest falls out of the tree automatically even in the current infra (minus the gep part) ?

Thanks Quentin, nice analysis!

I am looking at your description for the lack of hoisting and I believe there are some relatively easy potential wins by ordering the expressions to avoid creating a dependent chain. Similarly the gep could always go through a bitcast to i8* without much difficulty.

The pipelining front is separable and you seem to already have a potential solution

One thing I am not clear about is whether your last experiment demonstrates that we only need to fix the pipelining modulus dependence chain and then the rest falls out of the tree automatically even in the current infra (minus the gep part) ?

Ah yeah, no sorry, the last experiment was with all the LLVM hacks turned on (albeit because of the gep part we actual just remove one add from the loop). The pipeline part is only about getting rid of the << 10 (or << 11 when the scaling gets merge). When we get to the point where this is the only thing left to fix I would be happy. Don't know if what I suggested will be enough though :).

Quick update here.

TL;DR We issue nicer address indexing (in the prototype), but still no perf improvement.

Nicolas implemented some decomposition of affine.applys based on which loop a value is defined https://reviews.llvm.org/D145685. I've plugged that in for ldmatrix in IREE (in a hacky way https://github.com/openxla/iree/pull/12539/commits/134298d6c52484865a9d43c4bae4c644c0ffd688) and it got the hoisting we wanted from the reassociation but we're still missing the hoisting for breaking down the geps. (See below for the actual asm.)

Note: I had to disable LLVM's reassociation because it was pushing back down the values in the expression tree and undoing the nice ordering. Something to look closer. I'll file an issue on that.

We also need to think about plugging some distribution otherwise we won't be able to hoist the full loop invariant expression. (E.g., (loopVarOffsetDim0 + invOffsetDim2) * 2 + invBase vs. loopVarOffsetDim0 * 2 + ((invOffsetDim1 * 2 + invBase).

Anyhow on this particular reproducer, there is zero performance impact. Looking at the ncu profile, this is not surprising. The HW is left idle (if I understand the report correctly) for a significant portion of the time. My take on the report is we're waiting on the loaded values to become available before we can do anything. So we need to increase occupancy to increase the number of warps we can cycle through while waiting. I'm speculating at this point.

On the occupancy front, that may be difficult to make some improvement because what limits the occupancy is our usage of shared memory. I.e., even if we were to switch to 32-bit arithmetic instead of 64-bit for the address computation to help reduce the register pressure that will not improve occupancy.

ncu report

On the occupancy being limited by our usage of shared memory This kernel's theoretical occupancy (50.0%) is limited by the required amount of shared memory.
On the HW being idle Every scheduler is capable of issuing one instruction per cycle, but for this kernel each scheduler only issues an instruction every 3.2 cycles. This might leave hardware resources underutilized and may lead to less optimal performance. Out of the maximum of 16 warps per scheduler, this kernel allocates an average of 7.71 active warps per scheduler, but only an average of 0.65 warps were eligible per cycle. Eligible warps are the subset of active warps that are ready to issue their next instruction. Every cycle with no eligible warp results in no instruction being issued and the issue slot remains unused. To increase the number of eligible warps, avoid possible load imbalances due to highly different execution durations per warp.

ASM generated with the prototype

        shl.b64         %rd104, %rd149, 10;
        add.s64         %rd105, %rd11, %rd104;
        shl.b64         %rd106, %rd105, 1;  # Missing distribution
        add.s64         %rd108, %rd63, 2048; # Missing gep breakdown and reassociate
        add.s64         %rd109, %rd108, %rd106;
        ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r14, %r15, %r16, %r17}, [%rd109];
        add.s64         %rd110, %rd12, %rd104;
        shl.b64         %rd111, %rd110, 1;
        add.s64         %rd112, %rd108, %rd111; # Missing gep breakdown and reassociate
        ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r18, %r19, %r20, %r21}, [%rd112];
        add.s64         %rd113, %rd13, %rd104;
        shl.b64         %rd114, %rd113, 1;
        add.s64         %rd115, %rd63, %rd114; # Missing gep breakdown and reassociate
        ldmatrix.sync.aligned.m8n8.x4.trans.shared.b16 {%r22, %r23, %r24, %r25}, [%rd115+10240];
        ldmatrix.sync.aligned.m8n8.x4.trans.shared.b16 {%r26, %r27, %r28, %r29}, [%rd115+11264];

@qcolombet Checking if this is still a P1?

iree-org / iree

Improve indexing code sequence to hoist out the loop #12517