qcolombet commented 1 year ago

What happened?

When compiling for cuda with --iree-codegen-llvmgpu-enable-transform-dialect-matmul-tensorcore-strategy and --td-matmul-strategy-use-mma-sync I run into miscompiles.

Steps to reproduce your issue

Using the following IR:

module {
  func.func @fill_matmul_static(%arg0: tensor<3452x2044xf32>, %arg1: tensor<2044x1024xf32>) -> tensor<3452x1024xf32> {
    %cst = arith.constant 0.000000e+00 : f32
    %0 = tensor.empty() : tensor<3452x1024xf32>
    %1 = linalg.fill ins(%cst : f32) outs(%0 : tensor<3452x1024xf32>) -> tensor<3452x1024xf32>
    %2 = linalg.matmul ins(%arg0, %arg1 : tensor<3452x2044xf32>, tensor<2044x1024xf32>) outs(%1 : tensor<3452x1024xf32>) -> tensor<3452x1024xf32>
    return %2 : tensor<3452x1024xf32>
  }
}

Compile with:

iree-compile --iree-hal-target-backends=cuda --iree-hal-cuda-llvm-target-arch=sm_80 --td-matmul-strategy-use-mma-sync  --iree-codegen-llvmgpu-enable-transform-dialect-matmul-tensorcore-strategy -o <out>.vmfb <in>.mlir

Run:

iree-run-module --function=fill_matmul_static --device=cuda  --module=<out>.vmfb --input=3452x2044xf32=1 --input=2044x1024xf32=1

Results:

3452x1024xf32=[2012 2012 2012 2012 2012 2012 ...

Expected results:

3452x1024xf32=[2044 2044 2044 2044 2044 2044 ...

I get the expected results if I do one of these things on the iree-compile command line:

Disable the td strategy all together: --iree-codegen-llvmgpu-enable-transform-dialect-matmul-tensorcore-strategy=false (hence, instead of --iree-codegen-llvmgpu-enable-transform-dialect-matmul-tensorcore-strategy=true)
Disable mma sync: --td-matmul-strategy-use-mma-sync=false (hence, instead --td-matmul-strategy-use-mma-sync=true)
Disable the async copies: -td-matmul-strategy-use-async-copies=false

What component(s) does this issue relate to?

No response

Version information

No response

Additional context

No response

nicolasvasilache commented 1 year ago

Ok so this seems to hint at some issue with async copies / pipelining with mma_sync. This could also be related to the too large performance drop we see from v128 to v4 mentioned in #13227

allieculp commented 1 year ago

@mattwalsh for visibility, might need to bring this into our syncs this week.

allieculp commented 1 year ago

@mattwalsh @nicolasvasilache Is this a priority given the conversation today about unaligned matmul work? Or should we deprioritize for now?

nicolasvasilache commented 1 year ago

I suspect that if we have an issue with the pipelining, this could also affect the aligned cases. However I have not investigated this particular issue to have a strong opinion here.

allieculp commented 1 year ago

@manishucsd to evaluate priority of this and any issues with aligned cases.

qcolombet commented 1 year ago

Quick update here, the problem seems to be around the pipelining of the mma instructions. If I use a pipeline depth of 1 (i.e., no pipelining) the output is correct. If I use a pipeline depth of 2, the output is missing 16. If I use a pipeline depth of 3, the output is missing 32 (16 x 2).

My guess is we are missing to flush the pipeline at the end and for the aligned case it works because there's nothing remaining in flight.

Note: The pipeline depth is controlled with -td-matmul-strategy-pipeline-depth=<depth>.

qcolombet commented 1 year ago

Side note: The iree.pipeline_shared_memory_copies transform operation uses loadGlobalStage0 as the scheduling strategy and there is no way in the transform dialect to change that. Just noting it here because I was expecting that we would use the nvidiaTensorCore strategy here.

It doesn't seem to be as easy as changing the default to get to use that in the unaligned case (I get an empty schedule because some of the preconditions don't match.)

qcolombet commented 1 year ago

One final data point for today, if I enable the peeling of the epilogue (flip the boolean to true on this line https://github.com/openxla/iree/blob/f396c05541bc7b81faa432ede07e165371f34c40/compiler/src/iree/compiler/Codegen/LLVMGPU/TransformExtensions/LLVMGPUExtensions.cpp#L767), I still get incorrect outputs but they are only incorrect by 4 * (pipeline_depth - 1).

Anyhow, I'll look closer at the actual codegen.

qcolombet commented 1 year ago

I think I found why the pipelining is wrong but I still don't have the root cause.

The multibuffering doesn't multi buffer one of the buffers (the one that feeds the ldmatrix instructions) and as a result we don't have enough memory to effectively pipeline anything on that front.

This can be seen in the IR after multibuffering:

%121 = nvgpu.ldmatrix %subview_4[%16, %17] {numTiles = 4 : i32, transpose = false} : memref<?x16xf32, strided<[16, 1], offset: ?>, #gpu.address_space<workgroup>> -> vector<4x1xf32>

The outer dim here should start with the pipeline depth whereas here it is just flat (i.e., we should have <depthx?x16xf32> but we have <?x16xf32>).

The second thing that is problematic here (and I believe why multibuffering doesn't do anything on this buffer) is the fact that the ldmatrix access doesn't depend on the induction variable.

I had a quick look at the aligned matrix lowering and the IV is added during "iree-llvmgpu-tile-and-distribute" on the memory accesses that will be pipelined.

For comparison with the aligned matrix code, run the following command line:

iree-compile <aligned>.mlir   --iree-hal-target-backends=cuda --iree-hal-cuda-llvm-target-arch=sm_80 --iree-codegen-llvmgpu-use-mma-sync

With the following IR:

module {
  func.func @fill_matmul_static(%arg0: tensor<3456x2048xf32>, %arg1: tensor<2048x1024xf32>) -> tensor<3456x1024xf32> {
    %cst = arith.constant 0.000000e+00 : f32
    %0 = tensor.empty() : tensor<3456x1024xf32>
    %1 = linalg.fill ins(%cst : f32) outs(%0 : tensor<3456x1024xf32>) -> tensor<3456x1024xf32>
    %2 = linalg.matmul ins(%arg0, %arg1 : tensor<3456x2048xf32>, tensor<2048x1024xf32>) outs(%1 : tensor<3456x1024xf32>) -> tensor<3456x1024xf32>
    return %2 : tensor<3456x1024xf32>
  }
}

qcolombet commented 1 year ago

Multibuffering doesn't kick in on one of the buffer because we have a use outside the loop:

  // Subview outside the loop.
%subview_4 = memref.subview %alloc_2[%6, 0] [%8, 16] [1, 1] : memref<128x16xf32, #gpu.address_space<workgroup>> to memref<?x16xf32, strided<[16, 1], offset: ?>, #gpu.address_space<workgroup>>
%22:32 = scf.for %arg0 = %c0 to %c2044 [...]
  %120 = nvgpu.ldmatrix %subview_4[%16, %17]...

If I manually put this subview inside the loop, multibuffering applies to this buffer just fine. This subview is hoisted out by the extract-address-computations transform. If I disable this transform, I get the correct results.

I think we have two problems here:

Correctness: we shouldn't apply the pipelining if any of the buffers involved in a given loop hasn't been multibuffer'ed. I have no idea how we could do that in the transform dialect currently. (@nicolasvasilache, @ftynse for reference.)
Performance: we need to have a more robust way of multibuffering/pipelining loops. Let's say we fix the correctness issue mentioned previously, then we end up with a loop that is not pipeline-able and the performance will be bad.

I think the aligned matmul strategy gets away with the subview-outside-of-the-loop problem because it runs the different transformations in the opposite order: multi-buffering -> extract-address-computations.

If I use this order for the transform dialect strategy I get the correct codegen because I hide the problem away. In other words, if for any other reason the multibuffering doesn't apply in the future, then the piplelining will miscompile again.

qcolombet commented 1 year ago

Here is the workaround I used locally https://github.com/openxla/iree/pull/13718. If we land that we should be able to turn the td tensor core strategy on, but I would rather that we first fix the pipelining to not produce incorrect code.

ftynse commented 1 year ago

Correctness: we shouldn't apply the pipelining if any of the buffers involved in a given loop hasn't been multibuffer'ed. I have no idea how we could do that in the transform dialect currently. (@nicolasvasilache, @ftynse for reference.)

Currently, multibuffering transform takes an allocation handle and returns an allocation handle, there is no connection to the loop. We may want to change this or introduce a different transform (and lower within transform dialect, yay!) that tries to multibuffer all allocations in the given loop. If it can't, it would fail and/or return an empty handle to loops so the following pipelining doesn't apply. A version of this is possible to express, although clunkily:

// This ensures that failure to multibuffer doesn't stop the
// rest of the ransformation.
transform.sequence failures(suppress) {
  // This ensures that failure to multibuffer prevents pipelining by
  // immediately propagating it to the caller.
  transform.sequence failures(propagate) {
    %loop = ...
    %allocs = transform.structured.match ops{["memref.alloc"]} in %loop
    transform.memref.multibuffer %allocs
    transform.loop.pipeline %loop
  }
}
// continuing to transforms

Alternatively, we can modify the pipelining transform to fail if it wouldn't be valid. I don't know if there is a sufficiently easy check to implement on the payload IR, but having such a check sounds like a good idea in general. (In theory, we could have transform generating temporarily invalid IR as long as it is made valid by a further transform).

qcolombet commented 1 year ago

Filed https://github.com/llvm/llvm-project/issues/62885 to make multibuffering smarter.

nicolasvasilache commented 1 year ago

This subview is hoisted out by the extract-address-computations transform. If I disable this transform, I get the correct results.

Nice catch, I indeed danced around with fold_memref_aliases and extract_address_computations to work around these issues a while back in the aligned case. The transformations to remedy the issue are indeed available but the offending transformation (async copy pipelining) itself is still a footgun.

we shouldn't apply the pipelining if any of the buffers involved in a given loop hasn't been multibuffer'ed. I have no idea how we could do that in the transform dialect currently

The TD aspect here is orthogonal, the implementation of pipelining should be more robust and fail conservatively.

Currently, multibuffering transform takes an allocation handle and returns an allocation handle, there is no connection to the loop. We may want to change this or introduce a different transform (and lower within transform dialect, yay!) that tries to multibuffer all allocations in the given loop.

I am unclear that it is easy to distinguish between different failures to multi-buffer but I like the trick with suppress/propagate nesting to handles failures more gracefully. The issue is that the first transform has to fail so maybe we want the alternatives here.. I have not yet played with that construct myself.

nicolasvasilache commented 1 year ago

Added a comment in #13718 after trying locally, I am afraid this won't fly as it actually disables all the important transformations.

Digging deeper, the issue I am seeing is that fold_memref_aliases does not fold subview with ldmatrix. IIRC you added ExtractAddressComputations for this op, you could try to also add this for the folding case.

You could look at https://gist.github.com/nicolasvasilache/841404d4b7c2c5ed92c7be803ef9f415#file-gistfile1-txt-L102 where I commented out the tail of the transform produced from IREE HEAD.

// This first subview does not need to be folded into consuming vector.store as these are hoisted and not pipelined.
// Still the folding into vector.store is currently missing.
%subview = memref.subview %alloc[%7, %8]
// This second subview needs to be folded into consuming nvgpu.ldmatrix, this currently does not happen.
// The folding into ldmatrix is currently missing.
%subview_4 = memref.subview %alloc_2[%7, 0]
// This alloc is properly folded and will be MB'ed 3 way. 
%alloc_3 = memref.alloc() {alignment = 64 : i64} : memref<16x128xf32, #gpu.address_space<workgroup>>

We should implement these 2 extra fold_memref_aliases, should be quite easy to extend now, this will also be useful in other contexts.

Still, improving multi-buffering so that it "sees" through subviews is likely the best scenario long term (coupled with conservative copy-async failure).

qcolombet commented 1 year ago

Added a comment in #13718 after trying locally, I am afraid this won't fly as it actually disables all the important transformations.

Interesting, I could have swear the transformations were happening just fine.

Digging deeper, the issue I am seeing is that fold_memref_aliases does not fold subview with ldmatrix.

Yes, that’s another option. I decided against it because the aligned case didn’t need it and works fine. I’m happy to go down that road, that’s generally helpful. (There may be some dependency issue, I don’t remember if it’s okay for fold memref to depend on nvgpu.)

qcolombet commented 1 year ago

Yes, that’s another option. I decided against it because the aligned case didn’t need it and works fine. I’m happy to go down that road, that’s generally helpful. (There may be some dependency issue, I don’t remember if it’s okay for fold memref to depend on nvgpu.)

Filed https://github.com/llvm/llvm-project/issues/62906 for that.

nicolasvasilache commented 1 year ago

13811 is another instance of this issue in the aligned case, as suspected.

nicolasvasilache commented 1 year ago

https://reviews.llvm.org/D151412 and related foldings are part of the solution here: with these, multi-buffering operates as expected and the pipelining does not generate incorrect code.

Still, pipelining is too brittle, irrespective of transform dialect.

allieculp commented 1 year ago

Lowering to P2 as we have a workaround in place.

qcolombet commented 1 year ago

Here is the iree-opt reproducer for this specific issue.

I've attached two input IRs:

succeeded-multi-buffer.mlir.txt is the input of the pipelining optimization when multi-buffering works
failed-multi-buffer.mlir.txt is the input of the pipelining optimization when multi-buffering fails

The gist of it is one of the buffer is missing the proper space to effectively allow multi staging buffering. In this particular instance, this is alloc_2: observe how it is missing the 3x dimension in the memref definition (by comparing the input IRs).

Now, since alloc_2 is used both for loading (through ldmatrix) and storing (through async_copy) within the main loop, this is incorrect to pipeline this loop without pipelining the accesses to this buffer (which we cannot do since the allocation is too small).

The probably easiest thing to do here would be to not pipeline when we cannot prove that this is safe. Longer term, I think it may make sense to only do the multi buffering when we are sure we will apply the pipelining.

To reproduce (using td-reproducer.mlir.txt)

Good codegen:

iree-opt succeeded-multi-buffer.mlir  --pass-pipeline='builtin.module(hal.executable(hal.executable.variant(iree-llvmgpu-lower-executable-target)))' --iree-codegen-llvmgpu-enable-transform-dialect-jit=false --iree-codegen-llvmgpu-use-transform-dialect=td-reproducer.mlir -o good.mlir

Bad codegen:

iree-opt failed-multi-buffer.mlir  --pass-pipeline='builtin.module(hal.executable(hal.executable.variant(iree-llvmgpu-lower-executable-target)))' --iree-codegen-llvmgpu-enable-transform-dialect-jit=false --iree-codegen-llvmgpu-use-transform-dialect=td-reproducer.mlir -o bad.mlir

Looking at the bad codegen (bad.mlir from the reproducer), we can notice that:

subview_4 is outside of the loop (which is a subview of alloc_2)
ldmatrixs in the loop load from subview_4 Since subview_4 is not accessed in a pipelined fashion, the data accessed by ldmatrix is not properly staged.

Now looking at the good codegen for comparison, we can see that:

ldmatrix accesses a base address that is loop dependent
the staging of alloc_2 data happens properly towards the end of the loop through subview_30

Anyhow, again the right short term fix is probably not to pipeline when the buffers are not properly sized.

qcolombet commented 1 year ago

I have a hard time producing an invalid schedule with the pipelining optimization upstream. Yes, I can produce invalid schedules but only if I give invalid stages/op_orders attributes. I am starting to think that the issue is in how we produce the schedule information within IREE rather than a bug in the generic pipeliner.

I guess the generic pipeliner could be smarter about catching the inconsistency in the pipelining info, but that's probably hard to do with all the dialects that can be thrown at it.

qcolombet commented 1 year ago

At this point I think we need to fix these two functions:

getPipelineStages from LLVMGPUPipelining.cpp
getNvidiaAmpereTensorCorePipeline from GPUPipelining.cpp

We shouldn't produce a schedule if we cannot guarantee that it is valid.

iree-org / iree

`--iree-codegen-llvmgpu-enable-transform-dialect-matmul-tensorcore-strategy` miscompiles with mma-sync #13451

What happened?

Steps to reproduce your issue

What component(s) does this issue relate to?

Version information

Additional context

13811 is another instance of this issue in the aligned case, as suspected.