[GPU] Garbage values in output when unused inputs are present

martin-luecke commented 6 months ago

What happened?

I have an mlir input that I compile using iree-compile and run via rocm using iree-run-module where the number of inputs messes with the output of the program even if it is unused.

I include a reproducer below which takes an unused input (arg0) and writes the constant 1.0 into the output. When executed, the reproducer produces only garbage values. Commenting in the 4 lines to use arg0 and copy some values into shared memory makes the output work as expected. The input is not used in any other way.

Removing arg0 from the program entirely, such that there are no inputs at all makes the program work as well. This makes me think that the handling of the output buffer in the dispatch somehow depends on the input buffers and their usage.

Steps to reproduce your issue

This is my input IR:

#translation = #iree_codegen.translation_info<None workgroup_size = [128, 2, 1] subgroup_size = 64>
module {
  stream.executable private @func {
    stream.executable.export public @func workgroups() -> (index, index, index) {
      %c2 = arith.constant 2 : index
      %c1 = arith.constant 1 : index
      stream.return %c2, %c2, %c1 : index, index, index
    }
    builtin.module {
      func.func @func(%arg0: !stream.binding,  %arg2: !stream.binding) attributes {translation_info = #translation} {
        %c19 = arith.constant 19 : index
        %c18 = arith.constant 18 : index
        %c17 = arith.constant 17 : index
        %c3 = arith.constant 3 : index
        %c2 = arith.constant 2 : index
        %c1 = arith.constant 1 : index
        %c4 = arith.constant 4 : index
        %c64 = arith.constant 64 : index
        %c32 = arith.constant 32 : index
        %c16 = arith.constant 16 : index
        %c0 = arith.constant 0 : index
        %workgroup_id_0 = stream.dispatch.workgroup.id[0] : index
        %workgroup_id_1 = stream.dispatch.workgroup.id[1] : index
        %thread_id_x = gpu.thread_id  x
        %thread_id_y = gpu.thread_id  y

        // Commenting in these 4 lines makes the program work
        // %alloc_0 = memref.alloc() : memref<4x4xf32, #gpu.address_space<workgroup>>
        // %0 = stream.binding.subspan %arg0[%c0] : !stream.binding -> memref<128x256xf32>
        // %2 = vector.load %0[%c0, %c0] : memref<128x256xf32>, vector<4xf32>
        // vector.store %2, %alloc_0[%c0, %c0] : memref<4x4xf32, #gpu.address_space<workgroup>>, vector<4xf32>

        // writing 1.0 to all positions in the output
        %cst = arith.constant dense<1.000000e+00> : vector<1xf32>
        %4 = arith.divsi %thread_id_x, %c64 : index
        %5 = arith.muli %4, %c32 : index
        %6 = arith.muli %workgroup_id_0, %c64 : index
        %7 = arith.remsi %thread_id_x, %c16 : index
        %8 = arith.remsi %thread_id_x, %c64 : index
        %9 = arith.divsi %8, %c16 : index
        %10 = arith.muli %9, %c4 : index
        %11 = arith.muli %thread_id_y, %c32 : index
        %12 = arith.muli %workgroup_id_1, %c64 : index
        %13 = arith.addi %7, %12 : index
        %14 = arith.addi %13, %11 : index
        %15 = arith.addi %14, %c16 : index
        %16 = stream.binding.subspan %arg2[%c0] : !stream.binding -> memref<128x128xf32>
        %17 = arith.addi %6, %5 : index
        %18 = arith.addi %17, %10 : index
        vector.store %cst, %16[%18, %14] : memref<128x128xf32>, vector<1xf32>
        %19 = arith.addi %18, %c1 : index
        vector.store %cst, %16[%19, %14] : memref<128x128xf32>, vector<1xf32>
        %20 = arith.addi %18, %c2 : index
        vector.store %cst, %16[%20, %14] : memref<128x128xf32>, vector<1xf32>
        %21 = arith.addi %18, %c3 : index
        vector.store %cst, %16[%21, %14] : memref<128x128xf32>, vector<1xf32>
        vector.store %cst, %16[%18, %15] : memref<128x128xf32>, vector<1xf32>
        vector.store %cst, %16[%19, %15] : memref<128x128xf32>, vector<1xf32>
        vector.store %cst, %16[%20, %15] : memref<128x128xf32>, vector<1xf32>
        vector.store %cst, %16[%21, %15] : memref<128x128xf32>, vector<1xf32>
        %22 = arith.addi %18, %c16 : index
        vector.store %cst, %16[%22, %14] : memref<128x128xf32>, vector<1xf32>
        %23 = arith.addi %18, %c17 : index
        vector.store %cst, %16[%23, %14] : memref<128x128xf32>, vector<1xf32>
        %24 = arith.addi %18, %c18 : index
        vector.store %cst, %16[%24, %14] : memref<128x128xf32>, vector<1xf32>
        %25 = arith.addi %18, %c19 : index
        vector.store %cst, %16[%25, %14] : memref<128x128xf32>, vector<1xf32>
        vector.store %cst, %16[%22, %15] : memref<128x128xf32>, vector<1xf32>
        vector.store %cst, %16[%23, %15] : memref<128x128xf32>, vector<1xf32>
        vector.store %cst, %16[%24, %15] : memref<128x128xf32>, vector<1xf32>
        vector.store %cst, %16[%25, %15] : memref<128x128xf32>, vector<1xf32>
        return
      }
    }
  }
  func.func @isolated_benchmark(%arg0: tensor<128x256xf32>) -> tensor<128x128xf32> {
    %2 = flow.dispatch @func::@func(%arg0) : (tensor<128x256xf32>) -> tensor<128x128xf32>
    return %2 : tensor<128x128xf32>
  }
}

Compiled using:

iree-compile example.mlir \
    --iree-hal-target-backends=rocm \
    -o example.vmfb

Ran using:

iree-run-module \
  --device=rocm://0 \
  --device_allocator=caching \
  --module=example.vmfb \
  --function=isolated_benchmark \
  --input=4xf32 \

What component(s) does this issue relate to?

No response

Version information

Built from commit161be85aeb4b5003a1a19053967c35f2bd00c762

Additional context

No response

benvanik commented 6 months ago

You're changing the ABI and inducing undefined behavior :) You cannot change the function interface after the dispatch has been formed. A flow.dispatch must be consistent with the target of the dispatch (in this case, your function).

benvanik commented 6 months ago

(the technical detail here is that ROCM takes all buffers and other arguments as a packed struct, and if you change the order/counts of any input that struct layout changes - if you pass a struct in one layout from the caller to the callee expecting a different layout you'll get garbage)

martin-luecke commented 6 months ago

When I remove the input I change the dispatch as well, so without the input my dispatch looks like this:

  func.func @isolated_benchmark() -> tensor<128x128xf32> {
    %2 = flow.dispatch @func::@func() : () -> tensor<128x128xf32>
    return %2 : tensor<128x128xf32>
  }

When I change the layout of inputs, without adjusting the dispatch I get an error about the type mismatch as expected.

Or are you referring to another portion of the code?

benvanik commented 6 months ago

Not sure I'm following.

The rule is: the dispatch and the dispatchee must match exactly. Making that true can be tricky when mixing layers of the stack (stream and flow are two different layers), which is why the inline dispatch region ops (flow.dispatch.region and flow.dispatch.workgroups) exist as they take care of everything for you and lower properly.

benvanik commented 6 months ago

Reopening in case you find the issue - I'm not clear what's happening and would need more information.

qedawkins commented 6 months ago

Before "codegen" the IR looks something like this when the binding is not used

module attributes {hal.device.targets = [#device_target_rocm]} {
  hal.executable private @func {
    hal.executable.variant public @rocm_hsaco_fb target(#executable_target_rocm_hsaco_fb) {
      hal.executable.export public @func ordinal(0) layout(#pipeline_layout) attributes {hal.interface.bindings = [#hal.interface.binding<0, 0>, #hal.interface.binding<0, 1>]} {
      ^bb0(%arg0: !hal.device):
        %c2 = arith.constant 2 : index
        %c1 = arith.constant 1 : index
        hal.return %c2, %c2, %c1 : index, index, index
      }
      builtin.module {
        func.func @func() attributes {translation_info = #translation} {
          %cst = arith.constant dense<1.000000e+00> : vector<1xf32>
          ...
          %12 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%c0) : memref<128x128xf32>
          ...
          vector.store %cst, %12[%21, %11] : memref<128x128xf32>, vector<1xf32>
          return
        }
      }
    }
  }
  util.func public @isolated_benchmark(%arg0: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub, iree.reflection = {iree.abi.declaration = "sync func @isolated_benchmark(%input0: tensor<128x256xf32>) -> (%output0: tensor<128x128xf32>)"}} {
    %c65536 = arith.constant 65536 : index
    %c131072 = arith.constant 131072 : index
    %c0 = arith.constant 0 : index
    %c256 = arith.constant 256 : index
    %c128 = arith.constant 128 : index
    %element_type_f32 = hal.element_type<f32> : i32
    %dense_row_major = hal.encoding_type<dense_row_major> : i32
    hal.buffer_view.assert<%arg0 : !hal.buffer_view> message("input0") shape([%c128, %c256]) type(%element_type_f32) encoding(%dense_row_major)
    %0 = stream.tensor.import %arg0 : !hal.buffer_view -> tensor<128x256xf32> in !stream.resource<external>{%c131072}
    %result, %result_timepoint = stream.resource.alloca uninitialized : !stream.resource<external>{%c65536} => !stream.timepoint
    %1 = stream.cmd.execute await(%result_timepoint) => with(%0 as %arg1: !stream.resource<external>{%c131072}, %result as %arg2: !stream.resource<external>{%c65536}) {
      stream.cmd.dispatch @func::@rocm_hsaco_fb::@func {
        ro %arg1[%c0 for %c131072] : !stream.resource<external>{%c131072},
        wo %arg2[%c0 for %c65536] : !stream.resource<external>{%c65536}
      }
    } => !stream.timepoint
    %2 = stream.timepoint.await %1 => %result : !stream.resource<external>{%c65536}
    %3 = stream.tensor.export %2 : tensor<128x128xf32> in !stream.resource<external>{%c65536} -> !hal.buffer_view
    util.return %3 : !hal.buffer_view
  }
}

Looking at this, the pipeline layout includes both bindings (as does the dispatch) but the kernel only uses one of the bindings.

The llvm func we produce for rocm only takes a single argument however

llvm.func @func(%arg0: !llvm.ptr {llvm.align = 16 : i32, llvm.noalias}) attributes {translation_info = #iree_codegen.translation_info<None workgroup_size = [128, 2, 1] subgroup_size = 64>} {

I'm not sure if the input IR should be considered valid, but it does look like rocm might not be obeying the pipeline layout when constructing the llvm function.

qedawkins commented 6 months ago

SPIR-V doesn't let this compile

in.mlir:47:15: error: failed to materialize conversion for result #0 of operation 'hal.interface.binding.subspan' that remained live after conversion
        %16 = stream.binding.subspan %arg2[%c0] : !stream.binding -> memref<128x128xf32>
              ^
in.mlir:47:15: note: see current operation: %30 = "hal.interface.binding.subspan"(%11) {alignment = 64 : index, binding = 1 : index, descriptor_type = #hal.descriptor_type<storage_buffer>, operandSegmentSizes = array<i32: 1, 0>, set = 0 : index} : (index) -> memref<128x128xf32>
in.mlir:50:9: note: see existing live user here: vector.store %cst, %13[%15, %10] : memref<128x128xf32>, vector<1xf32>
        vector.store %cst, %16[%18, %14] : memref<128x128xf32>, vector<1xf32>

benvanik commented 6 months ago

That's believable - the LLVMGPU lowerings have always been sketchy. Today I strongly suspect any difference between the dispatch and the dispatchee will not work correctly. It should, but that requires fixes.

benvanik commented 6 months ago

that's odd as SPIR-V should definitely allow that since we're using Vulkan/SPIR-V's binding model - if there's one codegen target I'd expect to work it'd be that one :P

qedawkins commented 6 months ago

ah wait the problem on SPIR-V is with FlattenMemrefSubspan. Let me try 1d memrefs.

qedawkins commented 6 months ago

Conversion to SPIR-V crashes somewhere in the type converter...

qedawkins commented 6 months ago

ok this IR is unfriendly to SPIR-V lowerings (we expect more things like memory space annotations on memrefs and such) but after some modifications it looks like SPIR-V is doing this properly. We get the single set 0 binding 1

spirv.GlobalVariable @__resource_var_0_1_ bind(0, 1) : !spirv.ptr<!spirv.struct<(!spirv.array<16384 x f32, stride=4> [0])>, StorageBuffer>

which the driver handles fine. So does look like a rocm bug.

I wonder if CPU handles this properly.

iree-org / iree