iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.
http://iree.dev/

GPU/SPIR-V codegen should lower `hal.interface.workgroup.size` ops. #16554

Open benvanik opened 4 months ago

benvanik commented 4 months ago

Today this op does not seem to lower to `WorkgroupSize` (SPIR-V) or `blockDim` (CUDA):

tools/test/iree-benchmark-executable.mlir:56:27: error: failed to legalize operation 'hal.interface.workgroup.size'
      %workgroup_size_x = hal.interface.workgroup.size[0] : index
                          ^
tools/test/iree-benchmark-executable.mlir:56:27: note: see current operation: %12 = "hal.interface.workgroup.size"() {dimension = 0 : index} : () -> index
tools/test/iree-benchmark-executable.mlir:44:1: error: failed to run translation of source executable to target executable for backend #hal.executable.target<"vulkan", "vulkan-spirv-fb", {spirv.target_env = #spirv.target_env<#spirv.vce<v1.3, [Shader, GroupNonUniform], [SPV_KHR_storage_buffer_storage_class, SPV_KHR_variable_pointers]>, api=Vulkan, #spirv.resource_limits<max_compute_workgroup_size = [128, 128, 64], subgroup_size = 64, cooperative_matrix_properties_khr = []>>}>

Supporting this is needed to write portable inputs that lower to all backends (we can't use gpu dialect ops).
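
For illustration, here is a rough sketch of what the lowering could produce on the GPU backends (hedged; the SSA names below are placeholders, and the actual passes may handle this differently, e.g. by folding the op to the constant pinned in the export's `workgroup_size` attribute, since both SPIR-V and CUDA fix the workgroup size at compile time):

%workgroup_size_x = hal.interface.workgroup.size[0] : index
// One plausible lowering target: the gpu dialect query that maps to
// blockDim.x on CUDA and the workgroup/local size on SPIR-V
// (hypothetical, for illustration only).
%size_from_gpu = gpu.block_dim x
// Or, with workgroup_size = [1 : index, 1 : index, 1 : index] on the export,
// simply the matching constant:
%size_from_attr = arith.constant 1 : index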

The following input fails in the same way with `iree-compile --iree-hal-target-backends=vulkan-spirv` or `=cuda`, and passes on `llvm-cpu`:

#pipeline_layout = #hal.pipeline.layout<push_constants = 0, sets = [
  #hal.descriptor_set.layout<0, bindings = [
    #hal.descriptor_set.binding<0, storage_buffer>,
    #hal.descriptor_set.binding<1, storage_buffer>,
    #hal.descriptor_set.binding<2, storage_buffer>
  ]>
]>
hal.executable.source public @executable {
  hal.executable.export public @elementwise_mul ordinal(0) layout(#pipeline_layout) attributes {
    workgroup_size = [1 : index, 1 : index, 1 : index]
  } {
  ^bb0(%device: !hal.device):
    // UNUSED
    %c1 = arith.constant 1 : index
    hal.return %c1, %c1, %c1 : index, index, index
  }
  builtin.module {
    func.func @elementwise_mul() {
      %lhs = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(32) : !flow.dispatch.tensor<readonly:tensor<4xf32>>
      %rhs = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(32) : !flow.dispatch.tensor<readonly:tensor<4xf32>>
      %dst = hal.interface.binding.subspan set(0) binding(2) type(storage_buffer) alignment(32) : !flow.dispatch.tensor<writeonly:tensor<4xf32>>
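      // Query the workgroup size/id/count that the HAL provides at dispatch
      // time and derive this workgroup's starting offset and stride over the
      // 4-element workload.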
      %workgroup_size_x = hal.interface.workgroup.size[0] : index
      %workgroup_id_x = hal.interface.workgroup.id[0] : index
      %workgroup_count_x = hal.interface.workgroup.count[0] : index
      %base_i = affine.apply affine_map<()[s0, s1] -> (s0 * s1)>()[%workgroup_id_x, %workgroup_size_x]
      %step_i = affine.apply affine_map<()[s0, s1] -> (s0 * s1)>()[%workgroup_count_x, %workgroup_size_x]
      %end_i = arith.constant 4 : index
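      // Workgroup-strided loop: each iteration processes a tile of up to
      // %workgroup_size_x elements, clamped to what remains.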
      scf.for %i = %base_i to %end_i step %step_i {
        %remaining = affine.min affine_map<(d0)[s0] -> (s0, -d0 + 4)>(%i)[%workgroup_size_x]
        %lhs_tile = flow.dispatch.tensor.load %lhs, offsets = [%i], sizes = [%remaining], strides = [1] : !flow.dispatch.tensor<readonly:tensor<4xf32>> -> tensor<?xf32>
        %rhs_tile = flow.dispatch.tensor.load %rhs, offsets = [%i], sizes = [%remaining], strides = [1] : !flow.dispatch.tensor<readonly:tensor<4xf32>> -> tensor<?xf32>
        %dst_init = tensor.empty(%remaining) : tensor<?xf32>
        %dst_tile = linalg.generic {
          indexing_maps = [affine_map<(d0) -> (d0)>, affine_map<(d0) -> (d0)>, affine_map<(d0) -> (d0)>],
          iterator_types = ["parallel"]
        } ins(%lhs_tile, %rhs_tile : tensor<?xf32>, tensor<?xf32>)
          outs(%dst_init : tensor<?xf32>) {
          ^bb0(%lhs_value: f32, %rhs_value: f32, %init_value: f32):
            %dst_value = arith.mulf %lhs_value, %rhs_value : f32
            linalg.yield %dst_value : f32
          } -> tensor<?xf32>
        flow.dispatch.tensor.store %dst_tile, %dst, offsets = [%i], sizes = [%remaining], strides = [1] : tensor<?xf32> -> !flow.dispatch.tensor<writeonly:tensor<4xf32>>
      }
      return
    }
  }
}
vivekvpandya commented 3 months ago

This seems to have been fixed already. On the latest build, the following commands work fine:

./iree-compile --compile-mode=hal-executable --iree-hal-target-backends=cuda ~/dev/iree/tools/test/iree-benchmark-executable.mlir
./iree-compile --compile-mode=hal-executable --iree-hal-target-backends=vulkan-spirv ~/dev/iree/tools/test/iree-benchmark-executable.mlir