iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.
http://iree.dev/
Apache License 2.0

Modules with convolutions fail to load on ROCM. #18534

Closed JamesMBartlett closed 1 month ago

JamesMBartlett commented 1 month ago

What happened?

I'm seeing the following error when attempting to run any module with convolution operators on ROCM with a gfx1100 target:

```
iree/runtime/src/iree/hal/drivers/hip/native_executable.c:309: INTERNAL; HIP driver error 'hipErrorNoBinaryForGpu' (209): no kernel image is available for execution on the device; mismatched target chip? missing/wrong bitcode directory?; while invoking native function hal.executable.create; while calling import;
```

The minimal repro I have is the following MLIR:

```mlir
func.func @conv2d(%arg0: tensor<1x3x420x420xf32>) -> tensor<1x32x208x208xf32> {
  %cst = arith.constant dense<1.000000e+00> : tensor<32x3x6x6xf32>
  %0 = tensor.empty() : tensor<1x32x208x208xf32>
  %1 = linalg.conv_2d_nchw_fchw {dilations = dense<1> : vector<2xi64>, strides = dense<2> : vector<2xi64>} ins(%arg0, %cst : tensor<1x3x420x420xf32>, tensor<32x3x6x6xf32>) outs(%0 : tensor<1x32x208x208xf32>) -> tensor<1x32x208x208xf32>
  return %1 : tensor<1x32x208x208xf32>
}
```
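As a sanity check on the repro's shapes: the output extent 208 follows from the standard unpadded-convolution size formula (the helper below is ours, not part of IREE, and just restates that formula for the stride-2, dilation-1 case above):

```python
def conv_out_dim(in_dim, k, stride=1, dilation=1):
    """Output spatial size of an unpadded convolution:
    floor((in - dilation*(k-1) - 1) / stride) + 1."""
    return (in_dim - dilation * (k - 1) - 1) // stride + 1

# Repro shapes: input 1x3x420x420, filter 32x3x6x6, stride 2, dilation 1.
print(conv_out_dim(420, 6, stride=2), conv_out_dim(420, 6, stride=2))  # -> 208 208
```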

Steps to reproduce your issue

  1. Build iree-compile and iree-run-module off of main.
  2. Run: iree-compile --iree-hal-target-backends=rocm --iree-hip-target=gfx1100 --iree-hip-bc-dir=/opt/rocm/amdgcn/bitcode linalg_conv.mlir -o linalg_conv.vmfb
  3. Then running the built module (HIP_VISIBLE_DEVICES=0 iree-run-module --device='hip://0' --module=linalg_conv.vmfb --input="1x3x420x420xf32=1.1") will cause the above error.

where linalg_conv.mlir is the MLIR from above.

I also tried without the explicit bitcode directory and it failed in the same way.

I also tried with the gfx1036 target and the integrated GPU on my machine and saw the same error.

What component(s) does this issue relate to?

Compiler, Runtime

Version information

IREE Commit: 7d823d228b9ed9021d4501de98cf2c462957a2f8

Devices

```
HIP_VISIBLE_DEVICES=0 iree-run-module --dump_devices=hip
# ============================================================================
# Enumerated devices for driver 'hip'
# ============================================================================

# ===----------------------------------------------------------------------===
# --device=hip://GPU-33333137-3833-6230-3364-663061636638
# Radeon RX 7900 XTX
# ===----------------------------------------------------------------------===
- amdhip64_dylib_path: /opt/rocm-6.2.0/lib/libamdhip64.so
- gpu-compute-capability: 11.0
- gpu-arch-name: gfx1100
- launch-max-block-dims: (1024, 1024, 1024)
- block-max-thread-count: 1024
- block-max-32-bit-register-count: 65536
- block-max-shared-memory: 64 KB
- memory-is-integrated-memory: 0
- memory-supports-managed-memory: 1
- memory-total-const-memory-size: 2047 MB
- memory-total-global-memory-size: 24560 MB
- memory-l2-cache-size: 6291456 bytes
- gpu-compute-unit-count: 48
- gpu-compute-max-clock-rate: 2482 mHz
- gpu-memory-max-clock-rate: 1249 mHz
- gpu-warp-size: 32
```
rocminfo

```
ROCk module version 6.8.5 is loaded
=====================
HSA System Attributes
=====================
Runtime Version:  1.14
Runtime Ext Version:  1.6
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:  LARGE
System Endianness:  LITTLE
Mwaitx:  DISABLED
DMAbuf Support:  YES

==========
HSA Agents
==========
*******
Agent 1
*******
  Name:  AMD Ryzen 7 7700 8-Core Processor
  Uuid:  CPU-XX
  Marketing Name:  AMD Ryzen 7 7700 8-Core Processor
  Vendor Name:  CPU
  Feature:  None specified
  Profile:  FULL_PROFILE
  Float Round Mode:  NEAR
  Max Queue Number:  0(0x0)
  Queue Min Size:  0(0x0)
  Queue Max Size:  0(0x0)
  Queue Type:  MULTI
  Node:  0
  Device Type:  CPU
  Cache Info:
    L1:  32768(0x8000) KB
  Chip ID:  0(0x0)
  ASIC Revision:  0(0x0)
  Cacheline Size:  64(0x40)
  Max Clock Freq. (MHz):  3800
  BDFID:  0
  Internal Node ID:  0
  Compute Unit:  16
  SIMDs per CU:  0
  Shader Engines:  0
  Shader Arrs. per Eng.:  0
  WatchPts on Addr. Ranges:1
  Memory Properties:
  Features:  None
  Pool Info:
    Pool 1
      Segment:  GLOBAL; FLAGS: FINE GRAINED
      Size:  131088968(0x7d04248) KB
      Allocatable:  TRUE
      Alloc Granule:  4KB
      Alloc Recommended Granule:4KB
      Alloc Alignment:  4KB
      Accessible by all:  TRUE
    Pool 2
      Segment:  GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:  131088968(0x7d04248) KB
      Allocatable:  TRUE
      Alloc Granule:  4KB
      Alloc Recommended Granule:4KB
      Alloc Alignment:  4KB
      Accessible by all:  TRUE
    Pool 3
      Segment:  GLOBAL; FLAGS: COARSE GRAINED
      Size:  131088968(0x7d04248) KB
      Allocatable:  TRUE
      Alloc Granule:  4KB
      Alloc Recommended Granule:4KB
      Alloc Alignment:  4KB
      Accessible by all:  TRUE
  ISA Info:
*******
Agent 2
*******
  Name:  gfx1100
  Uuid:  GPU-331783b03df0acf8
  Marketing Name:  Radeon RX 7900 XTX
  Vendor Name:  AMD
  Feature:  KERNEL_DISPATCH
  Profile:  BASE_PROFILE
  Float Round Mode:  NEAR
  Max Queue Number:  128(0x80)
  Queue Min Size:  64(0x40)
  Queue Max Size:  131072(0x20000)
  Queue Type:  MULTI
  Node:  1
  Device Type:  GPU
  Cache Info:
    L1:  32(0x20) KB
    L2:  6144(0x1800) KB
    L3:  98304(0x18000) KB
  Chip ID:  29772(0x744c)
  ASIC Revision:  0(0x0)
  Cacheline Size:  64(0x40)
  Max Clock Freq. (MHz):  2482
  BDFID:  768
  Internal Node ID:  1
  Compute Unit:  96
  SIMDs per CU:  2
  Shader Engines:  6
  Shader Arrs. per Eng.:  2
  WatchPts on Addr. Ranges:4
  Coherent Host Access:  FALSE
  Memory Properties:
  Features:  KERNEL_DISPATCH
  Fast F16 Operation:  TRUE
  Wavefront Size:  32(0x20)
  Workgroup Max Size:  1024(0x400)
  Workgroup Max Size per Dimension:
    x  1024(0x400)
    y  1024(0x400)
    z  1024(0x400)
  Max Waves Per CU:  32(0x20)
  Max Work-item Per CU:  1024(0x400)
  Grid Max Size:  4294967295(0xffffffff)
  Grid Max Size per Dimension:
    x  4294967295(0xffffffff)
    y  4294967295(0xffffffff)
    z  4294967295(0xffffffff)
  Max fbarriers/Workgrp:  32
  Packet Processor uCode::  232
  SDMA engine uCode::  21
  IOMMU Support::  None
  Pool Info:
    Pool 1
      Segment:  GLOBAL; FLAGS: COARSE GRAINED
      Size:  25149440(0x17fc000) KB
      Allocatable:  TRUE
      Alloc Granule:  4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:  4KB
      Accessible by all:  FALSE
    Pool 2
      Segment:  GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:  25149440(0x17fc000) KB
      Allocatable:  TRUE
      Alloc Granule:  4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:  4KB
      Accessible by all:  FALSE
    Pool 3
      Segment:  GROUP
      Size:  64(0x40) KB
      Allocatable:  FALSE
      Alloc Granule:  0KB
      Alloc Recommended Granule:0KB
      Alloc Alignment:  0KB
      Accessible by all:  FALSE
  ISA Info:
    ISA 1
      Name:  amdgcn-amd-amdhsa--gfx1100
      Machine Models:  HSA_MACHINE_MODEL_LARGE
      Profiles:  HSA_PROFILE_BASE
      Default Rounding Mode:  NEAR
      Default Rounding Mode:  NEAR
      Fast f16:  TRUE
      Workgroup Max Size:  1024(0x400)
      Workgroup Max Size per Dimension:
        x  1024(0x400)
        y  1024(0x400)
        z  1024(0x400)
      Grid Max Size:  4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x  4294967295(0xffffffff)
        y  4294967295(0xffffffff)
        z  4294967295(0xffffffff)
      FBarrier Max Size:  32
*******
Agent 3
*******
  Name:  gfx1100
  Uuid:  GPU-XX
  Marketing Name:  AMD Radeon Graphics
  Vendor Name:  AMD
  Feature:  KERNEL_DISPATCH
  Profile:  BASE_PROFILE
  Float Round Mode:  NEAR
  Max Queue Number:  128(0x80)
  Queue Min Size:  64(0x40)
  Queue Max Size:  131072(0x20000)
  Queue Type:  MULTI
  Node:  2
  Device Type:  GPU
  Cache Info:
    L1:  16(0x10) KB
    L2:  256(0x100) KB
  Chip ID:  5710(0x164e)
  ASIC Revision:  1(0x1)
  Cacheline Size:  64(0x40)
  Max Clock Freq. (MHz):  2200
  BDFID:  3328
  Internal Node ID:  2
  Compute Unit:  2
  SIMDs per CU:  2
  Shader Engines:  1
  Shader Arrs. per Eng.:  1
  WatchPts on Addr. Ranges:4
  Coherent Host Access:  FALSE
  Memory Properties:  APU
  Features:  KERNEL_DISPATCH
  Fast F16 Operation:  TRUE
  Wavefront Size:  32(0x20)
  Workgroup Max Size:  1024(0x400)
  Workgroup Max Size per Dimension:
    x  1024(0x400)
    y  1024(0x400)
    z  1024(0x400)
  Max Waves Per CU:  32(0x20)
  Max Work-item Per CU:  1024(0x400)
  Grid Max Size:  4294967295(0xffffffff)
  Grid Max Size per Dimension:
    x  4294967295(0xffffffff)
    y  4294967295(0xffffffff)
    z  4294967295(0xffffffff)
  Max fbarriers/Workgrp:  32
  Packet Processor uCode::  21
  SDMA engine uCode::  9
  IOMMU Support::  None
  Pool Info:
    Pool 1
      Segment:  GLOBAL; FLAGS: COARSE GRAINED
      Size:  65544484(0x3e82124) KB
      Allocatable:  TRUE
      Alloc Granule:  4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:  4KB
      Accessible by all:  FALSE
    Pool 2
      Segment:  GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:  65544484(0x3e82124) KB
      Allocatable:  TRUE
      Alloc Granule:  4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:  4KB
      Accessible by all:  FALSE
    Pool 3
      Segment:  GROUP
      Size:  64(0x40) KB
      Allocatable:  FALSE
      Alloc Granule:  0KB
      Alloc Recommended Granule:0KB
      Alloc Alignment:  0KB
      Accessible by all:  FALSE
  ISA Info:
    ISA 1
      Name:  amdgcn-amd-amdhsa--gfx1100
      Machine Models:  HSA_MACHINE_MODEL_LARGE
      Profiles:  HSA_PROFILE_BASE
      Default Rounding Mode:  NEAR
      Default Rounding Mode:  NEAR
      Fast f16:  TRUE
      Workgroup Max Size:  1024(0x400)
      Workgroup Max Size per Dimension:
        x  1024(0x400)
        y  1024(0x400)
        z  1024(0x400)
      Grid Max Size:  4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x  4294967295(0xffffffff)
        y  4294967295(0xffffffff)
        z  4294967295(0xffffffff)
      FBarrier Max Size:  32
*** Done ***
```

I also tried installing ROCM 6.1 instead and saw the same issue.

Additional context

I don't have an issue when running modules without convolutions. For example, I've run both of the following MLIR funcs successfully with the same setup:

abs.mlir (the element type was eaten by HTML escaping; `f32` restored to match the rest of this issue)

```mlir
func.func @abs(%input : tensor<f32>) -> (tensor<f32>) {
  %result = math.absf %input : tensor<f32>
  return %result : tensor<f32>
}
```
add.mlir

```mlir
func.func @main(%arg0: !torch.vtensor<[2,3,4],f32>) -> (!torch.vtensor<[2,3,4],f32>, !torch.vtensor<[2,3,4],f32>) {
  %int_const = torch.constant.int 1
  %0 = torch.aten.add.Tensor %arg0, %arg0, %int_const : !torch.vtensor<[2,3,4],f32>, !torch.vtensor<[2,3,4],f32>, !torch.int -> !torch.vtensor<[2,3,4],f32>
  return %0, %0 : !torch.vtensor<[2,3,4],f32>, !torch.vtensor<[2,3,4],f32>
}
```
kuhar commented 1 month ago

I triaged this: the issue is caused by us emitting a call to `malloc` that doesn't get resolved. We should be using `alloca` instead.

The emitted ISA references the unresolved symbol:

```
s_add_u32 s4, s4, malloc@gotpcrel32@lo+4
s_addc_u32 s5, s5, malloc@gotpcrel32@hi+12
```

which comes from this LLVM IR:

```llvm
%27 = call ptr @malloc(i64 ptrtoint (ptr getelementptr (float, ptr null, i32 8) to i64))
%28 = addrspacecast ptr %27 to ptr addrspace(5)
```

which in turn is the lowering of this MLIR:

```mlir
%alloc = memref.alloc() : memref<1x2x1x4xf32, #gpu.address_space<private>>
```

This alloc is generated by iree-codegen-iree-comprehensive-bufferize.

Relevant piece of code: https://github.com/iree-org/iree/blob/740e301d61e18a5833e1a2d75b476ae850f8c17e/compiler/src/iree/compiler/Codegen/LLVMGPU/Passes.cpp#L113
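For illustration, the suggested fix amounts to bufferization producing a stack allocation for this small private buffer instead of a heap allocation; a sketch of the two forms (not actual IREE pass output):

```mlir
// Current: heap allocation; its standard memref-to-LLVM lowering is a call
// to `malloc`, which has no definition in the device binary.
%alloc = memref.alloc() : memref<1x2x1x4xf32, #gpu.address_space<private>>

// Suggested: function-local stack allocation; lowers to an LLVM `alloca`
// and needs no runtime support on the device.
%alloca = memref.alloca() : memref<1x2x1x4xf32, #gpu.address_space<private>>
```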