openxla / xla

[ROCm] failed to legalize operation 'math.exp' for exponential op with bf16 dtype #19700

Open hugomano opened 3 days ago

hugomano commented 3 days ago

The following MLIR code no longer compiles for the ROCm platform (ROCm 6.2 used here) since this commit: https://github.com/openxla/xla/commit/6e9eefeec077f49c2b22bfeee8da537ed8517b22

module @main.Exponential.forward attributes {mhlo.num_partitions = 1 : i32, mhlo.num_replicas = 1 : i32} {
  func.func @main(%arg0: tensor<4096x4096xbf16>, %arg1: tensor<4096x4096xbf16>) -> tensor<4096x4096xbf16> {
    %0 = stablehlo.add %arg0, %arg1 : tensor<4096x4096xbf16>
    %1 = stablehlo.exponential %0 : tensor<4096x4096xbf16>
    return %1 : tensor<4096x4096xbf16>
  }
}

Error traceback:

error(pjrt): [PJRT_Client_Compile] <unknown>:0: error: loc(callsite("loop_exponential_fusion" at "loop_exponential_fusion")): failed to legalize operation 'math.exp'
<unknown>:0: note: loc("loop_exponential_fusion"): called from
<unknown>:0: note: loc(callsite("loop_exponential_fusion" at "loop_exponential_fusion")): see current operation: %43 = "math.exp"(%42) <{fastmath = #arith.fastmath<afn>}> : (bf16) -> bf16

HLO dump:

*** Begin module_0001.main.Exponential.forward.before_optimizations.txt ***
HloModule main.Exponential.forward, entry_computation_layout={(bf16[4096,4096]{1,0}, bf16[4096,4096]{1,0})->bf16[4096,4096]{1,0}}

ENTRY main.5 {
  Arg_0.1 = bf16[4096,4096]{1,0} parameter(0)
  Arg_1.2 = bf16[4096,4096]{1,0} parameter(1)
  add.3 = bf16[4096,4096]{1,0} add(Arg_0.1, Arg_1.2), metadata={source_file="external/zml~/zml/tensor.zig" source_line=3661}
  ROOT exponential.4 = bf16[4096,4096]{1,0} exponential(add.3), metadata={source_file="external/zml~/zml/tensor.zig" source_line=1686}
}

*** End module_0001.main.Exponential.forward.before_optimizations.txt ***
*** Begin module_0001.main.Exponential.forward.autotune_results.pbtxt ***
version: 3

*** End module_0001.main.Exponential.forward.autotune_results.pbtxt ***
*** Begin module_0001.main.Exponential.forward.gpu_target_config.pbtxt ***
gpu_device_info {
  threads_per_block_limit: 1024
  threads_per_warp: 32
  shared_memory_per_block: 65536
  shared_memory_per_core: 65536
  threads_per_core_limit: 2048
  core_count: 35
  fpus_per_core: 128
  block_dim_limit_x: 2147483647
  block_dim_limit_y: 65536
  block_dim_limit_z: 65536
  memory_bandwidth: 35968000000
  l2_cache_size: 6291456
  clock_rate_ghz: 1.895
  device_memory_size: 31658606592
  shared_memory_per_block_optin: -1
  rocm_compute_capability {
    gcn_arch_name: "gfx1100"
  }
  registers_per_core_limit: 65536
  registers_per_block_limit: 65536
}
platform_name: "ROCM"
dnn_version_info {
  major: 1
  minor: 3
}

*** End module_0001.main.Exponential.forward.gpu_target_config.pbtxt ***
*** Begin module_0001.main.Exponential.forward.gfx1100_gpu_after_optimizations.txt ***
HloModule main.Exponential.forward, is_scheduled=true, entry_computation_layout={(bf16[4096,4096]{1,0}, bf16[4096,4096]{1,0})->bf16[4096,4096]{1,0}}, frontend_attributes={fingerprint_before_lhs="189671c249ae20e507ab215f3337a72e"}

fused_exponential {
  param_1.6 = bf16[4096,4096]{1,0} parameter(1)
  convert.3.3 = f32[4096,4096]{1,0} convert(param_1.6)
  param_0.7 = bf16[4096,4096]{1,0} parameter(0)
  convert.4.3 = f32[4096,4096]{1,0} convert(param_0.7)
  add.2.3 = f32[4096,4096]{1,0} add(convert.3.3, convert.4.3), metadata={source_file="external/zml~/zml/tensor.zig" source_line=3661}
  convert.5.3 = bf16[4096,4096]{1,0} convert(add.2.3)
  ROOT exponential.2.1 = bf16[4096,4096]{1,0} exponential(convert.5.3), metadata={source_file="external/zml~/zml/tensor.zig" source_line=1686}
} // fused_exponential

ENTRY main.5 {
  Arg_1.2.0 = bf16[4096,4096]{1,0} parameter(1)
  Arg_0.1.0 = bf16[4096,4096]{1,0} parameter(0)
  ROOT loop_exponential_fusion = bf16[4096,4096]{1,0} fusion(Arg_1.2.0, Arg_0.1.0), kind=kLoop, calls=fused_exponential, metadata={source_file="external/zml~/zml/tensor.zig" source_line=1686}
}

*** End module_0001.main.Exponential.forward.gfx1100_gpu_after_optimizations.txt ***
*** Begin module_0001.main.Exponential.forward.gfx1100_gpu_after_optimizations-buffer-assignment.txt ***
BufferAssignment:
allocation 0: size 33554432, output shape is |bf16[4096,4096]|, maybe-live-out:
 value: <9 loop_exponential_fusion @0> (size=33554432,offset=0): bf16[4096,4096]{1,0}
allocation 1: size 33554432, parameter 1, shape |bf16[4096,4096]| at ShapeIndex {}:
 value: <7 Arg_1.2.0 @0> (size=33554432,offset=0): bf16[4096,4096]{1,0}
allocation 2: size 33554432, parameter 0, shape |bf16[4096,4096]| at ShapeIndex {}:
 value: <8 Arg_0.1.0 @0> (size=33554432,offset=0): bf16[4096,4096]{1,0}

Total bytes used: 100663296 (96.00MiB)

Used values:
<7 Arg_1.2.0 @0>
 positions:
  Arg_1.2.0
 uses:
  loop_exponential_fusion, operand 0
 from instruction: %Arg_1.2.0 = bf16[4096,4096]{1,0} parameter(1)
<8 Arg_0.1.0 @0>
 positions:
  Arg_0.1.0
 uses:
  loop_exponential_fusion, operand 1
 from instruction: %Arg_0.1.0 = bf16[4096,4096]{1,0} parameter(0)
<9 loop_exponential_fusion @0>
 positions:
  loop_exponential_fusion
 uses:
 from instruction: %loop_exponential_fusion = bf16[4096,4096]{1,0} fusion(bf16[4096,4096]{1,0} %Arg_1.2.0, bf16[4096,4096]{1,0} %Arg_0.1.0), kind=kLoop, calls=%fused_exponential, metadata={source_file="external/zml~/zml/tensor.zig" source_line=1686}

HloLiveRange (max 3):
  InstructionSequence:
    0:Arg_1.2.0
    1:Arg_0.1.0
    2:loop_exponential_fusion
  BufferLiveRange:
    Arg_1.2.0{}:0-3
    Arg_0.1.0{}:0-3
    loop_exponential_fusion{}:2-3
  Live ranges at 2 (peak):
    Arg_1.2.0: 33554432 bytes
    Arg_0.1.0: 33554432 bytes
    loop_exponential_fusion: 33554432 bytes

*** End module_0001.main.Exponential.forward.gfx1100_gpu_after_optimizations-buffer-assignment.txt ***
*** Begin module_0001.main.Exponential.forward.gfx1100_gpu_after_optimizations-memory-usage-report.txt ***
Total bytes used: 100663296 (96.00MiB)

Allocations sorted by size:

cumulative_size; total_size - cumulative_size; allocation
------------------------------------------------------------------------------
  32.00MiB( 33%);   64.00MiB; allocation 0: size 32.00MiB, output shape is |bf16[4096,4096]|, maybe-live-out:
  64.00MiB( 67%);   32.00MiB; allocation 1: size 32.00MiB, parameter 1, shape |bf16[4096,4096]| at ShapeIndex {}:
  96.00MiB(100%);         0B; allocation 2: size 32.00MiB, parameter 0, shape |bf16[4096,4096]| at ShapeIndex {}:

Allocations sorted by size with their values:
allocation 0: size 32.00MiB, output shape is |bf16[4096,4096]|, maybe-live-out:
allocation 1: size 32.00MiB, parameter 1, shape |bf16[4096,4096]| at ShapeIndex {}:
allocation 2: size 32.00MiB, parameter 0, shape |bf16[4096,4096]| at ShapeIndex {}:

*** End module_0001.main.Exponential.forward.gfx1100_gpu_after_optimizations-memory-usage-report.txt ***
error(pjrt): [PJRT_Client_Compile] <unknown>:0: error: loc(callsite("loop_exponential_fusion" at "loop_exponential_fusion")): failed to legalize operation 'math.exp'
<unknown>:0: note: loc("loop_exponential_fusion"): called from
<unknown>:0: note: loc(callsite("loop_exponential_fusion" at "loop_exponential_fusion")): see current operation: %43 = "math.exp"(%42) <{fastmath = #arith.fastmath<afn>}> : (bf16) -> bf16

error(zml/module): pjrt-rocm failed to compile following valid MLIR:
module @main.Exponential.forward attributes {mhlo.num_partitions = 1 : i32, mhlo.num_replicas = 1 : i32} {
  func.func @main(%arg0: tensor<4096x4096xbf16>, %arg1: tensor<4096x4096xbf16>) -> tensor<4096x4096xbf16> {
    %0 = stablehlo.add %arg0, %arg1 : tensor<4096x4096xbf16>
    %1 = stablehlo.exponential %0 : tensor<4096x4096xbf16>
    return %1 : tensor<4096x4096xbf16>
  }
}

Best, Hugo

akuegel commented 1 day ago

I think I see the problem: the MathToROCDL pass in MLIR doesn't specify a lowering for f32, and the default handling for bf16 ops is to convert to f32 and reuse the f32 lowering, which doesn't work in this case. @draganmladjenovic can you maybe take a look at this?
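
To illustrate the path described above, here is a minimal sketch (not taken from the issue) of the default bf16 handling: the operand is extended to f32, the f32 math.exp is emitted, and the result is truncated back to bf16. Legalization then needs an f32 lowering for math.exp, which MathToROCDL does not supply here.

func.func @exp_bf16(%arg0: bf16) -> bf16 {
  // Widen the bf16 operand to f32 (default handling for bf16 math ops).
  %0 = arith.extf %arg0 : bf16 to f32
  // This f32 math.exp now needs a MathToROCDL lowering, which is missing.
  %1 = math.exp %0 : f32
  // Narrow the result back to bf16.
  %2 = arith.truncf %1 : f32 to bf16
  return %2 : bf16
}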

akuegel commented 1 day ago

This seems related to https://github.com/llvm/llvm-project/pull/102971. It should be verified whether that patch actually makes sense; I would have thought that if the intrinsics exist, the op would in the end be lowered to them anyway.

pifon2a commented 1 day ago

https://github.com/llvm/llvm-project/pull/102971 does not have any tests for bf16. I think this can be fixed upstream in MLIR using a pattern with logic similar to maybeCast in https://source.corp.google.com/piper///depot/google3/third_party/llvm/llvm-project/mlir/lib/Conversion/GPUCommon/OpToFuncCallLowering.h;rcl=699896658;l=98
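
For illustration, a minimal sketch of what a maybeCast-style rewrite could produce for the failing op, assuming the bf16 value is cast through f32 around a call to the f32 OCML device function (the __ocml_exp_f32 symbol is assumed here for the example, not taken from this issue):

llvm.func @__ocml_exp_f32(f32) -> f32

func.func @exp_bf16(%arg0: bf16) -> bf16 {
  // Cast the unsupported bf16 operand up to f32, mirroring maybeCast.
  %0 = arith.extf %arg0 : bf16 to f32
  // Call the f32 device-library function.
  %1 = llvm.call @__ocml_exp_f32(%0) : (f32) -> f32
  // Cast the result back down to bf16.
  %2 = arith.truncf %1 : f32 to bf16
  return %2 : bf16
}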