iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.
http://iree.dev/

Numeric issues on AMDGPU with f32 elementwise mul + f16 trunc #18746

Open nirvedhmeshram opened 2 weeks ago

nirvedhmeshram commented 2 weeks ago

For this elementwise + pad dispatch:

      func.func @main(%8 : tensor<2x640x128x128xf32>, %9 : tensor<640xf32>) -> tensor<2x130x130x640xf16> {
        %c0_f16 = arith.constant 0.0 : f16
        %13 = tensor.empty() : tensor<2x128x128x640xf16>
        %14 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d0, d3, d1, d2)>, affine_map<(d0, d1, d2, d3) -> (d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>], iterator_types = ["parallel", "parallel", "parallel", "parallel"]} ins(%8, %9: tensor<2x640x128x128xf32>, tensor<640xf32>) outs(%13 : tensor<2x128x128x640xf16>) {
        ^bb0(%in: f32, %in_2: f32, %out: f16):
          %16 = arith.mulf %in, %in_2 : f32
          %19 = arith.truncf %16 : f32 to f16
          linalg.yield %19 : f16
        } -> tensor<2x128x128x640xf16>
        %pad = tensor.pad %14 low[0, 1, 1, 0] high[0, 1, 1, 0]  {
          ^bb0(%arg0: index, %arg1: index, %arg2: index, %arg3: index):
          tensor.yield %c0_f16 : f16
        } : tensor<2x128x128x640xf16> to tensor<2x130x130x640xf16>
        return %pad : tensor<2x130x130x640xf16>
      }
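For reference, here is a NumPy restatement of what this dispatch computes (my own sketch, not from the original issue): the generic multiplies the NCHW f32 input by a per-channel f32 scale, truncates to f16, and transposes to NHWC; the pad then adds a one-element zero border on the two spatial dims.

    import numpy as np

    def reference(x: np.ndarray, scale: np.ndarray) -> np.ndarray:
        # x: 2x640x128x128 f32 (NCHW), scale: 640 f32 (per channel)
        y = (x * scale[None, :, None, None]).astype(np.float16)  # mulf in f32, truncf to f16
        y = y.transpose(0, 2, 3, 1)                              # NCHW -> NHWC: 2x128x128x640
        return np.pad(y, ((0, 0), (1, 1), (1, 1), (0, 0)))       # zero-pad H and W: 2x130x130x640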

there were numeric discrepancies between all three of these pipelines/lowering configs:

1. translation_info = #iree_codegen.translation_info<LLVMGPUVectorize workgroup_size = [128, 1, 1] subgroup_size = 64>
    lowering_config = #iree_codegen.lowering_config<tile_sizes = [[1, 1, 1, 512]]>
2. translation_info = #iree_codegen.translation_info<LLVMGPUTileAndFuse workgroup_size = [64, 1, 1] subgroup_size = 64>
    lowering_config = #iree_gpu.lowering_config<{thread = [1, 1, 1, 4], workgroup = [1, 1, 4, 64]}>
3. translation_info = #iree_codegen.translation_info<LLVMGPUTileAndFuse workgroup_size = [64, 1, 1] subgroup_size = 64>
   lowering_config = #iree_gpu.lowering_config<{thread = [1, 1, 1, 1], workgroup = [1, 1, 4, 64]}>

We will refer to these as vec, taf, and taf_single, respectively, throughout this issue and the provided artifacts.

This gist provides annotated IRs (with lowering/translation info) that you can compile with the following commands to produce vmfbs for the numeric comparisons:

iree-compile -o problem_elem2_module.vmfb problem_elem2_exec.mlir --compile-from=executable-sources --iree-hal-target-backends=rocm --iree-hip-target=gfx942  --mlir-print-ir-after-all --mlir-disable-threading --print-after-all --iree-hal-dump-executable-intermediates-to=problem_elem2_module  &> problem_elem2_module_dump.mlir

# cpu reference
iree-compile -o problem_elem2_cpu.vmfb problem_elem2.mlir  --iree-hal-target-backends=llvm-cpu

Next, you can generate your own input files and the corresponding ground truth with these Python scripts.
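The scripts themselves are not inlined in the issue; the following is a minimal sketch of what they could look like (assumptions: the input file name is chosen to match the run commands below, and problem_elem2_ground_truth.npy is a name I made up).

    import numpy as np

    rng = np.random.default_rng(0)

    # Random f32 input; iree-run-module reads @file.bin inputs as raw
    # row-major bytes, which is what tofile() writes.
    x = rng.standard_normal((2, 640, 128, 128)).astype(np.float32)
    x.tofile("2x640x128x128.0.bin")

    # Ground truth for the non-splat runs, which use a splat -3.0 scale.
    scale = np.full((640,), -3.0, dtype=np.float32)
    y = (x * scale[None, :, None, None]).astype(np.float16).transpose(0, 2, 3, 1)
    np.save("problem_elem2_ground_truth.npy", np.pad(y, ((0, 0), (1, 1), (1, 1), (0, 0))))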

Then you can use these run commands:

# splat
iree-run-module --module=problem_elem2_module.vmfb --function=vec --device=hip --input=2x640x128x128xf32=10.9897868123 --input=640xf32=-5.678123 --output=@problem_elem2_vectorize_fixed.npy
iree-run-module --module=problem_elem2_module.vmfb --function=taf --device=hip --input=2x640x128x128xf32=10.9897868123 --input=640xf32=-5.678123 --output=@problem_elem2_tileandfuse_fixed.npy
iree-run-module --module=problem_elem2_module.vmfb --function=taf_single --device=hip --input=2x640x128x128xf32=10.9897868123 --input=640xf32=-5.678123  --output=@problem_elem2_tileandfuse_singlethread_fixed.npy
# non-splat
iree-run-module --module=problem_elem2_module.vmfb --function=vec  --device=hip --input=2x640x128x128xf32=@2x640x128x128.0.bin --input=640xf32=-3.0 --output=@problem_elem2_vectorize.npy
iree-run-module --module=problem_elem2_module.vmfb  --function=taf --device=hip --input=2x640x128x128xf32=@2x640x128x128.0.bin --input=640xf32=-3.0 --output=@problem_elem2_tileandfuse.npy
iree-run-module --module=problem_elem2_module.vmfb --function=taf_single  --device=hip --input=2x640x128x128xf32=@2x640x128x128.0.bin --input=640xf32=-3.0  --output=@problem_elem2_tileandfuse_singlethread.npy
# cpu
iree-run-module --module=problem_elem2_cpu.vmfb --device=local-task --input=2x640x128x128xf32=@2x640x128x128.0.bin --input=640xf32=-3.0  --output=@problem_elem2_cpu.npy

The splat runs show no issues; you can use this comparison script to see the issues with the non-splat runs.
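The comparison script is likewise not inlined; here is a minimal sketch of the kind of check involved (output file names taken from the run commands above, the ground-truth name from the earlier generation sketch):

    import numpy as np

    ref = np.load("problem_elem2_ground_truth.npy").astype(np.float32)
    for name in ("problem_elem2_vectorize.npy",
                 "problem_elem2_tileandfuse.npy",
                 "problem_elem2_tileandfuse_singlethread.npy",
                 "problem_elem2_cpu.npy"):
        out = np.load(name).astype(np.float32)
        diff = np.abs(out - ref)
        print(f"{name}: max abs diff {diff.max()}, mismatching elements {np.count_nonzero(diff)}")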

The numeric discrepancies I observed can be seen here.

nirvedhmeshram commented 2 weeks ago

Based on some initial investigation, the issue comes down to the truncation instructions we generate (see here): taf does the truncation using

  v_pk_mul_f32 v[0:1], v[20:21], v[0:1]
  s_nop 0
  v_cvt_f16_f32_e32 v12, v1
  v_cvt_f16_f32_e32 v13, v0

because it generates vector LLVM ops as input from MLIR, but the other two, which produce scalar LLVM ops, use

v_fma_mixlo_f16 v8, v8, v9, 0

Based on the numerics, the taf output matches the ground truth. In the other two, the first use of v_fma_mixlo_f16 also seems to match the ground truth, but when a second v_fma_mixlo_f16 is issued with the same destination register, we observe the numeric issue.
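To make the failure mode concrete, here is a toy model of the two possible write semantics (my own illustration, not the ISA definition): v_fma_mixlo_f16 converts the f32 result to f16 and writes it into the low 16 bits of the 32-bit destination VGPR, and the question is whether the high 16 bits are zeroed or left holding stale data.

    import numpy as np

    def f16_bits(x: float) -> int:
        # f32 -> f16 conversion, returned as the raw 16-bit pattern.
        return int(np.float16(np.float32(x)).view(np.uint16))

    def mixlo_write(dest: int, result: float, zero_hi: bool) -> int:
        # Toy mixlo write: the f16 result lands in the low half; the high
        # half is either zeroed or preserved, depending on the semantics.
        return (0 if zero_hi else (dest & 0xFFFF0000)) | f16_bits(result)

    # First use on a clean register: both semantics agree.
    assert mixlo_write(0, 2.0 * 3.0, zero_hi=True) == mixlo_write(0, 2.0 * 3.0, zero_hi=False)

    # Second use on a register still holding a full 32-bit value: the low
    # halves still match, but the high halves differ, so any consumer that
    # assumes the wrong semantics reads a corrupted packed value.
    dirty = int(np.float32(123.456).view(np.uint32))
    print(hex(mixlo_write(dirty, -3.0, zero_hi=True)))   # stale high bits cleared
    print(hex(mixlo_write(dirty, -3.0, zero_hi=False)))  # stale high bits kept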

nirvedhmeshram commented 2 weeks ago

I can confirm that the issue is with the zeroing semantics of v_fma_mixlo_f16, which is essentially acknowledged here: https://github.com/llvm/llvm-project/blob/ac0f64f06d67a93817ccd9a3c529ad40920115c9/llvm/lib/Target/AMDGPU/SIInstructions.td#L2835-L2843. Since this behavior is not stable, we can disable the use of mixed-precision fma instructions in IREE with

features += "-fma-mix-insts";

This way we instead get the following instructions,

    v_mul_f32_e32 v8, v8, v9
    v_cvt_f16_f32_e32 v8, v8

which are correct on MI300.

Ideally, the backend would generate an extra zeroing instruction after v_fma_mixlo_f16 v8, v8, v9, 0, but I don't think it is much of a performance hit to just use v_mul_f32_e32 v8, v8, v9 instead.