nod-ai / iree-amd-aie

IREE plugin repository for the AMD AIE accelerator
Apache License 2.0

Numerics issue with vectorized conv2d #820

Closed newling closed 2 weeks ago

newling commented 1 month ago

Enabling vectorization (see https://github.com/nod-ai/iree-amd-aie/pull/789) for convolution results in numerical failure. The values are only slightly off, but they are definitely incorrect: this is not a floating-point rounding issue. Experiments with the input values suggest that the input image data (i.e. not the kernel data) is being read incorrectly, with an incorrect offset inside the scf.for loop (not confirmed).
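One way to test the incorrect-offset hypothesis (a sketch, not from the original report; the NCHW shapes below are guesses based on the 8-input-channel / 4-output-channel / 3x3 tiles visible in the IR further down) is to use an input whose values encode their own positions, so a mis-offset read shows up as a recognizable shift in the output error:

```python
import numpy as np

def conv2d_reference(image, kernel):
    """Plain NCHW conv2d, stride 1, no padding, to diff device output against."""
    n, c, h, w = image.shape
    f, _, kh, kw = kernel.shape
    out = np.zeros((n, f, h - kh + 1, w - kw + 1), dtype=np.float32)
    for i in range(out.shape[2]):
        for j in range(out.shape[3]):
            patch = image[:, :, i:i + kh, j:j + kw]
            out[:, :, i, j] = np.einsum("nchw,fchw->nf", patch, kernel)
    return out

# A "ramp" image makes every element's value identify its location, so an
# incorrect read offset shows up as a shifted, structured error pattern.
image = np.arange(1 * 8 * 6 * 8, dtype=np.float32).reshape(1, 8, 6, 8)
kernel = np.zeros((4, 8, 3, 3), dtype=np.float32)
kernel[:, :, 1, 1] = 1.0  # near-identity kernel isolates the image reads
expected = conv2d_reference(image, kernel)
# np.argwhere(~np.isclose(device_out, expected)) then localizes the bad reads.
```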

Some things I've tried:

- Setting the LLVM optimization level to 0 (the default is 2) has no effect.
- Inverting the scf.for loop order has no effect.
- Using different versions of peano has no effect.

The task is to find the source of the error and, if it is in peano, create a reproducer for that team.

Attached are the .ll and .opt.ll files (vectorized and unvectorized): ll_files.zip. They're quite small; vectorized_input.opt.ll is only 94 lines.

The MLIR IR looks basically identical except for the inner loop.

```mlir
// With vectorization:
scf.for %arg1 = %c0 to %c3 step %c1 {
  scf.for %arg2 = %c0 to %c3 step %c1 {
    scf.for %arg3 = %c0 to %c4 step %c1 {
      %0 = vector.transfer_read %reinterpret_cast_21[%c0, %arg1, %arg3, %arg2, %c0], %cst {in_bounds = [true, true]} : memref<1x3x4x6x8xbf16>, vector<4x8xbf16>
      %1 = vector.transfer_read %reinterpret_cast_22[%arg1, %arg2, %arg3, %c0, %c0, %c0], %cst {in_bounds = [true, true]} : memref<3x3x4x1x8x4xbf16>, vector<8x4xbf16>
      %2 = vector.transfer_read %reinterpret_cast[%c0, %c0, %c0, %c0, %c0], %cst_20 {in_bounds = [true, true], permutation_map = affine_map<(d0, d1, d2, d3, d4) -> (d2, d4)>} : memref<1x1x4x1x4xf32>, vector<4x4xf32>
      %3 = arith.extf %0 : vector<4x8xbf16> to vector<4x8xf32>
      %4 = arith.extf %1 : vector<8x4xbf16> to vector<8x4xf32>
      %5 = vector.contract {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d2)>, affine_map<(d0, d1, d2) -> (d2, d1)>, affine_map<(d0, d1, d2) -> (d0, d1)>], iterator_types = ["parallel", "parallel", "reduction"], kind = #vector.kind<add>} %3, %4, %2 : vector<4x8xf32>, vector<8x4xf32> into vector<4x4xf32>
      vector.transfer_write %5, %reinterpret_cast[%c0, %c0, %c0, %c0, %c0] {in_bounds = [true, true], permutation_map = affine_map<(d0, d1, d2, d3, d4) -> (d2, d4)>} : vector<4x4xf32>, memref<1x1x4x1x4xf32>
    }
  }
}
```
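As an aside, the permutation_map on the accumulator transfer_read/write selects dims d2 and d4 of the 5-D buffer, i.e. the 1x1x4x1x4 accumulator is accessed as a plain 4x4 tile. A rough NumPy analogy (variable names hypothetical):

```python
import numpy as np

acc_buf = np.zeros((1, 1, 4, 1, 4), dtype=np.float32)  # memref<1x1x4x1x4xf32>
# permutation_map (d0,...,d4) -> (d2, d4): index the unit dims at 0 and keep
# dims 2 and 4, yielding the vector<4x4xf32> accumulator tile.
acc_tile = acc_buf[0, 0, :, 0, :]
assert acc_tile.shape == (4, 4)
```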

```mlir
// Without vectorization:
scf.for %arg1 = %c0 to %c3 step %c1 {
  scf.for %arg2 = %c0 to %c3 step %c1 {
    scf.for %arg3 = %c0 to %c4 step %c1 {
      %subview = memref.subview %reinterpret_cast_20[%arg1, %arg2, %arg3, 0, 0, 0] [1, 1, 1, 1, 8, 4] [1, 1, 1, 1, 1, 1] : memref<3x3x4x1x8x4xbf16> to memref<8x4xbf16, strided<[4, 1], offset: ?>>
      %subview_25 = memref.subview %reinterpret_cast[0, 0, 0, 0, 0] [1, 1, 4, 1, 4] [1, 1, 1, 1, 1] : memref<1x1x4x1x4xf32> to memref<4x4xf32, strided<[4, 1]>>
      %subview_26 = memref.subview %reinterpret_cast_21[0, %arg1, %arg3, %arg2, 0] [1, 1, 1, 4, 8] [1, 1, 1, 1, 1] : memref<1x3x4x6x8xbf16> to memref<4x8xbf16, strided<[8, 1], offset: ?>>
      scf.for %arg4 = %c0 to %c4 step %c1 {
        scf.for %arg5 = %c0 to %c4 step %c1 {
          scf.for %arg6 = %c0 to %c8 step %c1 {
            %0 = memref.load %subview_26[%arg4, %arg6] : memref<4x8xbf16, strided<[8, 1], offset: ?>>
            %1 = memref.load %subview[%arg6, %arg5] : memref<8x4xbf16, strided<[4, 1], offset: ?>>
            %2 = memref.load %subview_25[%arg4, %arg5] : memref<4x4xf32, strided<[4, 1]>>
            %3 = arith.extf %0 : bf16 to f32
            %4 = arith.extf %1 : bf16 to f32
            %5 = arith.mulf %3, %4 : f32
            %6 = arith.addf %2, %5 : f32
            memref.store %6, %subview_25[%arg4, %arg5] : memref<4x4xf32, strided<[4, 1]>>
          }
        }
      }
    }
  }
}
```
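Both loop nests compute the same thing per (%arg1, %arg2, %arg3) step: a 4x8 by 8x4 matmul accumulated into the 4x4 f32 tile. A minimal NumPy model of one step (shapes taken from the IR above, variable names hypothetical):

```python
import numpy as np

# One (%arg1, %arg2, %arg3) iteration: A is the 4x8 image tile, B the 8x4
# kernel tile (both extf'd from bf16 to f32), C the 4x4 f32 accumulator.
A = np.random.rand(4, 8).astype(np.float32)
B = np.random.rand(8, 4).astype(np.float32)
C = np.random.rand(4, 4).astype(np.float32)

# Scalar loop nest (unvectorized IR): load, extf, mulf, addf, store.
C_scalar = C.copy()
for i in range(4):
    for j in range(4):
        for k in range(8):
            C_scalar[i, j] += A[i, k] * B[k, j]

# vector.contract (vectorized IR): the same contraction in one shot.
C_vector = C + A @ B

assert np.allclose(C_scalar, C_vector)
```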

In the vectorized case, the vector.contract gets lowered to an aievec.matmul, which in turn gets lowered to:
 %38 = "xllvm.intr.aie2.bf.mac16.conf"(%33, %34, %37, %36) : (vector<32xbf16>, vector<32xbf16>, vector<8xi64>, i32) -> vector<8xi64> 
 ``
newling commented 1 month ago

I've narrowed it down to the alignment of the loads: if the alignments are set/forced sufficiently low, then the numerics with vectorization are fine. I am a bit perplexed why the alignment needs to be as low as it does; consider the two IRs in the zip file: opts.zip

They are basically identical, except for some loads which have alignment 4 in one file and alignment 2 in the other. The case with alignment 2 gives numerically correct results; the one with alignment 4 does not. What I find confusing is that alignment 4 is surely enough here: none of the strides in any of the loads is less than 8.
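To make the "alignment 4 is surely enough" claim concrete, here is a quick sanity check (a sketch; the element strides are derived from the memref shapes in the IR above) that every bf16 load base address produced by the loop nest is a multiple of 16 bytes, comfortably satisfying align 4:

```python
# Element offsets of the two bf16 transfer_reads, per loop iteration.
# Image:  memref<1x3x4x6x8xbf16>,   indices [0, arg1, arg3, arg2, 0]
# Kernel: memref<3x3x4x1x8x4xbf16>, indices [arg1, arg2, arg3, 0, 0, 0]
for arg1 in range(3):
    for arg2 in range(3):
        for arg3 in range(4):
            img_off = arg1 * 4 * 6 * 8 + arg3 * 6 * 8 + arg2 * 8      # elements
            ker_off = arg1 * 3 * 4 * 8 * 4 + arg2 * 4 * 8 * 4 + arg3 * 8 * 4
            for off in (img_off, ker_off):
                assert (off * 2) % 16 == 0  # bf16 = 2 bytes: 16B-aligned

print("every load base address is at least 16-byte aligned")
```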

newling commented 1 month ago

Running the above two IRs through Compiler Explorer:

http://xsjsda132:10240/z/6qc4cc

align 2: with_align_2.txt
align 4: with_align_4.txt

makslevental commented 2 weeks ago

congrats @newling for tracking this all the way down