nod-ai / iree-amd-aie

IREE plugin repository for the AMD AIE accelerator
Apache License 2.0

`iree-codegen-iree-comprehensive-bufferize` generates `memref`s with dynamic offsets #847

Closed by makslevental 1 month ago

makslevental commented 1 month ago

https://github.com/nod-ai/iree-amd-aie/pull/845 is blocked because, at that commit of IREE, `iree-codegen-iree-comprehensive-bufferize` generates `memref`s with dynamic offsets and we hit an error here.

@MaheshRavishankar any clue what changed recently that might produce this behavior? Possibly @pashu123 might be able to give a hint (I'm seeing recent changes in git-blame...).

cc @jtuyls @yzhang93 @newling @Abhishek-Varma

The failing snippet follows; what stands out to me as odd (and a likely clue) is that each `hal.interface.binding.subspan` is now followed by a `memref.assume_alignment` on a `memref` with a dynamic offset:

func.func @mm_in_bf16_out_f32_dispatch_0_matmul_64x64x64_bf16xbf16xf32() attributes {translation_info = #iree_codegen.translation_info<Custom>} {
  %c0 = arith.constant 0 : index
  %cst = arith.constant 0.000000e+00 : f32
  %alloc = memref.alloc() : memref<1x1x8x4x8x4xbf16, 2 : i32>
  %alloc_0 = memref.alloc() : memref<1x1x4x8x4x8xbf16, 2 : i32>
  %alloc_1 = memref.alloc() : memref<1x2x32x32xbf16, 1 : i32>
  %alloc_2 = memref.alloc() : memref<2x1x32x32xbf16, 1 : i32>
  %alloc_3 = memref.alloc() : memref<2x2x8x8x4x4xf32, 2 : i32>
  %alloc_4 = memref.alloc() : memref<2x2x32x32xf32, 1 : i32>
  %0:3 = util.assume.int 
      %c0<umin = 0, umax = 0>, 
      %c0<umin = 0, umax = 0>, 
      %c0<umin = 0, umax = 0>
    : index, index, index
  %1 = hal.interface.binding.subspan layout(<bindings = [#hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, Indirect>], flags = Indirect>) binding(0) alignment(64) offset(%0#0) flags("ReadOnly|Indirect") : memref<64x64xbf16, strided<[64, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>
  memref.assume_alignment %1, 1 : memref<64x64xbf16, strided<[64, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>
  %2 = hal.interface.binding.subspan layout(<bindings = [#hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, Indirect>], flags = Indirect>) binding(1) alignment(64) offset(%0#1) flags("ReadOnly|Indirect") : memref<64x64xbf16, strided<[64, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>
  memref.assume_alignment %2, 1 : memref<64x64xbf16, strided<[64, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>
  %3 = hal.interface.binding.subspan layout(<bindings = [#hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, Indirect>], flags = Indirect>) binding(2) alignment(64) offset(%0#2) flags(Indirect) : memref<64x64xf32, strided<[64, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>
  memref.assume_alignment %3, 1 : memref<64x64xf32, strided<[64, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>
  scf.forall (%arg0, %arg1) = (0, 0) to (64, 64) step (64, 64) {
    %subview = memref.subview %1[%arg0, 0] [64, 64] [1, 1] : memref<64x64xbf16, strided<[64, 1], offset: ?>, #hal.descriptor_type<storage_buffer>> to memref<64x64xbf16, strided<[64, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>
    %subview_5 = memref.subview %2[0, %arg1] [64, 64] [1, 1] : memref<64x64xbf16, strided<[64, 1], offset: ?>, #hal.descriptor_type<storage_buffer>> to memref<64x64xbf16, strided<[64, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>
    %subview_6 = memref.subview %3[%arg0, %arg1] [64, 64] [1, 1] : memref<64x64xf32, strided<[64, 1], offset: ?>, #hal.descriptor_type<storage_buffer>> to memref<64x64xf32, strided<[64, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>
    %subview_7 = memref.subview %subview[0, 0] [64, 32] [1, 1] : memref<64x64xbf16, strided<[64, 1], offset: ?>, #hal.descriptor_type<storage_buffer>> to memref<64x32xbf16, strided<[64, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>
    iree_linalg_ext.pack %subview_7 inner_dims_pos = [0, 1] inner_tiles = [32, 32] into %alloc_2 : (memref<64x32xbf16, strided<[64, 1], offset: ?>, #hal.descriptor_type<storage_buffer>> memref<2x1x32x32xbf16, 1 : i32>)
    %subview_8 = memref.subview %subview_5[0, 0] [32, 64] [1, 1] : memref<64x64xbf16, strided<[64, 1], offset: ?>, #hal.descriptor_type<storage_buffer>> to memref<32x64xbf16, strided<[64, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>
    iree_linalg_ext.pack %subview_8 outer_dims_perm = [0, 1] inner_dims_pos = [0, 1] inner_tiles = [32, 32] into %alloc_1 : (memref<32x64xbf16, strided<[64, 1], offset: ?>, #hal.descriptor_type<storage_buffer>> memref<1x2x32x32xbf16, 1 : i32>)
    scf.forall (%arg2, %arg3) in (2, 2) {
      %subview_12 = memref.subview %alloc_2[%arg2, 0, 0, 0] [1, 1, 32, 32] [1, 1, 1, 1] : memref<2x1x32x32xbf16, 1 : i32> to memref<1x1x32x32xbf16, strided<[1024, 1024, 32, 1], offset: ?>, 1 : i32>
      iree_linalg_ext.pack %subview_12 outer_dims_perm = [0, 1, 3, 2] inner_dims_pos = [2, 3] inner_tiles = [4, 8] into %alloc_0 : (memref<1x1x32x32xbf16, strided<[1024, 1024, 32, 1], offset: ?>, 1 : i32> memref<1x1x4x8x4x8xbf16, 2 : i32>)
      %subview_13 = memref.subview %alloc_1[0, %arg3, 0, 0] [1, 1, 32, 32] [1, 1, 1, 1] : memref<1x2x32x32xbf16, 1 : i32> to memref<1x1x32x32xbf16, strided<[2048, 1024, 32, 1], offset: ?>, 1 : i32>
      iree_linalg_ext.pack %subview_13 outer_dims_perm = [0, 1, 3, 2] inner_dims_pos = [2, 3] inner_tiles = [8, 4] into %alloc : (memref<1x1x32x32xbf16, strided<[2048, 1024, 32, 1], offset: ?>, 1 : i32> memref<1x1x8x4x8x4xbf16, 2 : i32>)
      %subview_14 = memref.subview %alloc_3[%arg2, %arg3, 0, 0, 0, 0] [1, 1, 8, 8, 4, 4] [1, 1, 1, 1, 1, 1] : memref<2x2x8x8x4x4xf32, 2 : i32> to memref<1x1x8x8x4x4xf32, strided<[2048, 1024, 128, 16, 4, 1], offset: ?>, 2 : i32>
      linalg.fill ins(%cst : f32) outs(%subview_14 : memref<1x1x8x8x4x4xf32, strided<[2048, 1024, 128, 16, 4, 1], offset: ?>, 2 : i32>)
      linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d0, d2, d5, d3, d6, d8)>, affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d2, d1, d4, d5, d8, d7)>, affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d0, d1, d4, d3, d6, d7)>], iterator_types = ["parallel", "parallel", "reduction", "parallel", "parallel", "reduction", "parallel", "parallel", "reduction"]} ins(%alloc_0, %alloc : memref<1x1x4x8x4x8xbf16, 2 : i32>, memref<1x1x8x4x8x4xbf16, 2 : i32>) outs(%subview_14 : memref<1x1x8x8x4x4xf32, strided<[2048, 1024, 128, 16, 4, 1], offset: ?>, 2 : i32>) attrs =  {lowering_config = #iree_codegen.lowering_config<tile_sizes = [[64, 64], [0, 0, 1], [1, 1, 0, 0, 0, 0]]>, packing_config = #amdaie.packing_config<packing_config = [{packedSizes = [32, 32, 32], transposePackIndices = [1], unpackEmpty = [false], innerPerm = [[1, 0]], outerPerm = [[0, 1]]}, {packedSizes = [0, 0, 0, 4, 4, 8], transposePackIndices = [0, 1, 2], unpackEmpty = [false, false, true], innerPerm = [[0, 1], [1, 0], [0, 1]], outerPerm = [[0, 1, 3, 2], [0, 1, 3, 2], [0, 1, 3, 2]]}]>} {
      ^bb0(%in: bf16, %in_16: bf16, %out: f32):
        %4 = arith.extf %in : bf16 to f32
        %5 = arith.extf %in_16 : bf16 to f32
        %6 = arith.mulf %4, %5 : f32
        %7 = arith.addf %out, %6 : f32
        linalg.yield %7 : f32
      }
      %subview_15 = memref.subview %alloc_3[%arg2, %arg3, 0, 0, 0, 0] [1, 1, 8, 8, 4, 4] [1, 1, 1, 1, 1, 1] : memref<2x2x8x8x4x4xf32, 2 : i32> to memref<1x1x8x8x4x4xf32, strided<[2048, 1024, 128, 16, 4, 1], offset: ?>, 2 : i32>
      linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3, d4, d5) -> (d0, d1, d2, d3, d4, d5)>, affine_map<(d0, d1, d2, d3, d4, d5) -> (d0, d1, d2, d3, d4, d5)>], iterator_types = ["parallel", "parallel", "parallel", "parallel", "parallel", "parallel"]} ins(%subview_14 : memref<1x1x8x8x4x4xf32, strided<[2048, 1024, 128, 16, 4, 1], offset: ?>, 2 : i32>) outs(%subview_15 : memref<1x1x8x8x4x4xf32, strided<[2048, 1024, 128, 16, 4, 1], offset: ?>, 2 : i32>) {
      ^bb0(%in: f32, %out: f32):
        linalg.yield %in : f32
      }
    } {mapping = [#gpu.thread<y>, #gpu.thread<x>]}
    %subview_9 = memref.subview %subview[0, 32] [64, 32] [1, 1] : memref<64x64xbf16, strided<[64, 1], offset: ?>, #hal.descriptor_type<storage_buffer>> to memref<64x32xbf16, strided<[64, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>
    iree_linalg_ext.pack %subview_9 inner_dims_pos = [0, 1] inner_tiles = [32, 32] into %alloc_2 : (memref<64x32xbf16, strided<[64, 1], offset: ?>, #hal.descriptor_type<storage_buffer>> memref<2x1x32x32xbf16, 1 : i32>)
    %subview_10 = memref.subview %subview_5[32, 0] [32, 64] [1, 1] : memref<64x64xbf16, strided<[64, 1], offset: ?>, #hal.descriptor_type<storage_buffer>> to memref<32x64xbf16, strided<[64, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>
    iree_linalg_ext.pack %subview_10 outer_dims_perm = [0, 1] inner_dims_pos = [0, 1] inner_tiles = [32, 32] into %alloc_1 : (memref<32x64xbf16, strided<[64, 1], offset: ?>, #hal.descriptor_type<storage_buffer>> memref<1x2x32x32xbf16, 1 : i32>)
    scf.forall (%arg2, %arg3) in (2, 2) {
      %subview_12 = memref.subview %alloc_2[%arg2, 0, 0, 0] [1, 1, 32, 32] [1, 1, 1, 1] : memref<2x1x32x32xbf16, 1 : i32> to memref<1x1x32x32xbf16, strided<[1024, 1024, 32, 1], offset: ?>, 1 : i32>
      iree_linalg_ext.pack %subview_12 outer_dims_perm = [0, 1, 3, 2] inner_dims_pos = [2, 3] inner_tiles = [4, 8] into %alloc_0 : (memref<1x1x32x32xbf16, strided<[1024, 1024, 32, 1], offset: ?>, 1 : i32> memref<1x1x4x8x4x8xbf16, 2 : i32>)
      %subview_13 = memref.subview %alloc_1[0, %arg3, 0, 0] [1, 1, 32, 32] [1, 1, 1, 1] : memref<1x2x32x32xbf16, 1 : i32> to memref<1x1x32x32xbf16, strided<[2048, 1024, 32, 1], offset: ?>, 1 : i32>
      iree_linalg_ext.pack %subview_13 outer_dims_perm = [0, 1, 3, 2] inner_dims_pos = [2, 3] inner_tiles = [8, 4] into %alloc : (memref<1x1x32x32xbf16, strided<[2048, 1024, 32, 1], offset: ?>, 1 : i32> memref<1x1x8x4x8x4xbf16, 2 : i32>)
      %subview_14 = memref.subview %alloc_3[%arg2, %arg3, 0, 0, 0, 0] [1, 1, 8, 8, 4, 4] [1, 1, 1, 1, 1, 1] : memref<2x2x8x8x4x4xf32, 2 : i32> to memref<1x1x8x8x4x4xf32, strided<[2048, 1024, 128, 16, 4, 1], offset: ?>, 2 : i32>
      linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d0, d2, d5, d3, d6, d8)>, affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d2, d1, d4, d5, d8, d7)>, affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d0, d1, d4, d3, d6, d7)>], iterator_types = ["parallel", "parallel", "reduction", "parallel", "parallel", "reduction", "parallel", "parallel", "reduction"]} ins(%alloc_0, %alloc : memref<1x1x4x8x4x8xbf16, 2 : i32>, memref<1x1x8x4x8x4xbf16, 2 : i32>) outs(%subview_14 : memref<1x1x8x8x4x4xf32, strided<[2048, 1024, 128, 16, 4, 1], offset: ?>, 2 : i32>) attrs =  {lowering_config = #iree_codegen.lowering_config<tile_sizes = [[64, 64], [0, 0, 1], [1, 1, 0, 0, 0, 0]]>, packing_config = #amdaie.packing_config<packing_config = [{packedSizes = [32, 32, 32], transposePackIndices = [1], unpackEmpty = [false], innerPerm = [[1, 0]], outerPerm = [[0, 1]]}, {packedSizes = [0, 0, 0, 4, 4, 8], transposePackIndices = [0, 1, 2], unpackEmpty = [false, false, true], innerPerm = [[0, 1], [1, 0], [0, 1]], outerPerm = [[0, 1, 3, 2], [0, 1, 3, 2], [0, 1, 3, 2]]}]>} {
      ^bb0(%in: bf16, %in_18: bf16, %out: f32):
        %4 = arith.extf %in : bf16 to f32
        %5 = arith.extf %in_18 : bf16 to f32
        %6 = arith.mulf %4, %5 : f32
        %7 = arith.addf %out, %6 : f32
        linalg.yield %7 : f32
      }
      %subview_15 = memref.subview %alloc_4[%arg2, %arg3, 0, 0] [1, 1, 32, 32] [1, 1, 1, 1] : memref<2x2x32x32xf32, 1 : i32> to memref<1x1x32x32xf32, strided<[2048, 1024, 32, 1], offset: ?>, 1 : i32>
      iree_linalg_ext.unpack %subview_14 outer_dims_perm = [0, 1, 3, 2] inner_dims_pos = [2, 3] inner_tiles = [4, 4] into %subview_15 : (memref<1x1x8x8x4x4xf32, strided<[2048, 1024, 128, 16, 4, 1], offset: ?>, 2 : i32> memref<1x1x32x32xf32, strided<[2048, 1024, 32, 1], offset: ?>, 1 : i32>)
      %subview_16 = memref.subview %alloc_3[%arg2, %arg3, 0, 0, 0, 0] [1, 1, 8, 8, 4, 4] [1, 1, 1, 1, 1, 1] : memref<2x2x8x8x4x4xf32, 2 : i32> to memref<1x1x8x8x4x4xf32, strided<[2048, 1024, 128, 16, 4, 1], offset: ?>, 2 : i32>
      linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3, d4, d5) -> (d0, d1, d2, d3, d4, d5)>, affine_map<(d0, d1, d2, d3, d4, d5) -> (d0, d1, d2, d3, d4, d5)>], iterator_types = ["parallel", "parallel", "parallel", "parallel", "parallel", "parallel"]} ins(%subview_14 : memref<1x1x8x8x4x4xf32, strided<[2048, 1024, 128, 16, 4, 1], offset: ?>, 2 : i32>) outs(%subview_16 : memref<1x1x8x8x4x4xf32, strided<[2048, 1024, 128, 16, 4, 1], offset: ?>, 2 : i32>) {
      ^bb0(%in: f32, %out: f32):
        linalg.yield %in : f32
      }
      %subview_17 = memref.subview %alloc_4[%arg2, %arg3, 0, 0] [1, 1, 32, 32] [1, 1, 1, 1] : memref<2x2x32x32xf32, 1 : i32> to memref<1x1x32x32xf32, strided<[2048, 1024, 32, 1], offset: ?>, 1 : i32>
      linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>], iterator_types = ["parallel", "parallel", "parallel", "parallel"]} ins(%subview_15 : memref<1x1x32x32xf32, strided<[2048, 1024, 32, 1], offset: ?>, 1 : i32>) outs(%subview_17 : memref<1x1x32x32xf32, strided<[2048, 1024, 32, 1], offset: ?>, 1 : i32>) {
      ^bb0(%in: f32, %out: f32):
        linalg.yield %in : f32
      }
    } {mapping = [#gpu.thread<y>, #gpu.thread<x>]}
    iree_linalg_ext.unpack %alloc_4 inner_dims_pos = [0, 1] inner_tiles = [32, 32] into %subview_6 : (memref<2x2x32x32xf32, 1 : i32> memref<64x64xf32, strided<[64, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>)
    %subview_11 = memref.subview %3[%arg0, %arg1] [64, 64] [1, 1] : memref<64x64xf32, strided<[64, 1], offset: ?>, #hal.descriptor_type<storage_buffer>> to memref<64x64xf32, strided<[64, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>
    linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0, d1)>], iterator_types = ["parallel", "parallel"]} ins(%subview_6 : memref<64x64xf32, strided<[64, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>) outs(%subview_11 : memref<64x64xf32, strided<[64, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>) {
    ^bb0(%in: f32, %out: f32):
      linalg.yield %in : f32
    }
  } {mapping = [#gpu.block<y>, #gpu.block<x>]}
  linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0, d1)>], iterator_types = ["parallel", "parallel"]} ins(%3 : memref<64x64xf32, strided<[64, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>) outs(%3 : memref<64x64xf32, strided<[64, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>) {
  ^bb0(%in: f32, %out: f32):
    linalg.yield %in : f32
  }
  memref.dealloc %alloc_4 : memref<2x2x32x32xf32, 1 : i32>
  memref.dealloc %alloc_3 : memref<2x2x8x8x4x4xf32, 2 : i32>
  memref.dealloc %alloc_2 : memref<2x1x32x32xbf16, 1 : i32>
  memref.dealloc %alloc_1 : memref<1x2x32x32xbf16, 1 : i32>
  memref.dealloc %alloc_0 : memref<1x1x4x8x4x8xbf16, 2 : i32>
  memref.dealloc %alloc : memref<1x1x8x4x8x4xbf16, 2 : i32>
  return
}
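For intuition on why these offsets should not need to stay dynamic: each one is fed by a `util.assume.int` result whose range is `umin = 0, umax = 0`, i.e. the value is provably a single constant, so a range-analysis fold should be able to rewrite `offset: ?` into a static `offset: 0`. A minimal Python sketch of that folding rule (the class and function names here are hypothetical illustrations, not IREE's actual API):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class IntRange:
    """Models the [umin, umax] fact attached by util.assume.int."""
    umin: int
    umax: int

def fold_assumed_int(r: IntRange) -> Optional[int]:
    """If the assumed range pins the value to a single integer,
    return that constant; otherwise the value stays dynamic."""
    return r.umin if r.umin == r.umax else None

def memref_offset(r: IntRange) -> str:
    """Render the offset part of a strided<..., offset: X> layout."""
    c = fold_assumed_int(r)
    return "?" if c is None else str(c)

# The three subspan offsets in the snippet all carry umin = umax = 0,
# so they should fold to a static offset of 0 rather than 'offset: ?'.
assert memref_offset(IntRange(0, 0)) == "0"
# A genuinely unknown offset stays dynamic.
assert memref_offset(IntRange(0, 4096)) == "?"
```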
pashu123 commented 1 month ago

I've made a change to duplicate empty tensor ops here: https://github.com/iree-org/iree/blob/05bbcf1385146d075829cd940a52bf06961614d0/compiler/src/iree/compiler/Codegen/Common/IREEComprehensiveBufferizePass.cpp#L177 Since we are not using destination-passing style as a preprocessing step for distribute-using-forall, we had to make that decision. If your pipeline uses the convert-to-destination-passing-style pass, it shouldn't make a difference. @MaheshRavishankar, do you think the error might be caused by this change?

yzhang93 commented 1 month ago

> I've made a change to duplicate Empty tensor ops here: https://github.com/iree-org/iree/blob/05bbcf1385146d075829cd940a52bf06961614d0/compiler/src/iree/compiler/Codegen/Common/IREEComprehensiveBufferizePass.cpp#L177 Since, we are not using the destination passing style as a preprocessing for distribute-using-for-all we had to make that decision. If your pipeline uses convert-to-destination passing style pass, then it shouldn't make a difference. @MaheshRavishankar, do you think the error might be caused by the change mentioned?

No, I don't think the error is caused by your change.

The reason, as @makslevental mentioned, is this:

%0:3 = util.assume.int 
      %c0<umin = 0, umax = 0>, 
      %c0<umin = 0, umax = 0>, 
      %c0<umin = 0, umax = 0>
    : index, index, index
  %1 = hal.interface.binding.subspan layout(<bindings = [#hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, Indirect>], flags = Indirect>) binding(0) alignment(64) offset(%0#0) flags("ReadOnly|Indirect") : !flow.dispatch.tensor<readonly:tensor<128x128xi32>>
  %2 = hal.interface.binding.subspan layout(<bindings = [#hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, Indirect>], flags = Indirect>) binding(1) alignment(64) offset(%0#1) flags("ReadOnly|Indirect") : !flow.dispatch.tensor<readonly:tensor<128x128xi32>>
  %3 = hal.interface.binding.subspan layout(<bindings = [#hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, Indirect>], flags = Indirect>) binding(2) alignment(64) offset(%0#2) flags(Indirect) : !flow.dispatch.tensor<writeonly:tensor<128x128xi32>>

It generates memref.assume_alignment with a dynamic offset after bufferization.

I don't know how to get rid of the dynamic offsets, but if we remove this check for now, we can proceed without problems.
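The check being referred to essentially rejects any strided `memref` layout whose offset is not a compile-time constant. A toy Python version of that predicate, assuming the layout is represented as a type string like the ones in the IR above (the real check operates on MLIR types in C++, not strings):

```python
import re

def has_dynamic_offset(memref_type: str) -> bool:
    """Return True if a strided memref type string carries 'offset: ?',
    i.e. an offset that is unknown at compile time."""
    m = re.search(r"offset:\s*([?\d]+)", memref_type)
    return bool(m) and m.group(1) == "?"

dyn = "memref<64x64xbf16, strided<[64, 1], offset: ?>>"
static = "memref<64x64xbf16, strided<[64, 1], offset: 0>>"
assert has_dynamic_offset(dyn)       # would be rejected by the check
assert not has_dynamic_offset(static)  # passes the check
```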

MaheshRavishankar commented 1 month ago

Maybe you just need to drop those hints using this pass https://github.com/MaheshRavishankar/iree/blob/6950dc0a5a2e6af2d8ba18e323534df72df984ad/compiler/src/iree/compiler/Codegen/LLVMCPU/Passes.cpp#L822
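A hint-dropping pass of that kind essentially replaces every use of a `util.assume.int` result with the op's underlying operand and erases the op, after which ordinary folding sees the `arith.constant 0` directly. A toy Python model of that rewrite (the op/value representation here is a made-up stand-in, not MLIR's C++ API):

```python
from dataclasses import dataclass

@dataclass
class Value:
    name: str

@dataclass
class Op:
    kind: str
    operands: list
    results: list

def drop_int_assumptions(ops):
    """Replace each result of a util.assume.int op with the matching
    operand, then erase the op — mirroring what a hint-dropping pass does."""
    replacements = {}
    kept = []
    for op in ops:
        if op.kind == "util.assume.int":
            for res, operand in zip(op.results, op.operands):
                replacements[res.name] = operand
        else:
            op.operands = [replacements.get(v.name, v) for v in op.operands]
            kept.append(op)
    return kept

# Shape of the failing snippet: %0:3 = util.assume.int %c0, %c0, %c0 ...
c0 = Value("%c0")
assume = Op("util.assume.int", [c0, c0, c0],
            [Value("%0#0"), Value("%0#1"), Value("%0#2")])
subspan = Op("hal.interface.binding.subspan", [Value("%0#0")], [Value("%1")])
ops = drop_int_assumptions([assume, subspan])
# The subspan now consumes %c0 directly, so its offset is a known constant.
assert ops[0].operands[0] is c0
```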

yzhang93 commented 1 month ago

I think Stella's optimization PRs from yesterday solved the problem; my local build with the new IREE bump works. I'll update the branch later after fixing some other conflicts.

makslevental commented 1 month ago

> I think Stella's optimization PRs from yesterday solved the problem, my local build with new iree bump works. I'll update the branch later after fixing some other conflicts.

that's like two wrongs make a right lol. cool.

MaheshRavishankar commented 1 month ago

> > I think Stella's optimization PRs from yesterday solved the problem, my local build with new iree bump works. I'll update the branch later after fixing some other conflicts.
>
> that's like two wrongs make a right lol. cool.

Hey maybe this is two rights!!

makslevental commented 1 month ago

Fixed by https://github.com/nod-ai/iree-amd-aie/pull/845