nod-ai / iree-amd-aie

IREE plugin repository for the AMD AIE accelerator
Apache License 2.0
69 stars 30 forks

[Convolution] Packing + objectFifo, initial support #789

Closed newling closed 1 month ago

newling commented 2 months ago

This PR switches all numerical convolution tests to use the objectFifo pipeline. With respect to the new tiling strategy:

1) A single column is currently used. Targeting multiple columns results in the error `'aie.memtile_dma' op could not find and assign a valid BD id`. This will be investigated as follow-up work: https://github.com/nod-ai/iree-amd-aie/issues/821

2) There is no longer interleaving of compute and L2->L1 data movement, which means https://github.com/nod-ai/iree-amd-aie/issues/619 becomes low priority / obsolete

3) L3->L2 and L2->L3 still use padding, but L2->L1 and L1->L2 use packing.

4) Channel-first convolution is completely unsupported; we expect high-level transforms to convert to channel-last before reaching our backend.

5) Vectorization is not currently enabled, due to issues with alignment. See follow-up task https://github.com/nod-ai/iree-amd-aie/issues/820. This is functionally ok for now, as peano can scalarize code for all data types.
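Regarding point 4, the channel-first to channel-last conversion expected of the high-level transforms is a layout transpose. A minimal NumPy sketch of the idea (shapes and names here are illustrative, not this backend's actual API):

```python
import numpy as np

# Hypothetical NCHW (channel-first) input: batch=1, channels=3, 8x8 spatial.
x_nchw = np.arange(1 * 3 * 8 * 8, dtype=np.float32).reshape(1, 3, 8, 8)

# Channel-last (NHWC) layout: move the channel dimension to the end.
x_nhwc = np.transpose(x_nchw, (0, 2, 3, 1))

# Same elements, different layout: x_nchw[n, c, h, w] == x_nhwc[n, h, w, c].
assert x_nhwc.shape == (1, 8, 8, 3)
assert x_nchw[0, 2, 4, 5] == x_nhwc[0, 4, 5, 2]
```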

yzhang93 commented 2 months ago

I'll have a thorough review tomorrow. Just a high level question, does this new strategy also work with AIR pipeline?

Thanks. No, it doesn't get through compilation with the AIR pipeline.

newling commented 1 month ago

@erwei-xilinx @yzhang93 I am making changes to the tiling strategy today, so please don't review this in detail yet. I will post IR once I have channel-last convolution vectorizing correctly.

newling commented 1 month ago

Reporting current status with current code checked in, mostly to keep track for myself.

The vectorization looks good, and compilation to vmfb succeeds: vector.contract lowers to aievec.matmul, with no discontiguous transfer_reads or transfer_writes. Good.

But the result is numerically incorrect. The values are all quite close to the expected ones (no zeros).

If I make the input tensor all 1s (kernel still random), the result is numerically correct.

If I comment out the AMDAIEVectorization pass, the result is numerically correct (with both input tensor and kernel random).

The AIEVec passes look like they're doing the correct thing; I can't see any problem with aievec.matmul or the lowering to LLVM.

So what's going on? Is this a problem with peano? What experiments can I run to test this hypothesis?

Or is the problem somewhere else, and just interacting badly with AMDAIEVectorization?
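One host-side experiment along these lines (a sketch only; `to_bf16` and the shapes are illustrative, not code from this repo): emulate bf16 operands in NumPy and compare an f32-accumulating reference against a variant that truncates the accumulator to bf16 at every step. If the device output matched the truncating variant, that would point at accumulator precision in the lowering rather than at peano miscompiling scalar code, and would also explain why all-ones inputs come out correct (their partial sums are exactly representable in bf16):

```python
import numpy as np

def to_bf16(a: np.ndarray) -> np.ndarray:
    """Truncate f32 values to bf16 precision (keep the top 16 bits of each f32)."""
    u = np.asarray(a, dtype=np.float32).view(np.uint32)
    return (u & np.uint32(0xFFFF0000)).view(np.float32)

rng = np.random.default_rng(0)
lhs = to_bf16(rng.standard_normal((4, 8), dtype=np.float32))
rhs = to_bf16(rng.standard_normal((8, 4), dtype=np.float32))

# Reference: bf16 inputs, f32 accumulation (what the scalar path computes).
ref = lhs @ rhs

# Suspect behavior: accumulator truncated to bf16 after every multiply-add.
acc = np.zeros((4, 4), dtype=np.float32)
for k in range(8):
    acc = to_bf16(acc + np.outer(lhs[:, k], rhs[k, :]))

# With random data the two are close but generally not identical.
print(np.max(np.abs(ref - acc)))
```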

makslevental commented 1 month ago

> So what's going on? Is this a problem with peano? What experiments can I run to test this hypothesis?

It's possible but I feel like it's unlikely - they run a large test suite (somewhere) nightly which (I'm assuming) checks NA.

newling commented 1 month ago

> So what's going on? Is this a problem with peano? What experiments can I run to test this hypothesis?

> It's possible but I feel like it's unlikely - they run a large test suite (somewhere) nightly which (I'm assuming) checks NA.

Ok, the difference does appear to come from vectorization or below. If I replace this:

scf.for %arg4 = %c0 to %c3 step %c1 {
  scf.for %arg5 = %c0 to %c3 step %c1 {
    %collapse_shape = memref.collapse_shape %reinterpret_cast [[0, 1, 2, 3], [4]] : memref<1x1x4x1x4xf32, 2 : i32> into memref<4x4xf32, 2 : i32>
    %25 = vector.transfer_read %reinterpret_cast_8[%c0, %arg4, %c0, %arg5, %c0], %cst {in_bounds = [true, true]} : memref<1x3x1x6x8xbf16, 2 : i32>, vector<4x8xbf16>
    %26 = vector.transfer_read %reinterpret_cast_9[%arg4, %arg5, %c0, %c0, %c0, %c0], %cst {in_bounds = [true, true]} : memref<3x3x1x1x8x4xbf16, 2 : i32>, vector<8x4xbf16>
    %27 = vector.transfer_read %collapse_shape[%c0, %c0], %cst_0 {in_bounds = [true, true]} : memref<4x4xf32, 2 : i32>, vector<4x4xf32>
    %28 = arith.extf %25 : vector<4x8xbf16> to vector<4x8xf32>
    %29 = arith.extf %26 : vector<8x4xbf16> to vector<8x4xf32>
    %30 = vector.contract {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d2)>, affine_map<(d0, d1, d2) -> (d2, d1)>, affine_map<(d0, d1, d2) -> (d0, d1)>], iterator_types = ["parallel", "parallel", "reduction"], kind = #vector.kind<add>} %28, %29, %27 : vector<4x8xf32>, vector<8x4xf32> into vector<4x4xf32>
    vector.transfer_write %30, %collapse_shape[%c0, %c0] {in_bounds = [true, true]} : vector<4x4xf32>, memref<4x4xf32, 2 : i32>
  }
}

with this:

scf.for %arg4 = %c0 to %c3 step %c1 {
  scf.for %arg5 = %c0 to %c3 step %c1 {
    %subview = memref.subview %reinterpret_cast_7[0, %arg4, 0, %arg5, 0] [1, 1, 1, 4, 8] [1, 1, 1, 1, 1] : memref<1x3x1x6x8xbf16, 2 : i32> to memref<1x1x1x4x8xbf16, strided<[144, 48, 48, 8, 1], offset: ?>, 2 : i32>
    %collapse_shape = memref.collapse_shape %subview [[0, 1, 2, 3], [4]] : memref<1x1x1x4x8xbf16, strided<[144, 48, 48, 8, 1], offset: ?>, 2 : i32> into memref<4x8xbf16, strided<[8, 1], offset: ?>, 2 : i32>
    %subview_9 = memref.subview %reinterpret_cast_8[%arg4, %arg5, 0, 0, 0, 0] [1, 1, 1, 1, 8, 4] [1, 1, 1, 1, 1, 1] : memref<3x3x1x1x8x4xbf16, 2 : i32> to memref<1x1x1x1x8x4xbf16, strided<[96, 32, 32, 32, 4, 1], offset: ?>, 2 : i32>
    %collapse_shape_10 = memref.collapse_shape %subview_9 [[0, 1, 2, 3, 4], [5]] : memref<1x1x1x1x8x4xbf16, strided<[96, 32, 32, 32, 4, 1], offset: ?>, 2 : i32> into memref<8x4xbf16, strided<[4, 1], offset: ?>, 2 : i32>
    %collapse_shape_11 = memref.collapse_shape %reinterpret_cast [[0, 1, 2, 3], [4]] : memref<1x1x4x1x4xf32, 2 : i32> into memref<4x4xf32, 2 : i32>
    linalg.generic {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d2)>, affine_map<(d0, d1, d2) -> (d2, d1)>, affine_map<(d0, d1, d2) -> (d0, d1)>], iterator_types = ["parallel", "parallel", "reduction"]} ins(%collapse_shape, %collapse_shape_10 : memref<4x8xbf16, strided<[8, 1], offset: ?>, 2 : i32>, memref<8x4xbf16, strided<[4, 1], offset: ?>, 2 : i32>) outs(%collapse_shape_11 : memref<4x4xf32, 2 : i32>) {
    ^bb0(%in: bf16, %in_12: bf16, %out: f32):
      %25 = arith.extf %in : bf16 to f32
      %26 = arith.extf %in_12 : bf16 to f32
      %27 = arith.mulf %25, %26 : f32
      %28 = arith.addf %out, %27 : f32
      linalg.yield %28 : f32
    }
  }
}

And the numerics are fixed. Maybe it's time for me to try the simulator you mentioned @makslevental.
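For reference, both forms above compute the same thing: a 4x8 by 8x4 matmul with bf16 operands widened to f32 (`arith.extf`) and accumulated into the 4x4 f32 tile. A minimal NumPy sketch of the shared semantics (operand values illustrative, matching the all-ones experiment mentioned earlier):

```python
import numpy as np

# Illustrative stand-ins for the transfer_read operands.
lhs = np.ones((4, 8), dtype=np.float32)   # vector<4x8xbf16> widened via arith.extf
rhs = np.ones((8, 4), dtype=np.float32)   # vector<8x4xbf16> widened via arith.extf
acc = np.zeros((4, 4), dtype=np.float32)  # vector<4x4xf32> accumulator tile

# vector.contract with maps (d0,d2) x (d2,d1) -> (d0,d1) and kind<add> is a
# plain matmul-accumulate; the linalg.generic body computes the same result
# one scalar multiply-add at a time.
acc = acc + lhs @ rhs

# With all-ones operands every entry equals the reduction length, 8.0.
assert np.all(acc == 8.0)
```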

yzhang93 commented 1 month ago

@newling Can we have a full review of the current state of convolution ops in the next AIE sync? CC @MaheshRavishankar

makslevental commented 1 month ago

> And the numerics are fixed. Maybe it's time for me to try the simulator you mentioned @makslevental.

I made progress on this yesterday, but I'm still blocked by an issue with MEMTAB in the ELF file. I'm hoping to land the PR (along with the new test suite) this week, before I go back to HAL stuff next week.

newling commented 1 month ago

> And the numerics are fixed. Maybe it's time for me to try the simulator you mentioned @makslevental.

> I made progress on this yesterday, but I'm still blocked by an issue with MEMTAB in the ELF file. I'm hoping to land the PR (along with the new test suite) this week, before I go back to HAL stuff next week.

Ok, thanks, I'll keep an eye out for that.

newling commented 1 month ago

@yzhang93 this is ready for review again if you'd like to take a look