Closed: newling closed this 1 month ago
I'll have a thorough review tomorrow. Just a high level question, does this new strategy also work with AIR pipeline?
Thanks. No, it doesn't get through compilation with the AIR pipeline.
@erwei-xilinx @yzhang93 I am making changes to the tiling strategy today, so please don't review this in detail yet. I will post IR once I have channel-last vectorization working correctly.
Reporting the current status with the code checked in so far, mostly to keep track for myself.
The vectorization looks good and compilation to vmfb is successful: `vector.contract` lowers to `aievec.matmul`, with no discontiguous `transfer_read`s or `transfer_write`s. Good.
But the results are numerically incorrect. The values are all quite close to the expected ones (no zeros).
If I make the input tensor all 1s (but keep the kernel's random values), the result is numerically correct.
If I comment out the `AMDAIEVectorization` pass, the result is numerically correct (with both input tensor and kernel having random values).
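For what it's worth, the all-1s experiment narrows things down in a specific way: with an all-ones input, `(A @ B)[i, j] = sum_k B[k, j]` no matter how A's elements are indexed, so a bad access pattern on the input operand can go completely undetected. A minimal NumPy sketch of this effect (the scrambled-read "bug" here is hypothetical, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((8, 4)).astype(np.float32)

A_ones = np.ones((4, 8), dtype=np.float32)
A_rand = rng.standard_normal((4, 8)).astype(np.float32)

def matmul_bad_layout(A, B):
    # Hypothetical bug: reads A in a scrambled element order (as a
    # vectorization pass with a wrong access pattern might).
    A_wrong = A.T.reshape(A.shape)
    C = np.zeros((A.shape[0], B.shape[1]), dtype=np.float32)
    for i in range(A.shape[0]):
        for j in range(B.shape[1]):
            for k in range(A.shape[1]):
                C[i, j] += A_wrong[i, k] * B[k, j]
    return C

# With an all-ones input, the scrambled read is invisible:
assert np.allclose(matmul_bad_layout(A_ones, B), A_ones @ B)
# With a random input, it is not:
assert not np.allclose(matmul_bad_layout(A_rand, B), A_rand @ B)
```

So "correct with all-1s input, wrong with random input" is consistent with a bad read pattern on the input operand rather than on the kernel or accumulator.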
The AIEVec passes look like they're doing the correct thing; I can't see any problem with `aievec.matmul` or its lowering to LLVM.
So what's going on, is this a problem with peano? What experiments can I run to test this hypothesis?
Or is the problem somewhere else, and it's just having a bad interaction with AMDAIEVectorization?
> So what's going on, is this a problem with peano? What experiments can I run to test this hypothesis?

It's possible, but I feel like it's unlikely: they run a large test suite (somewhere) nightly which (I'm assuming) checks NA.
Ok. The difference does appear to be at vectorization or below. Replace this:
```mlir
scf.for %arg4 = %c0 to %c3 step %c1 {
  scf.for %arg5 = %c0 to %c3 step %c1 {
    %collapse_shape = memref.collapse_shape %reinterpret_cast [[0, 1, 2, 3], [4]] : memref<1x1x4x1x4xf32, 2 : i32> into memref<4x4xf32, 2 : i32>
    %25 = vector.transfer_read %reinterpret_cast_8[%c0, %arg4, %c0, %arg5, %c0], %cst {in_bounds = [true, true]} : memref<1x3x1x6x8xbf16, 2 : i32>, vector<4x8xbf16>
    %26 = vector.transfer_read %reinterpret_cast_9[%arg4, %arg5, %c0, %c0, %c0, %c0], %cst {in_bounds = [true, true]} : memref<3x3x1x1x8x4xbf16, 2 : i32>, vector<8x4xbf16>
    %27 = vector.transfer_read %collapse_shape[%c0, %c0], %cst_0 {in_bounds = [true, true]} : memref<4x4xf32, 2 : i32>, vector<4x4xf32>
    %28 = arith.extf %25 : vector<4x8xbf16> to vector<4x8xf32>
    %29 = arith.extf %26 : vector<8x4xbf16> to vector<8x4xf32>
    %30 = vector.contract {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d2)>, affine_map<(d0, d1, d2) -> (d2, d1)>, affine_map<(d0, d1, d2) -> (d0, d1)>], iterator_types = ["parallel", "parallel", "reduction"], kind = #vector.kind<add>} %28, %29, %27 : vector<4x8xf32>, vector<8x4xf32> into vector<4x4xf32>
    vector.transfer_write %30, %collapse_shape[%c0, %c0] {in_bounds = [true, true]} : vector<4x4xf32>, memref<4x4xf32, 2 : i32>
  }
}
```
with this:
```mlir
scf.for %arg4 = %c0 to %c3 step %c1 {
  scf.for %arg5 = %c0 to %c3 step %c1 {
    %subview = memref.subview %reinterpret_cast_7[0, %arg4, 0, %arg5, 0] [1, 1, 1, 4, 8] [1, 1, 1, 1, 1] : memref<1x3x1x6x8xbf16, 2 : i32> to memref<1x1x1x4x8xbf16, strided<[144, 48, 48, 8, 1], offset: ?>, 2 : i32>
    %collapse_shape = memref.collapse_shape %subview [[0, 1, 2, 3], [4]] : memref<1x1x1x4x8xbf16, strided<[144, 48, 48, 8, 1], offset: ?>, 2 : i32> into memref<4x8xbf16, strided<[8, 1], offset: ?>, 2 : i32>
    %subview_9 = memref.subview %reinterpret_cast_8[%arg4, %arg5, 0, 0, 0, 0] [1, 1, 1, 1, 8, 4] [1, 1, 1, 1, 1, 1] : memref<3x3x1x1x8x4xbf16, 2 : i32> to memref<1x1x1x1x8x4xbf16, strided<[96, 32, 32, 32, 4, 1], offset: ?>, 2 : i32>
    %collapse_shape_10 = memref.collapse_shape %subview_9 [[0, 1, 2, 3, 4], [5]] : memref<1x1x1x1x8x4xbf16, strided<[96, 32, 32, 32, 4, 1], offset: ?>, 2 : i32> into memref<8x4xbf16, strided<[4, 1], offset: ?>, 2 : i32>
    %collapse_shape_11 = memref.collapse_shape %reinterpret_cast [[0, 1, 2, 3], [4]] : memref<1x1x4x1x4xf32, 2 : i32> into memref<4x4xf32, 2 : i32>
    linalg.generic {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d2)>, affine_map<(d0, d1, d2) -> (d2, d1)>, affine_map<(d0, d1, d2) -> (d0, d1)>], iterator_types = ["parallel", "parallel", "reduction"]} ins(%collapse_shape, %collapse_shape_10 : memref<4x8xbf16, strided<[8, 1], offset: ?>, 2 : i32>, memref<8x4xbf16, strided<[4, 1], offset: ?>, 2 : i32>) outs(%collapse_shape_11 : memref<4x4xf32, 2 : i32>) {
    ^bb0(%in: bf16, %in_12: bf16, %out: f32):
      %25 = arith.extf %in : bf16 to f32
      %26 = arith.extf %in_12 : bf16 to f32
      %27 = arith.mulf %25, %26 : f32
      %28 = arith.addf %out, %27 : f32
      linalg.yield %28 : f32
    }
  }
}
```
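As a sanity check on semantics: the `vector.contract` above and the scalar `linalg.generic` fallback describe the same accumulation, `C[i, j] += sum_k extf(A[i, k]) * extf(B[k, j])`, so any numeric divergence between the two must come from the lowering, not from the IR's meaning. A rough NumPy sketch of that equivalence (f32 inputs stand in for the `extf`'d bf16 values, since NumPy has no bf16 type):

```python
import numpy as np

rng = np.random.default_rng(1)
# Stand-ins for the 4x8 and 8x4 bf16 operands and the 4x4 f32 accumulator.
A = rng.standard_normal((4, 8)).astype(np.float32)
B = rng.standard_normal((8, 4)).astype(np.float32)
C0 = rng.standard_normal((4, 4)).astype(np.float32)

# vector.contract form: one accumulating matmul, C = C0 + A @ B.
C_vec = C0 + A @ B

# linalg.generic form: the same reduction, one scalar at a time.
C_scalar = C0.copy()
for i in range(4):
    for j in range(4):
        for k in range(8):
            C_scalar[i, j] += A[i, k] * B[k, j]

assert np.allclose(C_vec, C_scalar, atol=1e-5)
```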
And the numerics are fixed. Maybe it's time for me to try the simulator you mentioned @makslevental.
@newling Can we have a full review of the current state of convolution ops in the next AIE sync? CC @MaheshRavishankar
> And the numerics are fixed. Maybe it's time for me to try the simulator you mentioned @makslevental.
I made progress on this yesterday, but I'm still blocked by some issue about `MEMTAB` in the ELF file. I'm hoping to land the PR (along with the new test suite) this week before I go back to HAL stuff next week.
Ok, thanks, I'll keep an eye out for that.
@yzhang93 this is ready for review again if you'd like to take a look
This PR switches all numerical convolution tests to use the objectFifo pipeline. With respect to the new tiling strategy:
1) A single column is currently used. Targeting multiple columns results in `error: 'aie.memtile_dma' op could not find and assign a valid BD id`. This will be investigated as follow-up work: https://github.com/nod-ai/iree-amd-aie/issues/821
2) There is no longer interleaving of compute and L2->L1 data movement, which means https://github.com/nod-ai/iree-amd-aie/issues/619 becomes low priority / obsolete.
3) L3->L2 and L2->L3 still use padding, but L2->L1 and L1->L2 use packing.
4) Channel-first convolution is completely unsupported; we expect high-level transforms to convert to channel-last before reaching our backend.
5) Vectorization is not currently enabled due to issues with alignment; see follow-up task https://github.com/nod-ai/iree-amd-aie/issues/820. This is functionally OK for now, as peano can scalarize code for all data types.
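Regarding point 3, a generic NumPy illustration of the padding-vs-packing distinction (the shapes and tile sizes here are made up, not the actual iree-amd-aie layouts): padding extends a buffer with zeros to a tile-aligned shape while keeping the original row-major layout, whereas packing rearranges an aligned buffer so that each tile is contiguous in memory.

```python
import numpy as np

# A 6x6 input to be processed in 4x4 tiles (sizes chosen for illustration).
X = np.arange(36, dtype=np.float32).reshape(6, 6)

# Padding: extend to the next multiple of the tile size with zeros,
# keeping the original row-major layout.
padded = np.zeros((8, 8), dtype=np.float32)
padded[:6, :6] = X

# Packing: rearrange the (now aligned) 8x8 buffer into a tiled
# (2, 2, 4, 4) layout so each 4x4 tile is contiguous.
packed = padded.reshape(2, 4, 2, 4).transpose(0, 2, 1, 3).copy()

assert packed.shape == (2, 2, 4, 4)
# Tile (0, 0) of the packed layout is the top-left 4x4 block.
assert np.array_equal(packed[0, 0], padded[:4, :4])
```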