Split L2 input and output objectFifos for memTile/shimTile distribution

nod-ai / iree-amd-aie

IREE plugin repository for the AMD AIE accelerator

Apache License 2.0

Split L2 input and output objectFifos for memTile/shimTile distribution #903

Closed yzhang93 closed 1 day ago

yzhang93 commented 1 week ago

This is the first PR needed to enable the 4x4 AIE array. The L2 objectFifos are split so that they can be distributed across multiple memTiles and shimTiles, which lets us use more channels. The shim/mem tile reassignment will be addressed in a separate PR.

Also note that this PR doesn't change or merge with the previous pass that splits the third input (elementwise) for connection reuse with matmul ops. What's unclear to me is what the matmul-elementwise path will look like for 4x4 AIE cores. If the existing logic is kept, we'll have to split the elementwise op into 16 objectFifos, which will be a big challenge given the number of shimTile channels. I'd rather leave it as a separate item for now, until we have a clear path forward.

@jtuyls I've simplified the functions a bit based on your initial version. Feel free to make modifications/push new commits.

newling commented 6 days ago

It would be nice to include a motivation for this in the description. I imagine it is to split tensors equally across all memory tiles, rather than putting them entirely on one? If this is the case, are you assuming nrows = ncols in choosing splitFactor? If the array is 4x4 and you have

 %alloc_0 = memref.alloc() : memref<4x1x32x32xi32, 1 : i32>
 %alloc_1 = memref.alloc() : memref<1x4x32x32xi32, 1 : i32>

I guess it works nicely. But if the array is 4x8 (8 cols) and the allocs are 4x1x... and 1x8x..., it's not optimal?

yzhang93 commented 3 days ago

> It would be nice to include a motivation for this in the description. I imagine it is to split tensors equally across all memory tiles, rather than putting them entirely on one? If this is the case, are you assuming nrows = ncols in choosing splitFactor? If the array is 4x4 and you have
>
>  %alloc_0 = memref.alloc() : memref<4x1x32x32xi32, 1 : i32>
>  %alloc_1 = memref.alloc() : memref<1x4x32x32xi32, 1 : i32>
>
> I guess it works nicely. But if the array is 4x8 (8 cols) and the allocs are 4x1x... and 1x8x..., it's not optimal?

Yes, it currently aims for a balanced use of rows and columns (e.g., 2x2/4x4), and I've added basic support for that. It should also work for an unbalanced use of columns/rows such as A: 4x1x... B: 1x2x..., which will split A into 4 separate objectFifos and B into 2 objectFifos, but we can't control how these are distributed in this pass. I think the tiling strategy and the tile assignment strategy should be responsible for the distribution part.
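For readers following along, the split counts mentioned here can be sketched roughly as below. This is only an illustration in Python, not the pass's actual C++ code: `split_l2_shape` is a hypothetical helper, and the rule that the first two dimensions of the L2 buffer are sliced down to size 1 is an assumption inferred from the shapes discussed in this thread.

```python
from itertools import product


def split_l2_shape(shape):
    """Split an L2 buffer shape into (offset, piece_shape) pairs.

    Assumption (inferred from the thread, not taken from the pass): the two
    outermost dimensions enumerate the pieces, so each piece has size 1 in
    both of them and keeps the inner dimensions unchanged.
    """
    rows, cols = shape[0], shape[1]
    piece = (1, 1) + tuple(shape[2:])
    # One piece per (row, col) index of the two outermost dimensions.
    return [((r, c), piece) for r, c in product(range(rows), range(cols))]


a_pieces = split_l2_shape((4, 1, 32, 32))  # A: 4x1x32x32
b_pieces = split_l2_shape((1, 2, 32, 32))  # B: 1x2x32x32
print(len(a_pieces), len(b_pieces))  # 4 2
```

Under the same assumption, the 4x8 case raised earlier (allocs 4x1x... and 1x8x...) would give 4 pieces for A and 8 for B; where those pieces land is left to the tiling and tile assignment strategies, as noted above.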

yzhang93 commented 1 day ago

> > Yes, currently it aims for balance use of number of rows and columns (e.g., 2x2/4x4) and added basic support for that
>
> It would be great if you can add a check on this in the code. My only remaining requests are
>
> - document the assumed relationship between the first 2 dims of the tensors and nrows, ncols
> - add tests/safeguards for the cases nrows != ncols

I've added more comments documenting the current assumptions. I'm not adding more tests specifically for nrows != ncols; I'll leave that for follow-ups, once the split factor depends only on ncols rather than being hardcoded the way it is right now.
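A safeguard of the kind requested could look roughly like the sketch below. It is a hypothetical Python illustration, not code from this PR: `check_split_assumptions` is an invented name, and the relationship it checks (dim 0 of the first operand matches nrows, dim 1 of the second operand matches ncols) is an assumption read off the 4x1x.../1x4x... example on a 4x4 array.

```python
def check_split_assumptions(lhs_shape, rhs_shape, nrows, ncols):
    """Return a list of violated assumptions; an empty list means none.

    Assumed relationship (inferred from the thread's examples only):
    lhs_shape[0] == nrows and rhs_shape[1] == ncols.
    """
    problems = []
    if lhs_shape[0] != nrows:
        problems.append(f"LHS dim 0 is {lhs_shape[0]}, expected nrows = {nrows}")
    if rhs_shape[1] != ncols:
        problems.append(f"RHS dim 1 is {rhs_shape[1]}, expected ncols = {ncols}")
    if nrows != ncols:
        # The thread states that only the balanced case has basic support so far.
        problems.append(f"nrows ({nrows}) != ncols ({ncols}): unbalanced arrays are not tested yet")
    return problems


print(check_split_assumptions((4, 1, 32, 32), (1, 4, 32, 32), nrows=4, ncols=4))  # []
```

For the 4x8 example from the review (allocs 4x1x... and 1x8x...), the shape checks pass and only the nrows != ncols warning is reported.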