nod-ai / iree-amd-aie

IREE plugin repository for the AMD AIE accelerator
Apache License 2.0

Two WIP regression fixes (outlining + coalescing) #895

Closed newling closed 6 days ago

newling commented 1 week ago

This PR has two changes:

(1) remove coalescing in the vectorization pass
(2) do not perform function outlining

(1) and (2) were introduced in https://github.com/nod-ai/iree-amd-aie/pull/856

The effects of these two changes, together and individually, on current Top of Main (ToM) for a matmul with m=n=512, k=4096:

ToM : 49 [ms]
ToM - (1) : 39 [ms]
ToM - (2) : 41 [ms]
ToM - (1) - (2) : 17 [ms]

This PR needs work: some lit tests need to be updated, and I think Abhishek mentioned that he wanted to keep outlining on for matmul+truncf (?)

What's up with coalescing causing a regression?

Here's my interpretation: coalescing the loops leaves a single loop index, which then has to be decomposed back into per-dimension offsets as

 %43:3 = affine.delinearize_index %arg4 into (%c8, %c8, %c4) : index, index, index

With %43#0, %43#1, and %43#2 being used as offsets in the vector.transfer_read ops.

This affine op eventually gets lowered by lower-affine into about 20 arith ops, which later lower to scalar LLVM ops. The full compute block is included below.
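As a rough sketch of what that decomposition costs (illustrative Python, not the actual lowering; the basis (8, 8, 4) matches the op above, variable names are made up):

```python
# Sketch of the index arithmetic that affine.delinearize_index expands to:
# decomposing one linear index into coordinates with basis (8, 8, 4).
# Each dimension costs a floordiv and/or a remainder per loop iteration;
# the signed-division fixups (the icmp/select pairs visible in the lowered
# LLVM IR) add several more scalar ops per dimension on top of this.

def delinearize(idx, basis):
    """Decompose a linear index into per-dimension coordinates."""
    # Suffix products: the stride of each dimension in the linear index.
    strides = []
    stride = 1
    for b in reversed(basis):
        strides.append(stride)
        stride *= b
    strides.reverse()
    # One div and one mod per dimension, every iteration.
    return [(idx // s) % b for b, s in zip(basis, strides)]

# One coalesced index 0 <= %arg4 < 8*8*4 maps to three loop counters:
assert delinearize(0, (8, 8, 4)) == [0, 0, 0]
assert delinearize(37, (8, 8, 4)) == [1, 1, 1]   # 37 = 1*32 + 1*4 + 1
assert delinearize(255, (8, 8, 4)) == [7, 7, 3]  # last iteration
```

With separate (non-coalesced) loops, each of these counters is just an induction variable incremented once per iteration, so none of this arithmetic appears in the hot loop.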

This is the biggest difference I can see that the change to vectorization in https://github.com/nod-ai/iree-amd-aie/pull/856 introduces.

      %332 = llvm.icmp "slt" %330, %7 : i64
      %333 = llvm.sub %17, %330 : i64
      %334 = llvm.select %332, %333, %330 : i1, i64
      %335 = llvm.sdiv %334, %18  : i64
      %336 = llvm.sub %17, %335 : i64
      %337 = llvm.select %332, %336, %335 : i1, i64
      %338 = llvm.srem %330, %18  : i64
      %339 = llvm.icmp "slt" %338, %7 : i64
      %340 = llvm.add %338, %18 : i64
      %341 = llvm.select %339, %340, %338 : i1, i64
      %342 = llvm.icmp "slt" %341, %7 : i64
      %343 = llvm.sub %17, %341 : i64
      %344 = llvm.select %342, %343, %341 : i1, i64
      %345 = llvm.sdiv %344, %9  : i64
      %346 = llvm.sub %17, %345 : i64
      %347 = llvm.select %342, %346, %345 : i1, i64
      %348 = llvm.srem %330, %9  : i64
      %349 = llvm.icmp "slt" %348, %7 : i64
      %350 = llvm.add %348, %9 : i64
      %351 = llvm.select %349, %350, %348 : i1, i64
      %352 = llvm.mul %351, %3 : i64
      %353 = llvm.mul %337, %18 : i64
      %354 = llvm.add %352, %353 : i64
      %355 = llvm.extractvalue %30[1] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)> 
      %356 = llvm.getelementptr %355[%354] : (!llvm.ptr, i64) -> !llvm.ptr, bf16
      %357 = llvm.load %356 : !llvm.ptr -> vector<32xbf16>
      %358 = llvm.mul %347, %16 : i64
      %359 = llvm.mul %351, %18 : i64
      %360 = llvm.add %358, %359 : i64
      %361 = llvm.extractvalue %31[1] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)> 
      %362 = llvm.getelementptr %361[%360] : (!llvm.ptr, i64) -> !llvm.ptr, bf16
      %363 = llvm.load %362 : !llvm.ptr -> vector<32xbf16>
      %364 = llvm.mul %337, %5 : i64
      %365 = llvm.add %358, %364 : i64
      %366 = llvm.extractvalue %32[1] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)> 
      %367 = llvm.getelementptr %366[%365] : (!llvm.ptr, i64) -> !llvm.ptr, f32
      %368 = llvm.load %367 {alignment = 4 : i64} : !llvm.ptr -> vector<16xf32>
      %369 = llvm.bitcast %368 : vector<16xf32> to vector<8xi64>
      %370 = "xllvm.intr.aie2.bf.mac16.conf"(%357, %363, %369, %15) : (vector<32xbf16>, vector<32xbf16>, vector<8xi64>, i32) -> vector<8xi64>
      %371 = llvm.bitcast %370 : vector<8xi64> to vector<16xf32>

What's up with outlining causing a regression?

This is pretty obvious, I guess: function call overhead incurred on every invocation of the outlined function.
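For intuition only (this is not the compiled AIE code; names are made up): when a tiny loop body is outlined into a function, the per-iteration call overhead (argument passing, frame setup) is paid on every iteration, whereas the inlined form does only the arithmetic.

```python
# Illustrative only: outlining a small hot loop body into a function adds
# a call per iteration; inlining the same work avoids that overhead.
# This is a sketch of the general phenomenon, not the actual kernels.

def outlined_body(acc, x):
    # The "outlined" compute kernel: trivial work per call.
    return acc + x * x

def run_outlined(xs):
    acc = 0
    for x in xs:
        acc = outlined_body(acc, x)  # one function call per iteration
    return acc

def run_inlined(xs):
    acc = 0
    for x in xs:
        acc = acc + x * x  # same work, no call overhead
    return acc

xs = list(range(100))
# Both compute the same result; only the call structure differs.
assert run_outlined(xs) == run_inlined(xs) == sum(x * x for x in xs)
```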

Abhishek-Varma commented 1 week ago

Hi @newling !

Thank you so much for this amazing find!

I've addressed the comments in https://github.com/nod-ai/iree-amd-aie/commits/avarma_james_func_outline/ :-

  1. 8ccc95196 - does the selective enabling of function outlining.
  2. 3b72c54ef - updates the lit tests for insert-loops-for-vectorization.

Both of these can be cherry-picked onto this PR's branch newling:regression_fixes (I couldn't do it myself as I don't have access to that branch). It would be nice to get the bug fix from https://github.com/nod-ai/iree-amd-aie/pull/888 in first, though.

NOTE: For the selective function outlining, I first thought of adding a pass flag, but that didn't seem like the right thing to do: given a random dispatch, it isn't generic enough. Since outlining is invoked after bufferization, the use-def chain is not easy to traverse because everything is just memrefs (so getDefiningOp() is of little use). I also thought of adding a change in KernelDispatch to figure this out, but that turned out to be quite involved.

Therefore I'm just flagging the approach I've taken for this selective outlining of "Matmul + Truncf" - you may re-enable it in the pipeline as it was earlier.

newling commented 1 week ago

@Abhishek-Varma Maybe now is a good time to generalize the logic for truncf to any unary elementwise operation?

newling commented 1 week ago

@Abhishek-Varma and can you please make a new standalone PR for https://github.com/nod-ai/iree-amd-aie/commit/3b72c54ef654d631f9aa8d087b3d5cf768ee87e0 ?