nod-ai / iree-amd-aie

IREE plugin repository for the AMD AIE accelerator
Apache License 2.0
69 stars 30 forks source link

Coalescing loops #907

Open newling opened 5 days ago

newling commented 5 days ago

For iree-amd-aie's matmul pipeline, this is state we're in with coalescing enabled, just before affine-to-standard pass

#map = affine_map<()[s0, s1] -> (s0 * 256 + s1 * 32)>
#map1 = affine_map<()[s0, s1] -> (s0 * 128 + s1 * 32)>
#map2 = affine_map<()[s0, s1] -> (s0 * 128 + s1 * 16)>
...

scf.for %arg1 = %c0 to %c256 step %c1 {
  %0:3 = affine.delinearize_index %arg1 into  (8, 8, 4) : index, index, index
  %1 = affine.apply #map()[%0#2, %0#0]
  %3 = affine.apply #map1()[%0#1, %0#2]
  %5 = affine.apply #map2()[%0#1, %0#0]
  %2 = vector.transfer_read %collapse_shape_85[%1], %cst {in_bounds = [true]} : memref<1024xbf16, strided<[1]>>, vector<32xbf16>
  %4 = vector.transfer_read %collapse_shape_86[%3], %cst {in_bounds = [true]} : memref<1024xbf16, strided<[1]>>, vector<32xbf16>
  %6 = vector.transfer_read %collapse_shape_87[%5], %cst_50 {in_bounds = [true]} : memref<1024xf32, strided<[1]>>, vector<16xf32>
  ...
}

And if there is no coalescing, the IR looks like:

#map = affine_map<()[s0, s1] -> (s0 * 256 + s1 * 32)>
#map1 = affine_map<()[s0, s1] -> (s0 * 128 + s1 * 32)>
#map2 = affine_map<()[s0, s1] -> (s0 * 128 + s1 * 16)>
...
scf.for %arg1 = %c0 to %c8 step %c1 {
  scf.for %arg2 = %c0 to %c8 step %c1 {
    scf.for %arg3 = %c0 to %c4 step %c1 {
      %1 = affine.apply #map()[%arg3, %arg1]
      %3 = affine.apply #map1()[%arg2, %arg3]
      %0 = affine.apply #map2()[%arg2, %arg1]
      %2 = vector.transfer_read %collapse_shape[%1], %cst {in_bounds = [true]} : memref<1024xbf16>, vector<32xbf16>
      %4 = vector.transfer_read %collapse_shape_53[%3], %cst {in_bounds = [true]} : memref<1024xbf16>, vector<32xbf16>
      %5 = vector.transfer_read %collapse_shape_54[%0], %cst_50 {in_bounds = [true]} : memref<1024xf32>, vector<16xf32>
      ...
    }
  }
}

Question: why do we want to coalesce here in the first place, afaict it's never possible to have as little index arithmetic in the coalesced case as the uncoalesced case. You might be to convert the coalesced case to something without modular arithmetic, but you'll still need some division shenanigans, which you don't need if you don't coalesce -- without colaescing, %arg1 %arg2 and %arg3 are exactly what we want, and the 3 loops get translated to simple add-compare logic in scf-to-sf. The only reason I can think we would want to coalesce is that after convert-scf-to-cf you then have a cf.cond_br logic which is "simple": just a single variable iterating to 256, rather than a waterfall of 3 counters (to 8, 8, and 4 respectively). But why is this desirable, how does it help llvm/peano?

(@MaheshRavishankar @Abhishek-Varma)