mlir/Affine: sibling loop fusion with `fusion-maximal` on leads to invalid optimization

rohany commented 1 year ago

The following MLIR file:

func.func @f(%input : memref<10xf32>, %output : memref<10xf32>, %reduc : memref<10xf32>) {
  %zero = arith.constant 0. : f32
  %one = arith.constant 1. : f32
  affine.for %i = 0 to 10 {
    %0 = affine.load %input[%i] : memref<10xf32>
    %2 = arith.addf %0, %one : f32
    affine.store %2, %output[%i] : memref<10xf32>
  }
  affine.for %i = 0 to 10 {
    %0 = affine.load %input[%i] : memref<10xf32>
    %1 = affine.load %reduc[0] : memref<10xf32>
    %2 = arith.addf %0, %1 : f32
    affine.store %2, %reduc[0] : memref<10xf32>
  }
  return
}

run with

bin/mlir-opt -pass-pipeline='builtin.module(func.func(affine-loop-fusion{mode=sibling fusion-maximal}))'

produces:

module {
  func.func @f(%arg0: memref<10xf32>, %arg1: memref<10xf32>, %arg2: memref<10xf32>) {
    %cst = arith.constant 0.000000e+00 : f32
    %cst_0 = arith.constant 1.000000e+00 : f32
    affine.for %arg3 = 0 to 10 {
      %0 = affine.load %arg0[%arg3] : memref<10xf32>
      %1 = arith.addf %0, %cst_0 : f32
      affine.store %1, %arg1[%arg3] : memref<10xf32>
      affine.for %arg4 = 0 to 10 {
        %2 = affine.load %arg0[%arg4] : memref<10xf32>
        %3 = affine.load %arg2[0] : memref<10xf32>
        %4 = arith.addf %2, %3 : f32
        affine.store %4, %arg2[0] : memref<10xf32>
      }
    }
    return
  }
}

This looks incorrect to me, as the reduction into %arg2[0] will occur 100 times instead of 10 times now. Without fusion-maximal, the correct fusion gets applied:

module {
  func.func @f(%arg0: memref<10xf32>, %arg1: memref<10xf32>, %arg2: memref<10xf32>) {
    %cst = arith.constant 0.000000e+00 : f32
    %cst_0 = arith.constant 1.000000e+00 : f32
    affine.for %arg3 = 0 to 10 {
      %0 = affine.load %arg0[%arg3] : memref<10xf32>
      %1 = arith.addf %0, %cst_0 : f32
      affine.store %1, %arg1[%arg3] : memref<10xf32>
      %2 = affine.load %arg0[%arg3] : memref<10xf32>
      %3 = affine.load %arg2[0] : memref<10xf32>
      %4 = arith.addf %2, %3 : f32
      affine.store %4, %arg2[0] : memref<10xf32>
    }
    return
  }
}

llvmbot commented 1 year ago

@llvm/issue-subscribers-mlir-affine

caojoshua commented 1 year ago

I attempted to solve this issue, but I don't think I understand the code well enough. I'm leaving what I've found so far here.

I believe the main issue comes when reset the sliceState upper bound. The original value is (d0) -> (d0 + 1).

I can't figure out why we need to reset bounds at all. Actually, if I comment out those two lines that I linked above, all the mlir/test/Dialect/Affine tests still pass anyway. Seems like a case of LoopFusion / Affine code not being mature enough right now.

rikhuijzer commented 1 year ago

This issue also occurs when removing mode=sibling from the mlir-opt call, that is,

bin/mlir-opt -pass-pipeline='builtin.module(func.func(affine-loop-fusion{fusion-maximal}))'

and I'm not sure what should be the expected behavior here. After a bit of digging, two things seem to go wrong here:

mlir::affine::isLoopMemoryParallel concludes that both loops are not parallel while they should be. This is caused by the loop in https://github.com/llvm/llvm-project/blob/47cf7a4ba54edf7c2a78503d7cca8a003f194d32/mlir/lib/Dialect/Affine/Analysis/AffineAnalysis.cpp#L172-L174 which checks for dependencies even for operations inside the same loop. So continuing when both are in the same loop ((srcOp->getParentOp() == forOp && dstOp->getParentOp() == forOp)) could be a possible fix.
The loop fusion algorithm determines that a sibling fusion should be applied. In turn, this sibling fusion nests the second loop inside the first. I'm not sure whether (a) sibling fusion is correctly determined to be allowed, but is applied wrongly by nesting the loops or (b) sibling fusion should not be allowed since the loops cannot be nested.

@dcaballe or @bondhugula could one of you give me some tips? Then I'll probably can get the patch done.

bondhugula commented 1 year ago

lir::affine::isLoopMemoryParallel concludes that both loops are not parallel while they should be. This is caused by the loop in

Which two loops are you referring to here?

The first loop nest is parallel, but the second isn't parallel (isLoopMemoryParallel considers memory dependences).

The fusion is clearly incorrect here, as you point out -- this is a bug.

Normally, -debug-only=affine-loop-fusion,loop-fusion-utils should give you useful info. In this case, the debug info also looks sparse. I'll fix that to start with.

llvm / llvm-project

mlir/Affine: sibling loop fusion with `fusion-maximal` on leads to invalid optimization #61820