Add support for some multi-store cases in affine loop fusion

dcaballe commented 5 years ago

Hello!

We've been giving affine loop fusion a try and we are pretty happy with the initial results! Great work! We are seeing quite a few loop nests being fused in our models!

We noticed that currently only single-store producer loops are supported and would like to contribute a fix for that since it seems to be an important limitation. Even though our loop nests usually have a single store, this limitation is exposed when several single-store loop nests can be fused. The following example is a snippet of a model we are working on, where 4 loop nests could be fused into a single one:

func @main(%in : memref<1048576x256xf32>, %out : memref<1048576x256xf32>) {
  %cst = constant 0.000000e+00 : f32

  %6 = alloc() : memref<1048576x256xf32>
  affine.for %arg7 = 0 to 1048576 {
    affine.for %arg8 = 0 to 256 {
      %13 = affine.load %in[%arg7, %arg8] : memref<1048576x256xf32>
      %14 = affine.load %in[%arg7, %arg8] : memref<1048576x256xf32>
      %15 = mulf %14, %13 : f32
      affine.store %15, %6[%arg7, %arg8] : memref<1048576x256xf32>
    }
  }
  %7 = alloc() : memref<1048576x256xf32>
  affine.for %arg7 = 0 to 1048576 {
    affine.for %arg8 = 0 to 256 {
      %13 = affine.load %6[%arg7, %arg8] : memref<1048576x256xf32>
      %14 = cmpf "ogt", %13, %cst : f32
      %15 = select %14, %13, %cst : f32
      affine.store %15, %7[%arg7, %arg8] : memref<1048576x256xf32>
    }
  }
  %8 = alloc() : memref<1048576x256xf32>
  affine.for %arg7 = 0 to 1048576 {
    affine.for %arg8 = 0 to 256 {
      %13 = affine.load %7[%arg7, %arg8] : memref<1048576x256xf32>
      %14 = affine.load %7[%arg7, %arg8] : memref<1048576x256xf32>
      %15 = mulf %14, %13 : f32
      affine.store %15, %8[%arg7, %arg8] : memref<1048576x256xf32>
    }
  }
  affine.for %arg7 = 0 to 1048576 {
    affine.for %arg8 = 0 to 256 {
      %13 = affine.load %8[%arg7, %arg8] : memref<1048576x256xf32>
      %14 = cmpf "ogt", %13, %cst : f32
      %15 = select %14, %13, %cst : f32
      affine.store %15, %out[%arg7, %arg8] : memref<1048576x256xf32>
    }
  }
  dealloc %6 : memref<1048576x256xf32>
  dealloc %7 : memref<1048576x256xf32>
  dealloc %8 : memref<1048576x256xf32>
  return
}

However, affine loop fusion only fuses the first 3 loops and not the 4th one:

module {
  func @main(%arg0: memref<1048576x256xf32>, %arg1: memref<1048576x256xf32>) {
    %cst = constant 0.000000e+00 : f32
    %0 = alloc() : memref<1048576x256xf32>
    %1 = alloc() : memref<1048576x256xf32>
    %2 = alloc() : memref<1048576x256xf32>
    affine.for %arg2 = 0 to 1048576 {
      affine.for %arg3 = 0 to 256 {
        %3 = affine.load %arg0[%arg2, %arg3] : memref<1048576x256xf32>
        %4 = affine.load %arg0[%arg2, %arg3] : memref<1048576x256xf32>
        %5 = mulf %4, %3 : f32
        affine.store %5, %0[%arg2, %arg3] : memref<1048576x256xf32>
        %6 = affine.load %0[%arg2, %arg3] : memref<1048576x256xf32>
        %7 = cmpf "ogt", %6, %cst : f32
        %8 = select %7, %6, %cst : f32
        affine.store %8, %1[%arg2, %arg3] : memref<1048576x256xf32>
        %9 = affine.load %1[%arg2, %arg3] : memref<1048576x256xf32>
        %10 = affine.load %1[%arg2, %arg3] : memref<1048576x256xf32>
        %11 = mulf %10, %9 : f32
        affine.store %11, %2[%arg2, %arg3] : memref<1048576x256xf32>
      }
    }
    affine.for %arg2 = 0 to 1048576 {
      affine.for %arg3 = 0 to 256 {
        %3 = affine.load %2[%arg2, %arg3] : memref<1048576x256xf32>
        %4 = cmpf "ogt", %3, %cst : f32
        %5 = select %4, %3, %cst : f32
        affine.store %5, %arg1[%arg2, %arg3] : memref<1048576x256xf32>
      }
    }
    dealloc %0 : memref<1048576x256xf32>
    dealloc %1 : memref<1048576x256xf32>
    dealloc %2 : memref<1048576x256xf32>
    return
  }
}

This happens because the algorithm starts with the 3rd loop nest as consumer and it fuses 2nd and 1st one into it. Then, it picks 4th as consumer and tries to fuse the previously fused one into it, which it's not possible because it has multiple stores.

This PR is a stepping stone towards supporting generic multi-store producer loop nests in affine loop fusion. It extends the algorithm to support fusion of multi-store producer loop nests that:

have only one store that writes to a function-local live out, and
the remaining stores are only involved in loop nest self dependences or no dependences within the function.

It would be great to get feedback on this fix, if this approach is aligned with what you envision for the generic multi-output problem and if there are more cases that could be problematic and are currently not properly addressed in this PR.

Thanks! Diego

bondhugula commented 5 years ago

Hi Diego,

Thanks very much! The single-store limitation has indeed been a key limitation that we may want to immediately get rid of. I'll be happy to provide feedback on this PR - @andydavis1 will have more useful feedback than I here though.

On 02/10/2019 05:01, Diego Caballero wrote:

Hello!

We've been giving affine loop fusion a try and we are pretty happy with the initial results! Great work! We are seeing quite a few loop nests being fused in our models!

We noticed that currently only single-store producer loops are supported and would like to contribute a fix for that since it seems to be an important limitation. Even though our loop nests usually have a single store, this limitation is exposed when several single-store loop nests can be fused. The following example is a snippet of a model we are working on, where 4 loop nests could be fused into a single one:

|func @main(%in : memref<1048576x256xf32>, %out : memref<1048576x256xf32>) { %cst = constant 0.000000e+00 : f32 %6 = alloc() : memref<1048576x256xf32> affine.for %arg7 = 0 to 1048576 { affine.for %arg8 = 0 to 256 { %13 = affine.load %in[%arg7, %arg8] : memref<1048576x256xf32> %14 = affine.load %in[%arg7, %arg8] : memref<1048576x256xf32> %15 = mulf %14, %13 : f32 affine.store %15, %6[%arg7, %arg8] : memref<1048576x256xf32> } } %7 = alloc() : memref<1048576x256xf32> affine.for %arg7 = 0 to 1048576 { affine.for %arg8 = 0 to 256 { %13 = affine.load %6[%arg7, %arg8] : memref<1048576x256xf32> %14 = cmpf "ogt", %13, %cst : f32 %15 = select %14, %13, %cst : f32 affine.store %15, %7[%arg7, %arg8] : memref<1048576x256xf32> } } %8 = alloc() : memref<1048576x256xf32> affine.for %arg7 = 0 to 1048576 { affine.for %arg8 = 0 to 256 { %13 = affine.load %7[%arg7, %arg8] : memref<1048576x256xf32> %14 = affine.load %7[%arg7, %arg8] : memref<1048576x256xf32> %15 = mulf %14, %13 : f32 affine.store %15, %8[%arg7, %arg8] : memref<1048576x256xf32> } } affine.for %arg7 = 0 to 1048576 { affine.for %arg8 = 0 to 256 { %13 = affine.load %8[%arg7, %arg8] : memref<1048576x256xf32> %14 = cmpf "ogt", %13, %cst : f32 %15 = select %14, %13, %cst : f32 affine.store %15, %out[%arg7, %arg8] : memref<1048576x256xf32> } } dealloc %6 : memref<1048576x256xf32> dealloc %7 : memref<1048576x256xf32> dealloc %8 : memref<1048576x256xf32> return } |

However, affine loop fusion only fuses the first 3 loops and not the 4th one:

|module { func @main(%arg0: memref<1048576x256xf32>, %arg1: memref<1048576x256xf32>) { %cst = constant 0.000000e+00 : f32 %0 = alloc() : memref<1048576x256xf32> %1 = alloc() : memref<1048576x256xf32> %2 = alloc() : memref<1048576x256xf32> affine.for %arg2 = 0 to 1048576 { affine.for %arg3 = 0 to 256 { %3 = affine.load %arg0[%arg2, %arg3] : memref<1048576x256xf32> %4 = affine.load %arg0[%arg2, %arg3] : memref<1048576x256xf32> %5 = mulf %4, %3 : f32 affine.store %5, %0[%arg2, %arg3] : memref<1048576x256xf32> %6 = affine.load %0[%arg2, %arg3] : memref<1048576x256xf32> %7 = cmpf "ogt", %6, %cst : f32 %8 = select %7, %6, %cst : f32 affine.store %8, %1[%arg2, %arg3] : memref<1048576x256xf32> %9 = affine.load %1[%arg2, %arg3] : memref<1048576x256xf32> %10 = affine.load %1[%arg2, %arg3] : memref<1048576x256xf32> %11 = mulf %10, %9 : f32 affine.store %11, %2[%arg2, %arg3] : memref<1048576x256xf32> } } affine.for %arg2 = 0 to 1048576 { affine.for %arg3 = 0 to 256 { %3 = affine.load %2[%arg2, %arg3] : memref<1048576x256xf32> %4 = cmpf "ogt", %3, %cst : f32 %5 = select %4, %3, %cst : f32 affine.store %5, %arg1[%arg2, %arg3] : memref<1048576x256xf32> } } dealloc %0 : memref<1048576x256xf32> dealloc %1 : memref<1048576x256xf32> dealloc %2 : memref<1048576x256xf32> return } } |

This happens because the algorithm starts with the 3rd loop nest as consumer and it fuses 2nd and 1st one into it. Then, it picks 4th as consumer and tries to fuse the previously fused one into it, which it's not possible because it has multiple stores.

This PR is a stepping stone towards supporting generic multi-store producer loop nests in affine loop fusion. It extends the algorithm to support fusion of multi-store producer loop nests that:

have only one store that writes to a function-local live out, and

the remaining stores are only involved in loop nest self dependences or no dependences within the function.

It would be great to get feedback on this fix, if this approach is aligned with what you envision for the generic multi-output problem and if there are more cases that could be problematic and are currently not properly addressed in this PR.

Thanks! Diego
    You can view, comment on, or merge this pull request online at:
https://github.com/tensorflow/mlir/pull/162
    Commit Summary
Add support for some multi-store cases in affine fusion
File Changes
M lib/Transforms/LoopFusion.cpp https://github.com/tensorflow/mlir/pull/162/files#diff-0 (64)
M test/Transforms/loop-fusion.mlir https://github.com/tensorflow/mlir/pull/162/files#diff-1 (75)
Patch Links:
https://github.com/tensorflow/mlir/pull/162.patch

https://github.com/tensorflow/mlir/pull/162.diff
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/tensorflow/mlir/pull/162?email_source=notifications&email_token=ABVPBEJHXHCHD2UQH336YS3QMPMT3A5CNFSM4I4P6QK2YY3PNVWWK3TUL52HS4DFUVEXG43VMWVGG33NNVSW45C7NFSM4HO7U2XA, or mute the thread https://github.com/notifications/unsubscribe-auth/ABVPBEPITRITRSTSWLSGGYTQMPMT3ANCNFSM4I4P6QKQ.

andydavis1 commented 5 years ago

Hello Diego,

Glad you are using Affine Loop Fusion, and thanks for the feedback and PR!!! I'll take a look at the PR in a bit. W.r.t future plans for this pass: I do have a plan for a new Affine Loop Fusion pass, which will handle the many of these corner cases in a unified way (and result in a simpler pass). I've started to build out the utilities for the new pass here: https://github.com/tensorflow/mlir/blob/master/include/mlir/Transforms/LoopFusionUtils.h Stay tuned!

dcaballe commented 5 years ago

Great! Thank you both! Please, keep me posted on the new fusion algorithm and let me know if I can help with reviews or something. Do you plan to send an RFC highlighting the main changes, limitations and improvements?

dcaballe commented 5 years ago

Thanks, Uday!

dcaballe commented 5 years ago

Anything else is needed?

bondhugula commented 5 years ago

This looks great to me.

andydavis1 commented 5 years ago

Yes. Looks good. Thanks Diego!

tensorflow / mlir

Add support for some multi-store cases in affine loop fusion #162