tensorflow / mlir

"Multi-Level Intermediate Representation" Compiler Infrastructure

Use alloca instead of malloc for 0-d memref's? Getting scalar replacement working. #65

Open bondhugula opened 5 years ago

bondhugula commented 5 years ago

One way to hoist loop invariant memref load/store's and replace such subscripted accesses with scalars is to introduce 0-d memref's (load/store from the main memref to those 0-d memref's, then access the latter inside the loop), and then rely on LLVM's mem2reg to eliminate the lowered counterparts of those 0-d memref's, substituting (scalar) SSA values instead. However, the current lowering of MLIR's alloc to LLVM's malloc (in the -lower-to-llvm pass) gets in the way; this is trivially fixable as shown below.

Here's a simple MLIR example obtained after hoisting a loop invariant load (assume that the 0-d memref %s was automatically introduced by a pass/utility that performed the hoisting for a loop invariant load on the original %A):

func @foo() {
  %c0 = constant 0 : index
  %cf0 = constant 0.0 : f32
  %A = alloc() : memref<1024xf32>
  %s = alloc() : memref<f32>
  %v = load %A[%c0] : memref<1024xf32>
  store %v, %s[]: memref<f32>

  affine.for %i = 0 to 1024 {
    %w = load %s[] : memref<f32>
    %r = addf %w, %cf0 : f32
  }
  return
}

The current lowering to LLVM (via -lower-affine -lower-to-llvm | mlir-translate -mlir-to-llvmir) yields:

; ModuleID = 'LLVMDialectModule'
source_filename = "LLVMDialectModule"

declare i8* @malloc(i64)

declare void @free(i8*)

define void @foo() {
  %1 = call i8* @malloc(i64 4096)
  %2 = bitcast i8* %1 to float*
  %3 = call i8* @malloc(i64 4)
  %4 = bitcast i8* %3 to float*
  %5 = getelementptr float, float* %2, i64 0
  %6 = load float, float* %5
  store float %6, float* %4
  br label %7

7:                                                ; preds = %10, %0
  %8 = phi i64 [ %13, %10 ], [ 0, %0 ]
  %9 = icmp slt i64 %8, 1024
  br i1 %9, label %10, label %14

10:                                               ; preds = %7
  %11 = load float, float* %4
  %12 = fadd float %11, 0.000000e+00
  %13 = add i64 %8, 1
  br label %7

14:                                               ; preds = %7
  ret void
}

Running 'mem2reg' on this doesn't yield the desired replacement (expected, since mem2reg only promotes alloca's?).

However, using an alloca for the single element memref makes the whole thing work:
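
Here's a sketch of the hand-edited input fed to opt below (this reconstruction isn't from the original post; value numbering is illustrative): the bitcast of the 4-byte malloc is replaced with an alloca, and the loop's load/store go through that alloca. The now-dead malloc(i64 4) call is left in place, which is why it still appears in the -mem2reg output further down.

declare i8* @malloc(i64)

declare void @free(i8*)

define void @foo() {
  %1 = call i8* @malloc(i64 4096)
  %2 = bitcast i8* %1 to float*
  %3 = call i8* @malloc(i64 4)     ; now unused; mem2reg alone won't remove a call
  %4 = alloca float                ; was: bitcast i8* %3 to float*
  %5 = getelementptr float, float* %2, i64 0
  %6 = load float, float* %5
  store float %6, float* %4
  br label %7

7:                                                ; preds = %10, %0
  %8 = phi i64 [ %13, %10 ], [ 0, %0 ]
  %9 = icmp slt i64 %8, 1024
  br i1 %9, label %10, label %14

10:                                               ; preds = %7
  %11 = load float, float* %4
  %12 = fadd float %11, 0.000000e+00
  %13 = add i64 %8, 1
  br label %7

14:                                               ; preds = %7
  ret void
}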

$ opt -S /tmp/test4.mlir -mem2reg
; ModuleID = '/tmp/test4.mlir'
source_filename = "LLVMDialectModule"

declare i8* @malloc(i64)

declare void @free(i8*)

define void @foo() {
  %1 = call i8* @malloc(i64 4096)
  %2 = bitcast i8* %1 to float*
  %3 = call i8* @malloc(i64 4)
  %4 = getelementptr float, float* %2, i64 0
  %5 = load float, float* %4
  br label %6

6:                                                ; preds = %9, %0
  %7 = phi i64 [ %11, %9 ], [ 0, %0 ]
  %8 = icmp slt i64 %7, 1024
  br i1 %8, label %9, label %12

9:                                                ; preds = %6
  %10 = fadd float %5, 0.000000e+00
  %11 = add i64 %7, 1
  br label %6

12:                                               ; preds = %6
  ret void
}

One option here would be to have AllocOpLowering use 'alloca' for 0-d memrefs (they have a single element by definition) instead of a malloc + bitcast: https://github.com/tensorflow/mlir/blob/10d0c61266696c4af211ee09669c3242ed3beea4/lib/Conversion/StandardToLLVM/ConvertStandardToLLVM.cpp#L432. But this would require the lowering to check lifetimes, a concern that shouldn't be part of the lowering. Instead, one would want a pass/utility that replaces alloc's with llvm.alloca when appropriate, but that isn't "lowering path neutral" and thus not reusable infrastructure. Should there be a new standard alloca op for stack allocation, so that a pass/utility could replace alloc's with it when appropriate, or generate such alloca's instead of alloc's in the first place? The current alloc op doesn't say where the memory gets allocated from.
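
As a rough sketch (the alloca op name and syntax here are hypothetical, mirroring the existing alloc), the hoisted example from above could then be written as:

func @foo() {
  %c0 = constant 0 : index
  %cf0 = constant 0.0 : f32
  %A = alloc() : memref<1024xf32>
  // Hypothetical stack-allocation op; would lower directly to llvm.alloca.
  %s = alloca() : memref<f32>
  %v = load %A[%c0] : memref<1024xf32>
  store %v, %s[] : memref<f32>

  affine.for %i = 0 to 1024 {
    %w = load %s[] : memref<f32>
    %r = addf %w, %cf0 : f32
  }
  return
}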

(On a side note, another way to accomplish such scalar replacement of accesses is to use loop live-out scalars where the IR supports them, but there are still good reasons for doing this via single element memref's and then relying on mem2reg, and for being able to go from one form to the other within MLIR itself.)

joker-eph commented 5 years ago

One way to hoist loop invariant memref load/store's and replace such subscripted accesses with scalars is to introduce 0-d memref's (load/store from the main memref to those 0-d memref's, then access the latter inside the loop),

Why not just emit a load to a scalar value and use that value, instead of creating a 0-d alloc?

Also, your examples are a bit contrived: the free()'s are missing and the code does not produce anything, so everything can be DCE'd.

bondhugula commented 5 years ago

One way to hoist loop invariant memref load/store's and replace such subscripted accesses with scalars is to introduce 0-d memref's (load/store from the main memref to those 0-d memref's, then access the latter inside the loop),

Why not just emit a load to a scalar value and use that value, instead of creating a 0-d alloc?

Because the scalar could be live out of the loop (while being stored to in the loop).

s = C[i];
for (k = 0; k < N; ++k) {
  s = s + A[k];
}
C[i] = s;

Also, your examples are a bit contrived: the free()'s are missing and the code does not produce anything, so everything can be DCE'd.

That was just to keep the examples short! :-) Just consider doing scalar replacement on something being reduced - like Y[i] below.

for (i = 0; i < N; ++i) {
  for (j = 0; j < N; ++j) {
    Y[i] += A[i][j] * X[j];
  }
}

There's no way to represent the scalar replacement of C[i] or Y[i] for the loop forms above in MLIR without going through memory (single element memref's).
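
For instance, a scalar-replaced form of the Y[i] reduction would have to look roughly like the following sketch (function name, argument names, and the 1024 bounds are illustrative):

func @matvec(%Y: memref<1024xf32>, %A: memref<1024x1024xf32>, %X: memref<1024xf32>) {
  affine.for %i = 0 to 1024 {
    // Single-element buffer carrying the reduction of Y[i] across the inner loop.
    %s = alloc() : memref<f32>
    %y = load %Y[%i] : memref<1024xf32>
    store %y, %s[] : memref<f32>
    affine.for %j = 0 to 1024 {
      %a = load %A[%i, %j] : memref<1024x1024xf32>
      %x = load %X[%j] : memref<1024xf32>
      %p = mulf %a, %x : f32
      %acc = load %s[] : memref<f32>
      %sum = addf %acc, %p : f32
      store %sum, %s[] : memref<f32>
    }
    %res = load %s[] : memref<f32>
    store %res, %Y[%i] : memref<1024xf32>
  }
  return
}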

joker-eph commented 5 years ago

Because the scalar could be live out of the loop (while being stored to in the loop).

Oh right... I got short-sighted by your intro, in which you mentioned "hoisting a loop invariant load", but it was just a smaller use-case than the general one.

LLVM does not seem to be bothered much by the malloc though; for instance, the following two functions generate the exact same LLVM IR (after -O2): https://godbolt.org/z/2N4xVl

#include <stdlib.h>  // for malloc/free
// N is assumed to be a compile-time constant (e.g., supplied via -DN=... in the godbolt link).

void foo(int Y[N], int A[N][N], int X[N]) {
  for (int i = 0; i < N; ++i) {
    int *s = (int*)malloc((sizeof(int)));
    *s = Y[i];
    for (int j = 0; j < N; ++j) {
      *s += A[i][j] * X[j];
    }
    Y[i] += *s;
    free(s);
  }    
}

void bar(int Y[N], int A[N][N], int X[N]) {
  for (int i = 0; i < N; ++i) { 
    int tmp;
    int *s = &tmp;
    *s = Y[i];
    for (int j = 0; j < N; ++j) {
      *s += A[i][j] * X[j];
    }
    Y[i] += *s;
  }    
}

(That said, it does not mean we can't do better modeling at the MLIR level)

dcaballe commented 5 years ago

This sounds pretty much like the same kind of problem as the one discussed here: https://github.com/tensorflow/mlir/pull/55#issuecomment-518134726. I think we should try to follow the same approach for both. I see the value of turning mallocs into llvm allocas for 0D memrefs and even for other small memrefs. However, this optimization should be optional. We may not want to apply it when an mlir alloc is going to be lowered to a custom memory allocator, since the memory allocator may expect to have control of all the allocated memory.

Diego

bondhugula commented 5 years ago

Because the scalar could be live out of the loop (while being stored to in the loop).

Oh right... I got short-sighted by your intro, in which you mentioned "hoisting a loop invariant load", but it was just a smaller use-case than the general one.

That's right - my initial examples didn't capture the general case.

LLVM does not seem to be bothered much by the malloc though; for instance, the following two functions generate the exact same LLVM IR (after -O2): https://godbolt.org/z/2N4xVl

void foo(int Y[N], int A[N][N], int X[N]) {
  for (int i = 0; i < N; ++i) {
    int *s = (int*)malloc((sizeof(int)));
    *s = Y[i];
    for (int j = 0; j < N; ++j) {
      *s += A[i][j] * X[j];
    }
    Y[i] += *s;
    free(s);
  }    
}

void bar(int Y[N], int A[N][N], int X[N]) {
  for (int i = 0; i < N; ++i) { 
    int tmp;
    int *s = &tmp;
    *s = Y[i];
    for (int j = 0; j < N; ++j) {
      *s += A[i][j] * X[j];
    }
    Y[i] += *s;
  }    
}

(That said, it does not mean we can't do better modeling at the MLIR level)

Thanks very much for trying this out. I was just running mem2reg in isolation, but the -O2 pipeline probably includes a pass that replaces heap allocations with stack ones, or something to that effect, after which mem2reg works. I can confirm that opt -O2 does get rid of the heap allocations for all single element memref's (not just 0-d memref's, but also memref's like memref<1x1xf32>, etc.), and they are all replaced by scalars. (In all the use cases in question, these single-element memref's do not escape; they are the trivial copy buffers generated with -affine-data-copy-generate.)
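
For reference, a pipeline along these lines is what that check looks like (file name illustrative):

# Sketch: lower the MLIR to LLVM IR and run the full -O2 pipeline on it.
mlir-opt -lower-affine -lower-to-llvm /tmp/test.mlir \
  | mlir-translate -mlir-to-llvmir \
  | opt -O2 -S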

This looks fine for now as long as one relies on LLVM, but at some point, this could be done in MLIR for a more unified approach.

bondhugula commented 5 years ago

This sounds pretty much like the same kind of problem as the one discussed here #55 (comment). I think we should try to follow the same approach for both. I see the value of turning mallocs into llvm allocas for 0D memrefs and even for other small memrefs. However, this optimization should be optional. We may not want to apply it when an mlir alloc is going to be lowered to a custom memory allocator, since the memory allocator may expect to have control of all the allocated memory. Diego

Yes, I agree. We can continue the discussion on #55.