llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

[mlir][linalg] `linalg.generic` miscompilation in case of multiple return values #94179

Open pingshiyu opened 1 month ago

pingshiyu commented 1 month ago

I've been toying with the semantics of linalg.generic lately, and I came across this program:

module {
    func.func private @printMemrefI64(tensor<*xi64>)
    func.func private @printMemrefI32(tensor<*xi32>)
    func.func private @printMemrefF64(tensor<*xf64>)
    func.func private @printMemrefF32(tensor<*xf32>)
    func.func @main() {
        %cst = arith.constant dense<14> : tensor<1xi64>
        %cst_0 = arith.constant dense<15> : tensor<1xi64>
        %cst_1 = arith.constant dense<[[[-1], [-2], [10]]]> : tensor<1x3x1xi64>
        %0:2 = linalg.generic {
            indexing_maps = [
                // iteration space:
                // d0 -> [0, 0]
                // d1 -> [0, 2]
                affine_map<(d0, d1) -> (d0)>,
                affine_map<(d0, d1) -> (d0)>,
                affine_map<(d0, d1) -> (d0, d1, d0)>], 
            iterator_types = ["reduction", "reduction"]} 
        ins(%cst : tensor<1xi64>) outs(%cst_0, %cst_1 : tensor<1xi64>, tensor<1x3x1xi64>) {
        ^bb0(%in: i64, %out: i64, %out_3: i64):
            // %in=14, %out=15, %out_3=-1 -> %1=-1; expected: %0#0[0]=-1, %0#1[0,0,0]=15
            // %in=14, %out=15, %out_3=-2 -> %1=-2; expected: %0#0[0]=-2, %0#1[0,1,0]=15
            // %in=14, %out=15, %out_3=10 -> %1=10; expected: %0#0[0]=10, %0#1[0,2,0]=15
            %1 = arith.minsi %out_3, %out_3 : i64
            linalg.yield %1, %out : i64, i64
        } -> (tensor<1xi64>, tensor<1x3x1xi64>)
        vector.print str "%0#0="
        %cast = tensor.cast %0#0 : tensor<1xi64> to tensor<*xi64>
        // expect: [10]
        call @printMemrefI64(%cast) : (tensor<*xi64>) -> () 
        vector.print str "%0#1="
        %cast_2 = tensor.cast %0#1 : tensor<1x3x1xi64> to tensor<*xi64> 
        // expect: [[[15], [15], [15]]]
        // actual: [[[15], [-1], [-2]]]
        call @printMemrefI64(%cast_2) : (tensor<*xi64>) -> () 
        return
    }
}

where I left my understanding of the iteration space, and the expected outputs in the comments.

Expected Behaviour

Based on the analysis within the comments, the expected output is:

%0#0=...
[10]
%0#1=...
[[[15],
  [15],
  [15]]]

This writes the last element of %cst_1 into %0#0 and broadcasts %cst_0 into every element of %0#1. However, the actual output is:

%0#0=...
[10]
%0#1=...
[[[15],
  [-1],
  [-2]]]
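My expected output corresponds to the following Python sketch of the semantics I had in mind, where every iteration reads the initial (pre-execution) values of the output operands (a "snapshot" reading; this is my assumption about the semantics, not something the op documentation states):

```python
# Sketch of the *expected* semantics: each iteration reads the initial
# values of the output operands, never the values written by earlier
# iterations.
cst = [14]                    # %cst   : tensor<1xi64>
cst_0 = [15]                  # %cst_0 : tensor<1xi64>
cst_1 = [[[-1], [-2], [10]]]  # %cst_1 : tensor<1x3x1xi64>

# Fresh copies that the loop writes into.
out0 = list(cst_0)
out1 = [[[v for v in col] for col in plane] for plane in cst_1]

for d0 in range(1):      # d0 ranges over [0, 0]
    for d1 in range(3):  # d1 ranges over [0, 2]
        _in = cst[d0]              # map (d0, d1) -> (d0)
        out = cst_0[d0]            # initial value (my assumption)
        out_3 = cst_1[d0][d1][d0]  # map (d0, d1) -> (d0, d1, d0)
        r = min(out_3, out_3)      # arith.minsi %out_3, %out_3
        out0[d0] = r               # linalg.yield %1, ...
        out1[d0][d1][d0] = out     # linalg.yield ..., %out

print(out0)  # [10]
print(out1)  # [[[15], [15], [15]]]
```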

Reproduction

Lowering and executing the above program with mlir-cpu-runner like so:

mlir-opt prog.mlir -one-shot-bufferize -func-bufferize -cse -canonicalize -convert-vector-to-scf -test-lower-to-llvm | mlir-cpu-runner -e main --entry-point-result void --shared-libs="lib/mlir/libmlir_c_runner_utils.so,lib/mlir/libmlir_runner_utils.so"

The program seems to produce an output that doesn't align with my intuition (also in the comments) - did I misunderstand something about linalg.generic or is the compiler wrong here?

Reproduction for trunk here (to get the llvm IR that reproduces the behaviour seen above): https://godbolt.org/z/z31h71hYP

llvmbot commented 1 month ago

@llvm/issue-subscribers-mlir-linalg

Author: Jacob Yu (pingshiyu)

ftynse commented 1 month ago

There is an implicit convention in Linalg that it must define/overwrite the entire result. Indexing maps such as affine_map<(d0, d1) -> (d0, d1, d0)> go against that convention, leading to unpredictable results (specifically, values that were not indexed through this map are "uninitialized" and may contain garbage).

Maybe we should try adding a verifier for this. cc @nicolasvasilache

pingshiyu commented 1 month ago

Thank you @ftynse for taking a look!

There is an implicit convention in Linalg that it must define/overwrite the entire result. Indexing maps such as affine_map<(d0, d1) -> (d0, d1, d0)> go against that convention, leading to unpredictable results (specifically, values that were not indexed through this map are "uninitialized" and may contain garbage).

I do remember this restriction of overwriting the entire result from our earlier discussion (https://github.com/llvm/llvm-project/issues/94180) - however, my understanding is that here the whole range is being overwritten. The type of %cst_1 is tensor<1x3x1xi64>, and it seems to me the map affine_map<(d0, d1) -> (d0, d1, d0)> does actually write over the whole range (i.e. d0 ranging over [0, 0], d1 ranging over [0, 2]), or am I misunderstanding something here?
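For what it's worth, the coverage claim can be checked mechanically with a small Python sketch that enumerates the index set produced by the map over the iteration space:

```python
# Indices written by affine_map<(d0, d1) -> (d0, d1, d0)> over the
# iteration space d0 in [0, 0], d1 in [0, 2].
written = {(d0, d1, d0) for d0 in range(1) for d1 in range(3)}

# All indices of tensor<1x3x1xi64>.
all_indices = {(i, j, k) for i in range(1) for j in range(3) for k in range(1)}

# The map does cover the entire result.
print(written == all_indices)  # True
```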

ftynse commented 1 month ago

The type of %cst_1 is tensor<1x3x1xi64>, and it seems to me the map affine_map<(d0, d1) -> (d0, d1, d0)> does actually write over the whole range, or am I misunderstanding something here?

You are right, I hadn't paid attention to the size.

There is another issue though. The comments assume that %out and %out_3 are assigned the values that had been stored in outputs before the execution began. This is not correct. They get assigned the current value, which could have been updated by previous iterations. That's how values get accumulated in actual reductions (the code here isn't really performing a reduction).

Assuming naive sequential loop execution order, you'll get

// %in=14, %out=15 (read from %0#0[0]), %out_3=-1 -> %1=-1; yield -1, 15; actual: %0#0[0]=-1, %0#1[0,0,0]=15
// %in=14, %out=-1 (read from %0#0[0]), %out_3=-2 -> %1=-2; yield -2, -1; actual: %0#0[0]=-2, %0#1[0,1,0]=-1
// %in=14, %out=-2 (read from %0#0[0]), %out_3=10 -> %1=10; yield 10, -2; actual: %0#0[0]=10, %0#1[0,2,0]=-2

exactly the output you see. Note, however, that even for reductions the order of iterations is not guaranteed, so the results are actually undefined.
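That trace can be reproduced with a short Python sketch of a naive sequential lowering (the loop order here is an assumption for illustration; the iteration order is not actually guaranteed):

```python
# Naive sequential simulation of the linalg.generic: the output operands
# are read and written in place, so each iteration observes the updates
# made by earlier iterations.
cst = [14]                   # %cst   : tensor<1xi64>
out0 = [15]                  # starts as %cst_0
out1 = [[[-1], [-2], [10]]]  # starts as %cst_1

for d0 in range(1):      # d0 ranges over [0, 0]
    for d1 in range(3):  # d1 ranges over [0, 2]
        _in = cst[d0]              # map (d0, d1) -> (d0)
        out = out0[d0]             # current value, not the initial one
        out_3 = out1[d0][d1][d0]   # map (d0, d1) -> (d0, d1, d0)
        r = min(out_3, out_3)      # arith.minsi %out_3, %out_3
        out0[d0] = r               # linalg.yield %1, ...
        out1[d0][d1][d0] = out     # linalg.yield ..., %out

print(out0)  # [10]
print(out1)  # [[[15], [-1], [-2]]]
```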

Corollary: iteration dimensions associated with reduction loops must not appear in result indexings.

pingshiyu commented 1 month ago

Thank you for that! It makes sense :)

I have a followup question regarding parallel vs reduction semantics:

Corollary: iteration dimensions associated with reduction loops must not appear in result indexings.

To make this clearer: the reason is that with reduction loops, the elements being written to are also the elements being read from in later iterations, so there would be a data race whenever a result index involves a reduction dimension. Is that accurate?

What about parallel loops? I observe the same result for the above program when I change the iterator types to parallel. But for the parallel case, it seems like the initial comments

// %in=14, %out=15, %out_3=-1 -> %1=-1; expected: %0#0[0]=-1, %0#1[0,0,0]=15
// %in=14, %out=15, %out_3=-2 -> %1=-2; expected: %0#0[0]=-2, %0#1[0,1,0]=15
// %in=14, %out=15, %out_3=10 -> %1=10; expected: %0#0[0]=10, %0#1[0,2,0]=15

would apply. I haven't been able to wrap my head around this behaviour.

ftynse commented 1 month ago

Same logic still applies:

The comments assume that %out and %out_3 are assigned the values that had been stored in outputs before the execution began. This is not correct. They get assigned the current value, which could have been updated by previous iterations.