Open pingshiyu opened 1 month ago
@llvm/issue-subscribers-mlir-linalg
Author: Jacob Yu (pingshiyu)
There is an implicit convention in Linalg that an operation must define/overwrite the entire result. Indexing maps such as affine_map<(d0, d1) -> (d0, d1, d0)> go against that convention, leading to unpredictable results (specifically, values that were not indexed through this map are "uninitialized" and may contain garbage).
Maybe we should try adding a verifier for this. cc @nicolasvasilache
Thank you @ftynse for taking a look!
I do remember this restriction of overwriting the entire result from our earlier discussions (https://github.com/llvm/llvm-project/issues/94180). Here, however, my understanding is that the whole range is being overwritten. The type of %cst_1 is tensor<1x3x1xi64>, and it seems to me that the map affine_map<(d0, d1) -> (d0, d1, d0)> does actually write over the whole range (i.e. d0 ranging over [0,0] and d1 ranging over [0,2]), or am I misunderstanding something here?
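The coverage argument above can be checked by brute force. The following is only a rough Python sketch, not anything from MLIR itself: the helper name `coverage` is hypothetical, and it assumes the iteration bounds for d0 and d1 are taken from the first two dimensions of the result shape.

```python
from itertools import product

def coverage(shape):
    """Count elements written through the map (d0, d1) -> (d0, d1, d0).

    Hypothetical helper: assumes d0 ranges over shape[0] and d1 over shape[1].
    Returns (number of distinct elements written, total number of elements).
    """
    d0_size, d1_size, _ = shape
    written = {(d0, d1, d0) for d0, d1 in product(range(d0_size), range(d1_size))}
    total = shape[0] * shape[1] * shape[2]
    return len(written), total

# tensor<1x3x1xi64>: every element is written, so the convention holds here.
print(coverage((1, 3, 1)))  # (3, 3)
# A hypothetical 2x3x2 result: only the "diagonal" in d0 is written.
print(coverage((2, 3, 2)))  # (6, 12)
```

For the 1x3x1 case the map is total, which is why the "uninitialized values" concern does not apply to this particular program, while a larger first dimension would leave elements untouched.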
You are right, I hadn't paid attention to the size.
There is another issue though. The comments assume that %out and %out_3 are assigned the values that had been stored in outputs before the execution began. This is not correct. They get assigned the current value, which could have been updated by previous iterations. That's how values get accumulated in actual reductions (the code here isn't really performing a reduction).
Assuming naive sequential loop execution order, you'll get
// %in=14, %out=15 (read from %0#0[0]), %out_3=-1 -> %1=-1; yield -1, 15; actual: %0#0[0]=-1, %0#1[0,0,0]=15
// %in=14, %out=-1 (read from %0#0[0]), %out_3=-2 -> %1=-2; yield -2, -1; actual: %0#0[0]=-2, %0#1[0,1,0]=-1
// %in=14, %out=-2 (read from %0#0[0]), %out_3=10 -> %1=10; yield 10, -2; actual: %0#0[0]=10, %0#1[0,2,0]=-2
exactly the output you see. Note, however, that even for reductions the order of iterations is not guaranteed, so the results are actually undefined.
Corollary: iteration dimensions associated with reduction loops must not appear in result indexings.
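To make the trace concrete, here is a small Python model of the loop, offered as a sketch rather than as what the compiler actually emits. The function name `run_sequential` and the `read_current` flag are made up for illustration. The body is modeled as yielding (%out_3, %out), which matches every line of the trace; the op that actually computes %1 is not shown in this thread, so that part is an assumption. %in is ignored because it does not affect the traced values.

```python
def run_sequential(read_current=True):
    """Model iterations d1 = 0, 1, 2 (d0 is always 0) in sequential order.

    read_current=True : block arguments see the current output values.
    read_current=False: block arguments see the values from before execution
                        (the assumption made in the original comments).
    """
    init0, init1 = [15], [-1, -2, 10]    # initial contents of %0#0 and %0#1
    cur0, cur1 = list(init0), list(init1)
    for d1 in range(3):
        src0, src1 = (cur0, cur1) if read_current else (init0, init1)
        out, out_3 = src0[0], src1[d1]   # %out and %out_3 for this iteration
        result = out_3                   # assumption: %1 equals %out_3, as in the trace
        cur0[0] = result                 # first yielded value  -> %0#0[0]
        cur1[d1] = out                   # second yielded value -> %0#1[0, d1, 0]
    return cur0, cur1

print(run_sequential(read_current=True))   # ([10], [15, -1, -2])
print(run_sequential(read_current=False))  # ([10], [15, 15, 15])
```

The first call reproduces the observed output, and the second reproduces the output the original comments expected, so the only difference between the two is which values the block arguments read.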
Thank you for that! It makes sense :)
I have a followup question regarding parallel vs reduction semantics:
Corollary: iteration dimensions associated with reduction loops must not appear in result indexings.
To make this clearer: the reason is that with reduction loops, the elements being written to are also the elements being read from, so there would be a data race whenever a result is indexed by a reduction dimension. Is that accurate?
What about for parallel loops? I notice the same result on the above program when I change the loop iterators to parallel. But for the parallel case, it seems like the initial comments
// %in=14, %out=15, %out_3=-1 -> %1=-1; expected: %0#0[0]=-1, %0#1[0,0,0]=15
// %in=14, %out=15, %out_3=-2 -> %1=-2; expected: %0#0[0]=-2, %0#1[0,1,0]=15
// %in=14, %out=15, %out_3=10 -> %1=10; expected: %0#0[0]=10, %0#1[0,2,0]=15
would apply. I haven't been able to wrap my head around this behaviour.
Same logic still applies:
The comments assume that %out and %out_3 are assigned the values that had been stored in outputs before the execution began. This is not correct. They get assigned the current value, which could have been updated by previous iterations.
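One way to see why the result is undefined regardless of the iterator type is to replay the same loop under every possible iteration order. This is again only a Python sketch under the same assumptions as the trace above: the helper `run` is hypothetical, the body is modeled as yielding (%out_3, %out), and block arguments read the current output values.

```python
from itertools import permutations

def run(order):
    """Execute the three d1 iterations in the given order, reading current values."""
    out0 = [15]           # %0#0, shared by every iteration
    out1 = [-1, -2, 10]   # %0#1, one element per d1
    for d1 in order:
        out, out_3 = out0[0], out1[d1]
        out0[0], out1[d1] = out_3, out   # assumed yield: (%out_3, %out)
    return out0[0], tuple(out1)

results = {order: run(order) for order in permutations(range(3))}
for order, result in results.items():
    print(order, result)

# Every one of the six orders produces a different final state.
print(len(set(results.values())))  # 6
```

Because every iteration reads and writes the single element %0#0[0], each ordering gives a different answer, and marking the loop as parallel does not change that dependence.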
I've been toying with the semantics of linalg.generic lately, and I came across this program, where I left my understanding of the iteration space and the expected outputs in the comments.
Expected Behaviour
Based on the analysis within the comments, the expected output is:
Which writes the last element of %cst_1 into %0#0, and duplicates %cst_0 into all elements of %0#1. However, the actual output is:
Reproduction
Lowering and executing the above program with the interpreter like so:
mlir-opt prog.mlir -one-shot-bufferize -func-bufferize -cse -canonicalize -convert-vector-to-scf -test-lower-to-llvm | mlir-cpu-runner -e main --entry-point-result void --shared-libs="lib/mlir/libmlir_c_runner_utils.so,lib/mlir/libmlir_runner_utils.so"
The program seems to produce an output that doesn't align with my intuition (also in the comments) - did I misunderstand something about linalg.generic or is the compiler wrong here?
Reproduction for trunk here (to get the llvm IR that reproduces the behaviour seen above): https://godbolt.org/z/z31h71hYP