Adding this to compiler bugs until proven otherwise.
The fact that a smaller repro didn't cause an issue suggests this could be a stream allocation issue; I will start by looking there. Also @zjgarvey, how do you know which of these two numerics are incorrect? With the input generator (thanks for providing that), both outputs are just random numbers.
I was originally using the test suite and comparing against the onnxruntime CPU results.
You can verify that the simplified IR is incorrect from IREE alone: truncate the IR so that it returns %169 (the add node), and call this "add_result.mlir". The first few values of this add result should match the first few values of the correct concat operation. By inspection, the unsimplified IR result matches the add result, whereas the simplified IR generates an output with completely different values.
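For example, a quick way to compare the first few values of the dumped outputs (the file names below are hypothetical; substitute whatever each run's result was dumped to):

import numpy as np

# Hypothetical file names for raw f32 outputs from the three runs.
add_result   = np.fromfile("add_result.bin", dtype=np.float32)
unsimplified = np.fromfile("unsimplified_output.bin", dtype=np.float32)
simplified   = np.fromfile("simplified_output.bin", dtype=np.float32)

n = 8  # only the first few values are needed to see the mismatch
print("add result  :", add_result[:n])
print("unsimplified:", unsimplified[:n])  # matches the add result
print("simplified  :", simplified[:n])    # completely different values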
This issue simply comes down to the fact that in the simplified IR we are able to fuse the add into the linalg.batch_mmt4d op, which we couldn't fuse in the unsimplified case, and the dispatch below is giving wrong numerics:
#map11 = affine_map<(d0, d1, d2, d3, d4) -> (d0, d1, d2, d3, d4)>
func.func @torch_jit_dispatch_56_batch_mmt4d_1x1x32x64x1x4x1_f32(%4: tensor<1x1x64x1x1xf32>, %5: tensor<1x32x64x4x1xf32>, %6: tensor<1x1x32x1x4xf32>) -> tensor<1x1x32x1x4xf32> {
%c0 = arith.constant 0 : index
%cst = arith.constant 0.000000e+00 : f32
%7 = tensor.empty() : tensor<1x1x32x1x4xf32>
%8 = linalg.fill ins(%cst : f32) outs(%7 : tensor<1x1x32x1x4xf32>) -> tensor<1x1x32x1x4xf32>
%9 = linalg.batch_mmt4d ins(%4, %5 : tensor<1x1x64x1x1xf32>, tensor<1x32x64x4x1xf32>) outs(%8 : tensor<1x1x32x1x4xf32>) -> tensor<1x1x32x1x4xf32>
%10 = linalg.generic {indexing_maps = [#map11, #map11, #map11],
iterator_types = ["parallel", "parallel", "parallel", "parallel", "parallel"]}
ins(%6, %9 : tensor<1x1x32x1x4xf32>, tensor<1x1x32x1x4xf32>)
outs(%7 : tensor<1x1x32x1x4xf32>) {
^bb0(%in: f32, %in_0: f32, %out: f32):
%11 = arith.addf %in, %in_0 : f32
linalg.yield %11 : f32
} -> tensor<1x1x32x1x4xf32>
return %10 : tensor<1x1x32x1x4xf32>
}
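For reference, here is a minimal numpy model of what this dispatch should compute (my sketch, not IREE code; the random inputs are placeholders). linalg.batch_mmt4d contracts over the K and k0 dimensions of the packed operands, and the trailing linalg.generic adds %6 elementwise:

import numpy as np

rng = np.random.default_rng(0)
lhs  = rng.random((1, 1, 64, 1, 1)).astype(np.float32)   # %4: B x M x K x m0 x k0
rhs  = rng.random((1, 32, 64, 4, 1)).astype(np.float32)  # %5: B x N x K x n0 x k0
bias = rng.random((1, 1, 32, 1, 4)).astype(np.float32)   # %6: B x M x N x m0 x n0

# out[b, m, n, m0, n0] = sum over k, k0 of lhs[b, m, k, m0, k0] * rhs[b, n, k, n0, k0]
mmt4d = np.einsum("bmkxz,bnkyz->bmnxy", lhs, rhs)

# The fused dispatch should return the elementwise sum.
expected = bias + mmt4d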
Without the fused elementwise op, it gives the correct numerics. Investigating now why that would be.
What happened?
In this gist, there are two very similar linalg IR reproducers. When compiled and run on CPU, they unexpectedly generate mismatching results.
The only difference between these two IRs is the very last sequence:
Notice the cast operation %cast_83, which somewhat stupidly makes the first dim dynamic before passing to %concat_86. However, this IR produces correct numerics. Alternatively, the simplified version below will give wildly incorrect numerics:
Steps to reproduce your issue
iree-compile --iree-hal-target-backends=llvm-cpu correct_numerics.mlir -o correct_numerics.vmfb
iree-compile --iree-hal-target-backends=llvm-cpu wrong_numerics.mlir -o wrong_numerics.vmfb
import struct
import numpy

rng = numpy.random.default_rng(19)
a = rng.random((1, 3, 224, 224)).astype(numpy.float32)
with open("input.0.bin", "wb") as f:
    mylist = a.flatten().tolist()
    bytearr = struct.pack("%sf" % len(mylist), *mylist)
    f.write(bytearr)
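Then run both modules on the same input and compare the outputs (flags per a recent IREE build; the entry-point name below is a guess on my part, so use whatever the func.func in the IR is actually named):

iree-run-module --module=correct_numerics.vmfb --function=torch_jit --input=1x3x224x224xf32=@input.0.bin
iree-run-module --module=wrong_numerics.vmfb --function=torch_jit --input=1x3x224x224xf32=@input.0.bin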
By contrast, removing the cast operation in a smaller repro does not reproduce this issue.
I have not tried reproducing this issue on other backends/devices.