vivekkhandelwal1 opened 1 year ago
Hi, I have been able to dig into the issue and found the exact op and the set of inputs that produce the NaN values. The culprit here is the softmax op. Here's the linalg IR that results in NaN values:
module attributes {torch.debug_module_name = "SoftMaxClass"} {
  ml_program.global private mutable @global_seed(dense<0> : tensor<i64>) : tensor<i64>
  func.func @forward(%arg0: tensor<1x232x100x100xf32>) -> tensor<1x232x100x100xf32> {
    %c0_i64 = arith.constant 0 : i64
    %cst = arith.constant 0xFF800000 : f32
    %cst_0 = arith.constant 0.000000e+00 : f32
    %0 = tensor.empty() : tensor<1x232x100x1xi64>
    %1 = linalg.fill ins(%c0_i64 : i64) outs(%0 : tensor<1x232x100x1xi64>) -> tensor<1x232x100x1xi64>
    %2 = tensor.empty() : tensor<1x232x100x1xf32>
    %3 = linalg.fill ins(%cst : f32) outs(%2 : tensor<1x232x100x1xf32>) -> tensor<1x232x100x1xf32>
    %4:2 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, 0)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, 0)>], iterator_types = ["parallel", "parallel", "parallel", "reduction"]} ins(%arg0 : tensor<1x232x100x100xf32>) outs(%3, %1 : tensor<1x232x100x1xf32>, tensor<1x232x100x1xi64>) {
    ^bb0(%in: f32, %out: f32, %out_1: i64):
      %11 = linalg.index 3 : index
      %12 = arith.index_cast %11 : index to i64
      %13 = arith.maximumf %in, %out : f32
      %14 = arith.cmpf ogt, %in, %out : f32
      %15 = arith.select %14, %12, %out_1 : i64
      linalg.yield %13, %15 : f32, i64
    } -> (tensor<1x232x100x1xf32>, tensor<1x232x100x1xi64>)
    %5 = tensor.empty() : tensor<1x232x100x100xf32>
    %6 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (0, d1, d2, d3)>, affine_map<(d0, d1, d2, d3) -> (0, d1, d2, 0)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>], iterator_types = ["parallel", "parallel", "parallel", "parallel"]} ins(%arg0, %4#0 : tensor<1x232x100x100xf32>, tensor<1x232x100x1xf32>) outs(%5 : tensor<1x232x100x100xf32>) {
    ^bb0(%in: f32, %in_1: f32, %out: f32):
      %11 = arith.subf %in, %in_1 : f32
      linalg.yield %11 : f32
    } -> tensor<1x232x100x100xf32>
    %7 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (0, d1, d2, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>], iterator_types = ["parallel", "parallel", "parallel", "parallel"]} ins(%6 : tensor<1x232x100x100xf32>) outs(%5 : tensor<1x232x100x100xf32>) {
    ^bb0(%in: f32, %out: f32):
      %11 = math.exp %in : f32
      linalg.yield %11 : f32
    } -> tensor<1x232x100x100xf32>
    %8 = linalg.fill ins(%cst_0 : f32) outs(%2 : tensor<1x232x100x1xf32>) -> tensor<1x232x100x1xf32>
    %9 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, 0)>], iterator_types = ["parallel", "parallel", "parallel", "reduction"]} ins(%7 : tensor<1x232x100x100xf32>) outs(%8 : tensor<1x232x100x1xf32>) {
    ^bb0(%in: f32, %out: f32):
      %11 = arith.addf %in, %out : f32
      linalg.yield %11 : f32
    } -> tensor<1x232x100x1xf32>
    %10 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (0, d1, d2, d3)>, affine_map<(d0, d1, d2, d3) -> (0, d1, d2, 0)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>], iterator_types = ["parallel", "parallel", "parallel", "parallel"]} ins(%7, %9 : tensor<1x232x100x100xf32>, tensor<1x232x100x1xf32>) outs(%5 : tensor<1x232x100x100xf32>) {
    ^bb0(%in: f32, %in_1: f32, %out: f32):
      %11 = arith.divf %in, %in_1 : f32
      linalg.yield %11 : f32
    } -> tensor<1x232x100x100xf32>
    return %10 : tensor<1x232x100x100xf32>
  }
}
For compilation, run:
iree-compile --iree-input-type=none --iree-vm-bytecode-module-output-format=flatbuffer-binary --iree-hal-target-backends=llvm-cpu --mlir-print-debuginfo --mlir-print-op-on-diagnostic=false --iree-llvmcpu-target-cpu-features=host --iree-stream-resource-index-bits=64 --iree-vm-target-index-bits=64 --iree-vm-bytecode-module-strip-source-map=false --iree-util-zero-fill-elided-attrs --iree-vm-target-truncate-unsupported-floats --iree-codegen-check-ir-before-llvm-conversion=false --iree-stream-resource-max-allocation-size=3221225472 --iree-llvmcpu-target-triple=x86_64-linux-gnu softmax_repro.mlir -o softmax_repro.vmfb
Then run the vmfb using:
iree-run-module --device=local-task --module=softmax_repro.vmfb --function=forward --input=@softmax_repro_input.npy
The input to this can be accessed here.
Also, here's the Python script that compares the softmax results for the given input across PyTorch, the Torch-MLIR RefBackend, and IREE-SHARK. The IREE-SHARK output is all NaN, while the other two produce accurate results.
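For reference, here is a minimal sketch of that kind of comparison (this is not the linked script; it only assumes the input saved as softmax_repro_input.npy from the commands above, and the IREE output from iree-run-module can be checked against it the same way):

import numpy as np
import torch

# Hypothetical comparison sketch: load the saved input and compare PyTorch's
# softmax against a plain NumPy reference along the last dimension.
x = np.load("softmax_repro_input.npy")            # (1, 232, 100, 100), float32
torch_out = torch.softmax(torch.from_numpy(x), dim=-1).numpy()

shifted = x - x.max(axis=-1, keepdims=True)       # max-subtracted form
numpy_out = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)

print("NaNs in torch output:", np.isnan(torch_out).any())
print("NaNs in numpy output:", np.isnan(numpy_out).any())
print("max abs diff:", np.abs(torch_out - numpy_out).max())

In this issue the two references agree, while the IREE result comes back as all NaN.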
@MaheshRavishankar just pattern matching, but I've been seeing softmax issues fly by, although I can't connect any of them to a numeric issue like this.
Also, a patch on maxf is in flight, which could affect numerical stability.
Could someone try this patch https://github.com/openxla/iree/pull/15130 and set the flag added in that patch to true?
I don't know what the inputs are, but maybe they were always NaN and now they are propagated as such.
Also this form of softmax is numerically unstable.
How are you determining that? (I believe you but was looking and concluded the opposite due to the epsilon).
Yeah, was looking on my phone and didn't follow properly. This form is the more stable one. Sorry for the noise.
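For anyone following along, a small NumPy illustration (not from the thread) of why the max-subtracted form lowered above is the stable one: the naive exp(x)/sum(exp(x)) overflows for large inputs, while subtracting the row max keeps every exponent at or below zero.

import numpy as np

x = np.array([1000.0, 1001.0, 1002.0], dtype=np.float32)

# Naive form: exp(1000) overflows to inf in f32, so the result is inf/inf = nan.
naive = np.exp(x) / np.exp(x).sum()

# Max-subtracted form (what the linalg IR above computes): exponents <= 0, no overflow.
stable = np.exp(x - x.max()) / np.exp(x - x.max()).sum()

print(naive)   # [nan nan nan]
print(stable)  # [0.09003057 0.24472848 0.66524094]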
Apart from the max op now defaulting to the NaN-propagating version, there is nothing I can point to as the root cause. I am AFK for most of the day, so if someone can try the patch above with the non-NaN-propagating version, at least we can rule that out (I am hoping it is that, because I can't really see what else would have introduced a bug, and we have e2e tests for softmax as well).
Reopening this issue since top-of-main IREE is again producing NaN results. I reverted all the commits back to https://github.com/openxla/iree/commit/9181525e43e117dec6bfb3464bd2ea7a5fb64e84, and the issue went away. I will bisect to find the commit that is causing this issue.
The issue exists for the ROCm backend too. To reproduce it, run the following script https://gist.github.com/vivekkhandelwal1/02034671d205c560185a2faafa3e39bd with the given input.
I've added a fix upstream which should resolve the NaN issue here: Fix arith::AtomicRMWKind::maximumf init value.
Summary of the triage/fix:
The init value used for the max reduction did not match arith::AtomicRMWKind::maximumf's semantics: it was -1.4e-45 instead of the lowest finite f32 value, -3.4e+38.
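To make the failure mode concrete, here is a hypothetical NumPy sketch (not taken from the fix itself) of what seeding a max reduction with -1.4e-45 instead of the lowest finite f32 does: for rows of very negative values the computed "max" stays near zero, exp underflows everywhere, and the final division becomes 0/0 = NaN.

import numpy as np

def row_max(x, init):
    # Emulate a max reduction whose accumulator is seeded with `init`.
    acc = np.full(x.shape[:-1] + (1,), init, dtype=x.dtype)
    for i in range(x.shape[-1]):
        acc = np.maximum(acc, x[..., i:i + 1])
    return acc

x = np.full((1, 4), -1.0e4, dtype=np.float32)       # row of large negative logits

for init in (np.float32(-1.4e-45), np.float32(-3.4e38)):
    m = row_max(x, init)                             # buggy vs. fixed init value
    e = np.exp(x - m)
    print(init, e / e.sum(axis=-1, keepdims=True))   # all-NaN vs. uniform 0.25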
(Flying by) Pinging @qcolombet as I remember he submitted a PR about min/max initialization not so long ago.
What happened?
Hi, I'm getting an all-NaN output for the falcon-180B-gptq model on the CPU backend, while PyTorch gives the correct result. I have generated a smaller repro from the original IR which also results in an all-NaN output.
To reproduce the issue, download the IR and the inputs (input1 and input2) to the model. Also, you can find the pre-compiled vmfb here.
Steps to reproduce your issue
Please download the IR and the inputs from the above link and then run the following commands.
Run the following command for compilation:
Then use the following command for execution:
What component(s) does this issue relate to?
Runtime
Version information
No response
Additional context
No response