iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.
http://iree.dev/
Apache License 2.0

NaN results for the CPU backend #15137

Open vivekkhandelwal1 opened 1 year ago

vivekkhandelwal1 commented 1 year ago

What happened?

Hi, I'm getting an all-NaN output from the falcon-180B-gptq model on the CPU backend, while PyTorch gives the correct result. I have generated a smaller repro from the original IR which also produces all-NaN output.

To reproduce the issue, download the IR and the model inputs (input1 and input2). You can also find the pre-compiled vmfb here.

Steps to reproduce your issue

Please download the IR and the inputs from the above link and then run the following commands.

Run the following command for compilation:

iree-compile \
  --iree-input-type=none \
  --iree-vm-bytecode-module-output-format=flatbuffer-binary \
  --iree-hal-target-backends=llvm-cpu \
  --mlir-print-debuginfo \
  --mlir-print-op-on-diagnostic=false \
  --iree-llvmcpu-target-cpu-features=host \
  --iree-stream-resource-index-bits=64 \
  --iree-vm-target-index-bits=64 \
  --iree-vm-bytecode-module-strip-source-map=false \
  --iree-util-zero-fill-elided-attrs \
  --iree-vm-target-truncate-unsupported-floats \
  --iree-codegen-check-ir-before-llvm-conversion=false \
  --iree-stream-resource-max-allocation-size=3221225472 \
  --iree-llvmcpu-target-triple=x86_64-linux-gnu \
  falcon_180b_gptq_repro_ir.mlir -o falcon_180b_gptq_nan_repro.vmfb

Then use the following command for execution:

iree-run-module --device=local-task --module=falcon_180b_gptq_nan_repro.vmfb --function=forward --input=@falcon_gptq_repro_input1.npy --input=@falcon_gptq_repro_input2.npy

What component(s) does this issue relate to?

Runtime

Version information

No response

Additional context

No response

vivekkhandelwal1 commented 1 year ago

Hi, I have dug into the issue and found the exact op and the set of inputs that produce the NaN values. The culprit is the softmax op. Here's the linalg IR that results in NaN values:

module attributes {torch.debug_module_name = "SoftMaxClass"} {
  ml_program.global private mutable @global_seed(dense<0> : tensor<i64>) : tensor<i64>
  func.func @forward(%arg0: tensor<1x232x100x100xf32>) -> tensor<1x232x100x100xf32> {
    %c0_i64 = arith.constant 0 : i64
    %cst = arith.constant 0xFF800000 : f32
    %cst_0 = arith.constant 0.000000e+00 : f32
    %0 = tensor.empty() : tensor<1x232x100x1xi64>
    %1 = linalg.fill ins(%c0_i64 : i64) outs(%0 : tensor<1x232x100x1xi64>) -> tensor<1x232x100x1xi64>
    %2 = tensor.empty() : tensor<1x232x100x1xf32>
    %3 = linalg.fill ins(%cst : f32) outs(%2 : tensor<1x232x100x1xf32>) -> tensor<1x232x100x1xf32>
    %4:2 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, 0)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, 0)>], iterator_types = ["parallel", "parallel", "parallel", "reduction"]} ins(%arg0 : tensor<1x232x100x100xf32>) outs(%3, %1 : tensor<1x232x100x1xf32>, tensor<1x232x100x1xi64>) {
    ^bb0(%in: f32, %out: f32, %out_1: i64):
      %11 = linalg.index 3 : index
      %12 = arith.index_cast %11 : index to i64
      %13 = arith.maximumf %in, %out : f32
      %14 = arith.cmpf ogt, %in, %out : f32
      %15 = arith.select %14, %12, %out_1 : i64
      linalg.yield %13, %15 : f32, i64
    } -> (tensor<1x232x100x1xf32>, tensor<1x232x100x1xi64>)
    %5 = tensor.empty() : tensor<1x232x100x100xf32>
    %6 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (0, d1, d2, d3)>, affine_map<(d0, d1, d2, d3) -> (0, d1, d2, 0)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>], iterator_types = ["parallel", "parallel", "parallel", "parallel"]} ins(%arg0, %4#0 : tensor<1x232x100x100xf32>, tensor<1x232x100x1xf32>) outs(%5 : tensor<1x232x100x100xf32>) {
    ^bb0(%in: f32, %in_1: f32, %out: f32):
      %11 = arith.subf %in, %in_1 : f32
      linalg.yield %11 : f32
    } -> tensor<1x232x100x100xf32>
    %7 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (0, d1, d2, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>], iterator_types = ["parallel", "parallel", "parallel", "parallel"]} ins(%6 : tensor<1x232x100x100xf32>) outs(%5 : tensor<1x232x100x100xf32>) {
    ^bb0(%in: f32, %out: f32):
      %11 = math.exp %in : f32
      linalg.yield %11 : f32
    } -> tensor<1x232x100x100xf32>
    %8 = linalg.fill ins(%cst_0 : f32) outs(%2 : tensor<1x232x100x1xf32>) -> tensor<1x232x100x1xf32>
    %9 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, 0)>], iterator_types = ["parallel", "parallel", "parallel", "reduction"]} ins(%7 : tensor<1x232x100x100xf32>) outs(%8 : tensor<1x232x100x1xf32>) {
    ^bb0(%in: f32, %out: f32):
      %11 = arith.addf %in, %out : f32
      linalg.yield %11 : f32
    } -> tensor<1x232x100x1xf32>
    %10 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (0, d1, d2, d3)>, affine_map<(d0, d1, d2, d3) -> (0, d1, d2, 0)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>], iterator_types = ["parallel", "parallel", "parallel", "parallel"]} ins(%7, %9 : tensor<1x232x100x100xf32>, tensor<1x232x100x1xf32>) outs(%5 : tensor<1x232x100x100xf32>) {
    ^bb0(%in: f32, %in_1: f32, %out: f32):
      %11 = arith.divf %in, %in_1 : f32
      linalg.yield %11 : f32
    } -> tensor<1x232x100x100xf32>
    return %10 : tensor<1x232x100x100xf32>
  }
}
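
For reference, this is the standard numerically stable max-subtract form of softmax: the first generic max-reduces along the last axis (initialized to -inf via %cst), the next two shift by the max and exponentiate, and the last two sum-reduce and normalize. A minimal NumPy sketch of the same computation (shapes taken from the IR above; the code itself is illustrative, not from the thread):

import numpy as np

def softmax_last_axis(x: np.ndarray) -> np.ndarray:
    m = np.max(x, axis=-1, keepdims=True)   # %4#0: max-reduce, init -inf
    e = np.exp(x - m)                       # %6, %7: shift, then exponentiate
    s = np.sum(e, axis=-1, keepdims=True)   # %9: sum-reduce, init 0.0
    return e / s                            # %10: normalize

x = np.random.randn(1, 232, 100, 100).astype(np.float32)
assert not np.isnan(softmax_last_axis(x)).any()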

For compilation, run:

iree-compile \
  --iree-input-type=none \
  --iree-vm-bytecode-module-output-format=flatbuffer-binary \
  --iree-hal-target-backends=llvm-cpu \
  --mlir-print-debuginfo \
  --mlir-print-op-on-diagnostic=false \
  --iree-llvmcpu-target-cpu-features=host \
  --iree-stream-resource-index-bits=64 \
  --iree-vm-target-index-bits=64 \
  --iree-vm-bytecode-module-strip-source-map=false \
  --iree-util-zero-fill-elided-attrs \
  --iree-vm-target-truncate-unsupported-floats \
  --iree-codegen-check-ir-before-llvm-conversion=false \
  --iree-stream-resource-max-allocation-size=3221225472 \
  --iree-llvmcpu-target-triple=x86_64-linux-gnu \
  softmax_repro.mlir -o softmax_repro.vmfb

Then run the vmfb using:

iree-run-module --device=local-task --module=softmax_repro.vmfb --function=forward --input=@softmax_repro_input.npy

The input to this can be accessed here.

Also, here's the Python script which compares the softmax result for the given input across PyTorch vs. Torch-MLIR RefBackend vs. IREE-SHARK. The IREE-SHARK output is all NaN, while the other two produce accurate results.
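
The linked script is not reproduced above; a hypothetical minimal version of the same check (file name taken from the run command, torch assumed available) could look like:

import numpy as np
import torch

x = np.load("softmax_repro_input.npy")
ref = torch.nn.functional.softmax(torch.from_numpy(x), dim=-1).numpy()
print("any NaN in PyTorch result:", np.isnan(ref).any())  # expected: False
# The full script additionally runs the vmfb through IREE and compares its
# output against ref, which is where the all-NaN result shows up.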

stellaraccident commented 1 year ago

@MaheshRavishankar just pattern matching, but I've been seeing softmax issues fly by, although I can't connect any of them to a numeric issue like this.

There is also a patch on maxf in flight, which could affect numeric stability.

MaheshRavishankar commented 1 year ago

Could someone try this patch https://github.com/openxla/iree/pull/15130 and set the flag added in that patch to true?

I don't know what the inputs are, but maybe they were always NaN and now they are propagated as such.

MaheshRavishankar commented 1 year ago

Also this form of softmax is numerically unstable.

stellaraccident commented 1 year ago

> Also this form of softmax is numerically unstable.

How are you determining that? (I believe you but was looking and concluded the opposite due to the epsilon).

MaheshRavishankar commented 1 year ago

> > Also this form of softmax is numerically unstable.
>
> How are you determining that? (I believe you but was looking and concluded the opposite due to the epsilon).

Yeah, I was looking on my phone and didn't follow it properly. This form is the more stable one. Sorry for the noise.

Apart from the change that makes the max lowering the NaN-propagating version by default, there is nothing I can point to as the root cause. I am AFK for most of the day, so if someone can try the patch above with the non-NaN-propagating version, at least we can rule that out (I am hoping it is that, because I can't really see what else would have introduced a bug, and we have e2e tests for softmax as well).
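
For context, the two max semantics under discussion differ only when a NaN is present: a cmpf-ogt/select reduction silently drops NaNs (NaN > x is false for every x), while a NaN-propagating max lets a single NaN poison the whole reduction. A small NumPy illustration (written for exposition, not IREE code):

import numpy as np

vals = np.array([1.0, np.nan, 3.0], dtype=np.float32)

acc = np.float32(-np.inf)
for v in vals:                   # ogt/select form: a NaN never wins
    acc = v if v > acc else acc
print(acc)                       # 3.0

acc = np.float32(-np.inf)
for v in vals:                   # NaN-propagating form (like arith.maximumf)
    acc = np.maximum(acc, v)     # np.maximum propagates NaN
print(acc)                       # nan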

vivekkhandelwal1 commented 1 year ago

Fixed by https://github.com/openxla/iree/pull/15130

vivekkhandelwal1 commented 1 year ago

Reopening this issue since top-of-main (TOM) IREE is again producing NaN results. I reverted all the commits back to https://github.com/openxla/iree/commit/9181525e43e117dec6bfb3464bd2ea7a5fb64e84, and the issue went away. I will bisect to find the commit that is causing this issue.

vivekkhandelwal1 commented 1 year ago

The issue exists for the ROCm backend too. To reproduce it, run the following script https://gist.github.com/vivekkhandelwal1/02034671d205c560185a2faafa3e39bd with the given input.

Abhishek-Varma commented 1 year ago

I've added a fix upstream which should resolve the NaN issue here: Fix arith::AtomicRMWKind::maximumf init value.

Summary of the triage/fix:
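
As background on why a wrong init value for a maximumf reduction can produce an all-NaN softmax (an illustrative sketch, not the PR's exact change): if the max reduction is seeded with a value larger than every element, exp(x - max) underflows to zero everywhere and the final normalization becomes 0/0 = NaN.

import numpy as np

x = np.random.randn(100).astype(np.float32)

m = np.float32(1e30)          # hypothetical wrong init, larger than any element
for v in x:
    m = np.maximum(m, v)      # the bad init survives the whole reduction
e = np.exp(x - m)             # underflows to 0.0 for every element
print(e / e.sum())            # 0/0 -> all NaN (NumPy warns about invalid value)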

dcaballe commented 1 year ago

(Flying by) Pinging @qcolombet, as I remember he submitted a PR about min/max initialization not so long ago.