microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

[Mobile] React Native app crash with Fatal signal 4 (SIGILL), code 1 (ILL_ILLOPC), fault addr 0x6f6d9217fc in tid 10411 (mqt_native_modu), pid 10224 (ReactNativeDemo) #17541

Closed bartproo closed 11 months ago

bartproo commented 12 months ago

Describe the issue

[image]

I can confirm that this error occurs when I run const fetches = await session.run(feeds);. Since this is a native crash, I have no idea how to fix it. Please help! The code works on a Pixel 4a emulator but not on my Samsung Note 10 Lite.

To reproduce

The code crashes after running const fetches = await session.run(feeds);. After setting a breakpoint, I traced the crash to the following point in OrtSession.java.

OrtSession.java crashing point:

      OnnxValue[] outputValues =
          run(
              OnnxRuntime.ortApiHandle,
              nativeHandle,
              allocator.handle,
              inputNamesArray,
              inputHandles,
              inputNamesArray.length,
              outputNamesArray,
              outputNamesArray.length,
              runOptionsHandle);
      return new Result(outputNamesArray, outputValues);

My code crashing point:

export const predictModelfromUri = async (
  session: ort.InferenceSession,
  imageUri: string
): Promise<number> => {
  const imageFloat32 = await convertImageToFloat32Array(imageUri);
  const feeds: Record<string, ort.Tensor> = {};
  feeds[session.inputNames[0]] = new ort.Tensor(
    "float32",
    imageFloat32!,
    [1, 3, 224, 224]
  );
  const fetches = await session.run(feeds);
  const output: object = fetches[session.outputNames[0]].data;
  return findMaxId(Object.values(output));
};
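
The snippet above depends on a findMaxId helper that is not shown in the issue, and on the input array matching the declared [1, 3, 224, 224] shape. A minimal sketch of what those pieces might look like follows. Both functions are assumptions for illustration (elementCount is a hypothetical name, and this findMaxId is a guess at the reporter's helper, not their actual code).

```typescript
// Hypothetical helpers; neither implementation appears in the original issue.

// Number of elements implied by a tensor shape such as [1, 3, 224, 224].
function elementCount(dims: readonly number[]): number {
  return dims.reduce((acc, d) => acc * d, 1);
}

// Index of the largest value (argmax). Returns -1 for an empty input.
// Assumes the values contain no NaN entries.
function findMaxId(values: ArrayLike<number>): number {
  let best = -1;
  let bestValue = -Infinity;
  Array.from(values).forEach((v, i) => {
    if (best === -1 || v > bestValue) {
      bestValue = v;
      best = i;
    }
  });
  return best;
}
```

Comparing imageFloat32.length against elementCount([1, 3, 224, 224]) before creating the tensor would rule out a mismatched input buffer, although in this thread the crash turned out to be unrelated to the input data.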

Urgency

No response

Platform

Android

OS Version

13

ONNX Runtime Installation

Built from Source

Compiler Version (if 'Built from Source')

No response

Package Name (if 'Released Package')

None

ONNX Runtime Version or Commit ID

onnxruntime-react-native@1.15.1

ONNX Runtime API

Java/Kotlin

Architecture

ARM64

Execution Provider

Default CPU

Execution Provider Library Version

No response

skottmckay commented 12 months ago

Can you clarify where this is running? You've said the architecture is X64 but in the stack trace it's using /lib/arm64/libonnxruntime.so.

bartproo commented 12 months ago

@skottmckay oops it should be arm64, on real device (samsung note 10 lite)

YUNQIUGUO commented 12 months ago

Hi, do you have a model that you ran with the React Native app that you can share for troubleshooting?

bartproo commented 12 months ago

@YUNQIUGUO https://drive.google.com/file/d/1pOK0ZUDStoRGQzZDw9KmrPVw_Y1gHPVB/view?usp=drive_link

bartproo commented 12 months ago

@YUNQIUGUO I tried another open-source ONNX model and got the same result, so the problem is not with the ONNX model.

bartproo commented 11 months ago

@YUNQIUGUO How do I go about this?

bartproo commented 11 months ago

Just realised that access to the drive link was restricted. I have now allowed access for anyone with the link.

juliankotrba commented 11 months ago

Just FYI, we are experiencing the same (?) crash on Android 9 devices, but only since version 1.16.0. Version 1.15.1 is working fine for us.

Fatal signal 4 (SIGILL), code 1 (ILL_ILLOPC), fault addr 0x6f291947fc in tid 13393

In case I find any more information I will post it here.

YUNQIUGUO commented 11 months ago

> Just FYI, we are experiencing the same (?) crash on Android 9 devices, but only since version 1.16.0. Version 1.15.1 is working fine for us.
>
> Fatal signal 4 (SIGILL), code 1 (ILL_ILLOPC), fault addr 0x6f291947fc in tid 13393
>
> In case I find any more information I will post it here.

Thanks for the info. Would you mind sharing which specific device models running Android 9 you were testing with? There could be libraries added that are not available on older architectures.

@chenfucn from the assembly, are you aware of anything obvious that's introduced in arm64 that may cause the crash?

juliankotrba commented 11 months ago

@YUNQIUGUO sure! It was a Samsung Galaxy A8 and a Nokia 8

chenfucn commented 11 months ago

nothing obvious comes to mind. arm64 assembly hasn't changed for a while now. What does ILL_ILLOPC mean? illegal op code? is it executing some instructions that the device does not support?
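
For reference: on Linux and Android, signal 4 is SIGILL, and si_code 1 (ILL_ILLOPC) does mean "illegal opcode", i.e. the CPU was asked to execute an instruction it does not recognise. The sketch below pulls those fields out of a crash line in the format quoted in this thread. It is only an illustration: parseFatalSignal and SIGILL_CODES are hypothetical names, and the regular expression assumes the line looks exactly like the ones posted here.

```typescript
// si_code values for SIGILL as defined in the Linux signal headers.
const SIGILL_CODES: Record<number, string> = {
  1: "ILL_ILLOPC: illegal opcode",
  2: "ILL_ILLOPN: illegal operand",
  3: "ILL_ILLADR: illegal addressing mode",
  4: "ILL_ILLTRP: illegal trap",
  5: "ILL_PRVOPC: privileged opcode",
  6: "ILL_PRVREG: privileged register",
  7: "ILL_COPROC: coprocessor error",
  8: "ILL_BADSTK: internal stack error",
};

interface FatalSignal {
  signal: number;
  signalName: string;
  code: number;
  codeName: string;
  faultAddr: string;
  tid: number;
}

// Hypothetical parser; returns null when the line does not match the
// "Fatal signal N (NAME), code N (NAME), fault addr 0x... in tid N" layout.
function parseFatalSignal(line: string): FatalSignal | null {
  const m =
    /Fatal signal (\d+) \((\w+)\), code (-?\d+) \((\w+)\), fault addr (0x[0-9a-fA-F]+) in tid (\d+)/.exec(
      line
    );
  if (m === null) {
    return null;
  }
  return {
    signal: Number(m[1]),
    signalName: String(m[2]),
    code: Number(m[3]),
    codeName: String(m[4]),
    faultAddr: String(m[5]),
    tid: Number(m[6]),
  };
}

// Human-readable meaning of a parsed SIGILL crash, or "unknown" otherwise.
function describeFatalSignal(sig: FatalSignal): string {
  if (sig.signal !== 4) {
    return "unknown";
  }
  return SIGILL_CODES[sig.code] ?? "unknown";
}
```

The fault address in such a line is where the offending instruction lives, so matching it against the loaded address range of libonnxruntime.so in the full tombstone is what tells you which library executed the unsupported instruction.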

skottmckay commented 11 months ago

> Just FYI, we are experiencing the same (?) crash on Android 9 devices, but only since version 1.16.0. Version 1.15.1 is working fine for us.
>
> Fatal signal 4 (SIGILL), code 1 (ILL_ILLOPC), fault addr 0x6f291947fc in tid 13393
>
> In case I find any more information I will post it here.

@juliankotrba were there any other differences between the app using 1.15.1 that worked and the one using 1.16.0 that crashes? Other reports of this issue involve older versions of ORT. #17647 mentions 1.14.0 and this issue was for 1.15.1. That would suggest some other component such as the react native version is at least contributing to the problem.

bartproo commented 11 months ago

@skottmckay I was able to run my code previously. It only started crashing with a recent build. I haven't touched the code in a while and did not make any changes to it. Weird issue.

skottmckay commented 11 months ago

@bartproo was there any other change like updating React Native itself?

juliankotrba commented 11 months ago

@skottmckay we are not aware of any other relevant changes in the code, just the update of the version of the onnx runtime for android.

seungillee commented 11 months ago

A similar issue happens to me. onnxruntime 1.15.1 works fine with my code, but when I update to onnxruntime 1.16.0, it gives SIGILL and crashes the app.

bartproo commented 11 months ago

@skottmckay, there haven't been any updates to the code at all. The code has not been changed for months. It just happened that when I rebuilt the app recently I got this error. No changes to the code or environment whatsoever. The new build still works on the emulator and on a Xiaomi phone, but not on my Note 10 Lite. Old APK files that I had still work perfectly fine on my Note 10 Lite.

YUNQIUGUO commented 11 months ago

> A similar issue happens to me. onnxruntime 1.15.1 works fine with my code, but when I update to onnxruntime 1.16.0, it gives SIGILL and crashes the app.

OK. Would you mind also sharing the device model with us? Thanks. @seungillee

seungillee commented 11 months ago

As far as I tested, it failed on Android 9 or older. It works fine on Android 10 or newer.

YUNQIUGUO commented 11 months ago

@bartproo we are seeing other reports that mostly fail on older Android versions. Just to confirm: your crash happens even on Android 13? And is the Xiaomi phone (the device on which the new build works) also on Android 13?

Asking because I can't repro the issue on my test device, which runs a newer Android version.

Meanwhile, would you mind testing this very simple ONNX model, which contains only a Reshape op, to see if it still fails? Reshape shouldn't involve the ORT MLAS assembly, so if it succeeds, the issue is likely within our MLAS code or similar. If it still fails, the issue may be happening outside of ORT. https://github.com/onnx/onnx/tree/e2525550194ce3d8a2c4a3af451c9d9b3ae6650e/onnx/backend/test/data/node/test_reshape_one_dim

The model may require you to provide an additional 1D int64 tensor, shape, as the second input.

skottmckay commented 11 months ago

> nothing obvious comes to mind. arm64 assembly hasn't changed for a while now. What does ILL_ILLOPC mean? illegal op code? is it executing some instructions that the device does not support?

@chenfucn these devices are pretty old. According to this, the Nokia 8 has a Qualcomm Snapdragon 835, which according to this has 'Kryo 280 (2.45 GHz Cortex-A73 + 1.9 GHz Cortex-A53)', and according to this both of those Cortex chips are ARMv8-A and not ARMv8.2-A.

The latest version of the Samsung Galaxy A8 also used a Cortex-A73/Cortex-A53 according to this.

This is still a little muddy when combined with https://github.com/microsoft/onnxruntime/issues/17647#issuecomment-1738542003 for two reasons

Here are some potential tests we could do using the onnx_test_runner binary to take React Native out of the picture. This can be run on device using adb.

We can provide a zip with the necessary onnx_test_runner binaries/models/input data and instructions to run them if someone is able to test this out on a device with the issue.

Regardless, we should consider changing the MLAS flags to target ARMv8-A rather than ARMv8.2-A for Android builds if we want to support old devices.

bartproo commented 11 months ago

@YUNQIUGUO Yes, my crash happened on the Note 10 Lite running Android 13. The Xiaomi is on Android 12. I tried running your model with the following code and it ran successfully:

  feeds[session.inputNames[0]] = new ort.Tensor(
    "float32",
    new Float32Array(24),
    [2, 3, 4]
  );
  const shape = new ort.Tensor(
    "int64",
    new BigInt64Array([24n]),
    [1]
  );
  feeds["shape"] = shape;
  const fetches = await session.run(feeds);

bartproo commented 11 months ago

@skottmckay how do I build with ARMv8-A instead of ARMv8.2-A? I believe this might be the issue.

skottmckay commented 11 months ago

@bartproo it's still not clear to me if it is. Looking into it more, those flags are only used for 16-bit float kernels, which I don't believe are relevant here. I've been trying to set up a test environment using an emulator to see if I can replicate the illegal opcode that way, but so far have been unsuccessful.

Are you able to try the XNNPACK execution provider? That would have different kernels to the ORT MLAS implementation and performance on ARM for a 32-bit float model should be better. You'd need ORT 1.16 as the ability to register the XNNPACK execution provider was added in the latest release.

bartproo commented 11 months ago

@skottmckay, can you provide details on how to use the XNNPACK execution provider in React Native? Thank you!

skottmckay commented 11 months ago

I think it should be possible like this: https://github.com/microsoft/onnxruntime-inference-examples/blob/main/js/api-usage_session-options/README.md

const sessionOptions = { executionProviders: ['xnnpack'] };

And pass the sessionOptions to InferenceSession.create as the second argument.
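
As a slightly fuller sketch of the suggestion above: the helper below builds the options object, listing 'xnnpack' first and 'cpu' as a fallback. buildSessionOptions and SessionOptionsSketch are hypothetical names introduced here for illustration, and the sketch assumes that 'cpu' is accepted as a provider name alongside 'xnnpack'. The InferenceSession.create call is shown only as a comment so that the block stands on its own without the onnxruntime-react-native package.

```typescript
// Hypothetical shape covering only the field used in this thread.
interface SessionOptionsSketch {
  executionProviders: string[];
}

// Hypothetical helper: prefer XNNPACK when requested, keep the default CPU
// provider as a fallback entry (assumption: 'cpu' is a valid provider name).
function buildSessionOptions(useXnnpack: boolean): SessionOptionsSketch {
  return {
    executionProviders: useXnnpack ? ["xnnpack", "cpu"] : ["cpu"],
  };
}

// In an app with onnxruntime-react-native 1.16 or later installed, the
// options would be passed as the second argument, roughly like this:
//
//   import { InferenceSession } from "onnxruntime-react-native";
//   const session = await InferenceSession.create(
//     modelPath, // hypothetical variable holding the model file path
//     buildSessionOptions(true)
//   );
```

Keeping the provider list behind a single function makes it easy to switch back to the default CPU provider when comparing behaviour between the two code paths on an affected device.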

bartproo commented 11 months ago

@skottmckay Thanks, got it working using ONNX Runtime 1.16 with XNNPACK.

skottmckay commented 11 months ago

@bartproo Just to double check: the model executes correctly with XNNPACK, so we can say for sure that the ORT MLAS implementation is the issue?

YUNQIUGUO commented 11 months ago

@bartproo fwiw, here's a zip file containing a Release with debug info build of onnx_test_runner, built with the armv8a flag instead of armv8.2-a: https://github.com/microsoft/onnxruntime/blob/yguo/armv8a-build-args-for-mlas/onnx_test_runner_armv8a_flag.zip and a test input file which I believe can be used as the input data for your resnet18 model. You can try test 3 that Scott mentioned above on your end.

FYI, onnx_test_runner can be pushed to your Android device with adb, together with the input test data and the ONNX model.

Here's a doc with details of the options for onnx_test_runner: https://github.com/microsoft/onnxruntime/blob/main/docs/Model_Test.md. Once onnx_test_runner is pushed to the device, you can run it with something like:

./onnx_test_runner -c 1 -j 1 -v path/to/testdataandmodels/ (you need to create a directory containing the resnet model and the test input data).

skottmckay commented 11 months ago

@bartproo is the Android OS on the device 32-bit or 64-bit?

bartproo commented 11 months ago

should be 64 bit

yufenglee commented 11 months ago

@bartproo, we cannot repro the crash locally. Could you please share the backtrace with us?

skottmckay commented 11 months ago

We finally have a repro. @bartproo there are a couple of test binaries that could be run to repro the issue and validate a fix. Instructions are in this reply: https://github.com/microsoft/onnxruntime/issues/17647#issuecomment-1752208453

bartproo commented 11 months ago

The tests ran as expected. Running onnxruntime_mlas_test_1.16.1 failed and onnxruntime_mlas_test_1.16.1_patch passed.

r7:/data/local/tmp $ ./onnxruntime_mlas_test_1.16.1
WARNING: linker: Warning: unable to normalize "'/data/local/tmp'" (ignoring)
-------------------------------------------------------
----Running normal quick check mode. To enable more complete test,
----  run with '--long' as first argument!
Illegal instruction

edgchen1 commented 11 months ago

Fixed by #17885