microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

[Mobile] Crash error occurs in onnxruntime.run(). #17647

Closed adjh54ir closed 8 months ago

adjh54ir commented 9 months ago

Describe the issue

Hello,

The product is being developed using the onnxruntime-react-native library. I had been testing with the same version until 1-2 months ago and confirmed that it worked well. At some point, onnxruntime.run() started crashing, which causes the app to shut down.

I have confirmed that the model loads, and I receive the input and output names as return values. If a key in the feeds differs from the input name, an error is returned, as shown below.

{"handler": {"inputNames": ["input_1:0"], "outputNames": ["pred_pose/mul_24:0"]}}
LOG Error: input 'input_1:0' is missing in 'feeds'.

So I suspect that the error occurs while run() is executing.
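The "missing in 'feeds'" error quoted above appears when the keys of the feeds object do not match the session's input names. A minimal sketch of a pre-flight check follows, assuming the input names are available as in the log above. `findMissingInputs` is a hypothetical helper for illustration, not part of onnxruntime-react-native.

```typescript
// Hypothetical helper: list the model inputs that have no entry in `feeds`.
function findMissingInputs(
  inputNames: readonly string[],
  feeds: Record<string, unknown>,
): string[] {
  return inputNames.filter(
    (name) => !Object.prototype.hasOwnProperty.call(feeds, name),
  );
}

// Input name taken from the log above; the tensor data is a placeholder.
const inputNames = ["input_1:0"];
const wrongFeeds = { input_1: new Float32Array(4) };
const rightFeeds = { "input_1:0": new Float32Array(4) };

console.log(findMissingInputs(inputNames, wrongFeeds)); // one missing name: input_1:0
console.log(findMissingInputs(inputNames, rightFeeds)); // nothing missing
```

A check like this only rules out the feeds error. It says nothing about a native crash inside run().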

I need help!!

  1. Google Crashlytics

Untitled

  2. Sentry

Untitled Untitled

  3. Logcat

Screenshot 2023-09-21 5.03.44 PM

This is the main version I'm using.

To reproduce

all ort inference code

Urgency

No response

Platform

Android

OS Version

1.14

ONNX Runtime Installation

Built from Source

Compiler Version (if 'Built from Source')

No response

Package Name (if 'Released Package')

None

ONNX Runtime Version or Commit ID

1.14

ONNX Runtime API

JavaScript

Architecture

ARM64

Execution Provider

Other / Unknown

Execution Provider Library Version

No response

skottmckay commented 9 months ago

Can you please provide the exact differences between the version that worked and the one with the crash, including any differences in the dependencies of your app?

adjh54ir commented 9 months ago

Hello

I re-created the previous project because of an unknown error. The only change I made was upgrading @react-native-community/cli from 9.3.2 to 11.3.8.

I can't determine at what point the crash started, but I still have the previous package.json. Below I have included the list of libraries that previously worked and the list of current libraries.

< package.json file that was working >


{
  "dependencies": {
    "@react-keycloak/native": "^0.6.4",
    "@react-native-async-storage/async-storage": "^1.19.1",
    "@react-native-community/netinfo": "^9.3.10",
    "@react-native-community/slider": "^4.4.2",
    "@react-native-masked-view/masked-view": "^0.2.9",
    "@react-native-picker/picker": "^2.4.10",
    "@react-navigation/bottom-tabs": "^6.5.8",
    "@react-navigation/native": "^6.1.6",
    "@react-navigation/stack": "^6.3.16",
    "@reduxjs/toolkit": "^1.9.5",
    "@tanstack/react-query": "^4.29.13",
    "@tensorflow-models/face-landmarks-detection": "0.0.3",
    "@tensorflow/tfjs": "3.7.0",
    "@tensorflow/tfjs-backend-webgl": "3.7.0",
    "@tensorflow/tfjs-converter": "3.7.0",
    "@tensorflow/tfjs-core": "3.7.0",
    "@tensorflow/tfjs-react-native": "^0.8.0",
    "axios": "^1.4.0",
    "expo": "^47.0.0",
    "expo-asset": "^8.9.1",
    "expo-camera": "^13.2.1",
    "expo-gl": "^12.4.0",
    "onnxruntime-react-native": "^1.14.0",
    "react": "18.1.0",
    "react-hook-form": "^7.45.1",
    "react-native": "0.70.6",
    "react-native-bouncy-checkbox": "^3.0.7",
    "react-native-canvas": "^0.1.39",
    "react-native-date-picker": "^4.2.13",
    "react-native-dotenv": "^3.4.8",
    "react-native-draggable-flatlist": "^4.0.1",
    "react-native-flip-card": "^3.5.7",
    "react-native-fs": "^2.20.0",
    "react-native-gesture-handler": "^2.11.0",
    "react-native-inappbrowser-reborn": "^3.7.0",
    "react-native-pager-view": "^6.2.0",
    "react-native-permissions": "^3.8.0",
    "react-native-responsive-screen": "^1.4.2",
    "react-native-safe-area-context": "^4.5.3",
    "react-native-swipe-list-view": "^3.2.9",
    "react-native-swipe-up-down": "^1.2.0",
    "react-native-swiper": "^1.6.0",
    "react-native-tab-view": "^3.5.2",
    "react-native-vector-icons": "^9.2.0",
    "react-native-webview": "^13.2.2",
    "react-redux": "^8.1.1",
    "redux-logger": "^3.0.6",
    "redux-persist": "^6.0.0"
  },
  "devDependencies": {
    "@babel/core": "^7.12.9",
    "@babel/runtime": "^7.12.5",
    "@react-native-community/eslint-config": "^2.0.0",
    "@tsconfig/react-native": "^2.0.2",
    "@types/": "react-navigation/native",
    "@types/jest": "^26.0.23",
    "@types/react": "^18.0.21",
    "@types/react-native": "^0.70.6",
    "@types/react-native-canvas": "^0.1.9",
    "@types/react-native-dotenv": "^0.2.0",
    "@types/react-redux": "^7.1.25",
    "@types/react-test-renderer": "^18.0.0",
    "@types/redux-logger": "^3.0.9",
    "@typescript-eslint/eslint-plugin": "^5.37.0",
    "@typescript-eslint/parser": "^5.37.0",
    "babel-jest": "^26.6.3",
    "babel-plugin-module-resolver": "^5.0.0",
    "base-64": "^1.0.0",
    "eslint": "^7.32.0",
    "eslint-plugin-import": "^2.27.5",
    "jest": "^26.6.3",
    "metro-react-native-babel-preset": "0.72.3",
    "prettier": "2.8.8",
    "react-test-renderer": "18.1.0",
    "typescript": "^4.8.3"
  },
  "jest": {
    "preset": "react-native",
    "moduleFileExtensions": [
      "ts",
      "tsx",
      "js",
      "jsx",
      "json",
      "node"
    ]
  }
}

< Current package.json file >

{
  "dependencies": {
    "@react-native-async-storage/async-storage": "1.19.3",
    "@react-native-community/netinfo": "9.4.1",
    "@react-native-community/slider": "4.4.2",
    "@react-native-firebase/analytics": "18.4.0",
    "@react-native-firebase/app": "18.4.0",
    "@react-native-firebase/crashlytics": "18.4.0",
    "@react-native-masked-view/masked-view": "0.2.9",
    "@react-native-picker/picker": "2.4.10",
    "@react-navigation/bottom-tabs": "6.5.8",
    "@react-navigation/native": "6.1.7",
    "@react-navigation/stack": "6.3.17",
    "@reduxjs/toolkit": "1.9.5",
    "@sentry/react-native": "5.9.1",
    "@sentry/tracing": "7.69.0",
    "@tensorflow-models/face-landmarks-detection": "0.0.3",
    "@tensorflow/tfjs": "3.7.0",
    "@tensorflow/tfjs-backend-webgl": "3.7.0",
    "@tensorflow/tfjs-converter": "3.7.0",
    "@tensorflow/tfjs-core": "3.7.0",
    "@tensorflow/tfjs-react-native": "0.8.0",
    "axios": "1.5.0",
    "base-64": "1.0.0",
    "expo": "49.0.0",
    "expo-asset": "8.12.0",
    "expo-camera": "13.6.0",
    "expo-gl": "13.2.0",
    "expo-linear-gradient": "12.5.0",
    "install-expo-modules": "0.6.3",
    "jwt-decode": "3.1.2",
    "onnxruntime-react-native": "1.14.0",
    "react": "18.2.0",
    "react-hook-form": "7.46.1",
    "react-native": "0.72.4",
    "react-native-bouncy-checkbox": "3.0.7",
    "react-native-calendars": "1.1300.0",
    "react-native-canvas": "0.1.39",
    "react-native-chart-kit": "6.12.0",
    "react-native-circular-progress": "1.3.9",
    "react-native-date-picker": "4.2.13",
    "react-native-dotenv": "3.4.8",
    "react-native-flip-card": "3.5.7",
    "react-native-fs": "2.20.0",
    "react-native-gesture-handler": "2.12.1",
    "react-native-linear-gradient": "2.8.3",
    "react-native-pager-view": "6.2.1",
    "react-native-paper": "5.10.4",
    "react-native-permissions": "3.8.0",
    "react-native-safe-area-context": "4.7.2",
    "react-native-screens": "2.15.0",
    "react-native-splash-screen": "3.3.0",
    "react-native-svg": "13.13.0",
    "react-native-svg-transformer": "1.1.0",
    "react-native-swipe-list-view": "3.2.9",
    "react-native-swiper": "1.6.0",
    "react-native-tab-view": "3.5.2",
    "react-native-vector-icons": "10.0.0",
    "react-native-webview": "13.6.0",
    "react-redux": "8.1.2",
    "redux-logger": "3.0.6",
    "redux-persist": "6.0.0"
  },
  "devDependencies": {
    "@babel/core": "7.20.20",
    "@babel/preset-env": "7.20.20",
    "@babel/runtime": "7.20.0",
    "@react-native-community/eslint-config": "3.2.0",
    "@react-native/eslint-config": "0.72.2",
    "@react-native/metro-config": "0.72.11",
    "@tsconfig/react-native": "3.0.2",
    "@types/react": "18.2.21",
    "@types/react-native": "0.72.2",
    "@types/react-native-dotenv": "0.2.0",
    "@types/react-redux": "7.1.26",
    "@types/react-test-renderer": "18.0.0",
    "@types/redux-logger": "3.0.9",
    "@typescript-eslint/eslint-plugin": "6.7.0",
    "babel-jest": "29.2.1",
    "eslint": "8.19.0",
    "jest": "29.2.1",
    "metro-react-native-babel-preset": "0.76.8",
    "metro-react-native-babel-transformer": "0.77.0",
    "prettier": "2.4.1",
    "react-test-renderer": "18.2.0",
    "typescript": "4.8.4"
  },
  "engines": {
    "node": ">=16"
  },
  "expo": {
    "autolinking": {
      "exclude": [
        "expo-font",
        "expo-application",
        "expo-keep-awake"
      ],
      "ios": {
        "exclude": [
          "expo-keep-awake"
        ]
      }
    }
  }
}
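To make the differences between the two files above easier to see, the dependency maps can be compared programmatically. The sketch below is illustrative only. `diffDependencies` is a hypothetical helper, and the sample maps contain just a few entries copied from the two package.json files.

```typescript
type Deps = Record<string, string>;

interface VersionChange {
  name: string;
  before: string | undefined; // undefined: package was added
  after: string | undefined; // undefined: package was removed
}

// List every package whose version string differs between two dependency maps.
function diffDependencies(before: Deps, after: Deps): VersionChange[] {
  return Object.keys({ ...before, ...after })
    .sort()
    .filter((name) => before[name] !== after[name])
    .map((name) => ({ name, before: before[name], after: after[name] }));
}

// A few entries copied from the two package.json files above.
const working: Deps = {
  "@react-keycloak/native": "^0.6.4",
  expo: "^47.0.0",
  "onnxruntime-react-native": "^1.14.0",
  "react-native": "0.70.6",
};
const current: Deps = {
  "@sentry/react-native": "5.9.1",
  expo: "49.0.0",
  "onnxruntime-react-native": "1.14.0",
  "react-native": "0.72.4",
};

for (const change of diffDependencies(working, current)) {
  console.log(
    `${change.name}: ${change.before ?? "(absent)"} -> ${change.after ?? "(absent)"}`,
  );
}
```

Note that a caret range such as ^1.14.0 can resolve to a later 1.x release, so the lockfile is the more reliable record of what was actually installed.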

Please reply.

YUNQIUGUO commented 9 months ago

Similar issue: #17541

I noticed in the screenshot that you are testing on a slightly older device model with Android 10. Would you mind verifying whether a similar crash still happens on a device with a newer Android version or a newer device model?

I am asking because locally I am testing the onnxruntime-react-native package (from version 1.14 to the newest release, 1.16.0) on a Samsung Galaxy Note20 (with Android 12) and the inference call succeeded. Just trying to narrow down the issue. Thanks!

chenfucn commented 9 months ago

Can this be reproduced with a debug build of onnxruntime, to help us narrow down the function or even the line number?

adjh54ir commented 9 months ago

Hello,

As you said, I ran it on the Pocophone F1. So this time, we tested on several devices. Refer to the table below comparing each device.

| Device | Success or failure | OS | Processor | Architecture | Memory | Device architecture |
|---|---|---|---|---|---|---|
| Poco phone F1 | Failure | Android 10 | Qualcomm Snapdragon 845 SDM845 | arm64-v8a | 6 GB LPDDR4X SDRAM, 64/128 GB | arm64-v8a |
| Galaxy S9 | Failure | Android 10 | Samsung Exynos 9 Series (9810) | arm64-v8a | 4 GB LPDDR4X SDRAM | arm64-v8a |
| Galaxy S10+ | Success | Android 12 | Samsung Exynos 9 Series (9820) | arm64-v8a | 8 GB LPDDR4X SDRAM | arm64-v8a |
| Galaxy S20 Note | Success | Android 13 | Qualcomm Snapdragon 865+ SM8250-AB | arm64-v8a | 8 GB LPDDR5 SDRAM | arm64-v8a |
| Galaxy S23 | Success | Android 13 | Qualcomm Snapdragon 8 Gen 2 for Galaxy SM8550-AC | arm64-v8a | 8 GB LPDDR5X SDRAM | arm64-v8a |

My guess is that it doesn't work on Android 10. Is that correct? I'm curious why it no longer works on the device I've been using for a long time.

Please reply.

YUNQIUGUO commented 9 months ago

Thanks for the info. Good to know that it passes on newer devices with more recent Android versions.

At this point, I am not sure yet whether it's an issue with Android 10 or an issue caused by invalid access to resources that are not available on older architectures. We would have to test more to see.

On a side note, I also noticed you've updated React Native to the latest 0.72.x (compared to before), which could also be a factor.

adjh54ir commented 9 months ago

I still don't fully understand why the devices that were used during development suddenly stopped working.

What puzzles me most is that this device had been working well for a long time in the 0.70.6 environment, rather than the React Native 0.72.4 version mentioned above, and then suddenly stopped working at some point.

I don't know what the clear answer is, but for now we've decided to use the latest devices rather than the old ones. We also don't have a clear answer on whether it fails in Android 10 environments in general, but we hope it will soon work on the devices where it currently fails.

Since the issue is still unresolved, I will leave the GitHub issue open.

Thank you.

skottmckay commented 9 months ago

@chenfucn is it possible that these MLAS build flags could cause issues with older hardware that does not support ARMv8.2?

https://github.com/microsoft/onnxruntime/blob/1f4a3529ddde7796cd018b6c2bd97a6e099c4b88/cmake/onnxruntime_mlas.cmake#L347-L350

That may apply to the Galaxy S9 (and Note 10 Lite reported in another issue with the same crash) but the specs for the Poco F1 say that has two Cortex chips that are ARMv8.2 so in theory that one should be safe.

It's also possible it's a React Native issue given it is suddenly affecting multiple ORT versions. AFAIK we have had no reports of illegal instruction issues with React Native until the last few weeks but ORT 1.14 was released in February.

@adjh54ir is it possible to try with React Native 0.70 instead of 0.72, given 0.70 was used in your working version?

skottmckay commented 9 months ago

@adjh54ir I'm trying to make sense of the result with the Poco F1. All other devices reporting the issue seem to have CPU chips that only support ARMv8-A and it looks like our build flags could potentially be an issue there. The Poco F1 looks like it should be fine though, so I wanted to double check the chips in the actual device. Are you able to provide the info from one of these options to try and confirm what is in the device? https://www.geekersdigest.com/how-to-identify-cpu-information-type-and-model-on-android/
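One way to gather the CPU details requested above is to read /proc/cpuinfo on the device (for example with `adb shell cat /proc/cpuinfo`) and inspect its Features lines. The sketch below parses that text. The sample string is illustrative and was not captured from any device in this thread. Treating the `asimdhp` flag as the one relevant to the build flags discussed here is an assumption, and both function names are hypothetical.

```typescript
// Return the feature flags of each "Features" line (typically one per core)
// found in /proc/cpuinfo text.
function parseFeatureLines(cpuinfo: string): string[][] {
  const result: string[][] = [];
  for (const line of cpuinfo.split("\n")) {
    const match = /^Features\s*:\s*(.*)$/.exec(line.trim());
    if (match) {
      result.push((match[1] ?? "").split(/\s+/).filter((flag) => flag.length > 0));
    }
  }
  return result;
}

// A flag only counts if every core reports it, because a thread may be
// scheduled on any core.
function supportedOnAllCores(cpuinfo: string, flag: string): boolean {
  const perCore = parseFeatureLines(cpuinfo);
  return perCore.length > 0 && perCore.every((flags) => flags.indexOf(flag) !== -1);
}

// Illustrative sample only: two cores with different feature sets.
const sampleCpuinfo = [
  "processor\t: 0",
  "Features\t: fp asimd evtstrm aes pmull sha1 sha2 crc32",
  "processor\t: 4",
  "Features\t: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp",
].join("\n");

console.log(supportedOnAllCores(sampleCpuinfo, "asimd")); // true
console.log(supportedOnAllCores(sampleCpuinfo, "asimdhp")); // false
```

Checking every core rather than the union of flags matters on big.LITTLE designs, where the core clusters can report different feature sets.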

skottmckay commented 9 months ago

@adjh54ir does the model you're using take 32-bit float data? If so, would you be able to test with the latest ORT release (1.16) and the XNNPACK execution provider to see if that avoids the issue?

adjh54ir commented 9 months ago

Hello

I am replying late because I was on vacation. First of all, thank you for your response to the question.

First, regarding the processor information for the Pocophone F1 that you requested: I checked it with the CPU-Z application and confirmed that the device has a 10 nm processor.

I attach the picture below.

image

adjh54ir commented 9 months ago

Hello. Here is the second answer.

We upgraded to onnxruntime-react-native "1.16.0" and added 'xnnpack' to the session options. The source code we applied is below.

// [STEP4] hsemotion Model Load : ONNX
let _hemotionModel: ort.InferenceSession | null = null;
const _hemotionAssets = await Asset.loadAsync(HS_EMOTION_MODEL);
const _hemotionOnnxModelUri: string | null = _hemotionAssets[0].localUri;
if (_hemotionOnnxModelUri !== null) {
    await ort.InferenceSession.create(_hemotionOnnxModelUri, { executionProviders: ['xnnpack'] })
        .then((_loadSession: ort.InferenceSession) => {
            console.log("[+] Load Hsemotion Model....");
            _hemotionModel = _loadSession;
        })
        .catch((error) => { throw new Error(`[-] hsemotion Load Error: ${error}`); });
}

// [STEP5] FSA-NET Model Load : ONNX
let _fsanetModel: ort.InferenceSession | null = null;
const _fsaNetOnnxModeAssets = await Asset.loadAsync(FSA_NET_MODEL);
const _fsaNetOnnxModelUri: string | null = _fsaNetOnnxModeAssets[0].localUri;
if (_fsaNetOnnxModelUri !== null) {
    await ort.InferenceSession.create(_fsaNetOnnxModelUri, { executionProviders: ['xnnpack'] })
        .then((_loadSession: ort.InferenceSession) => {
            console.log("[+] Load FSA-NET Model....");
            _fsanetModel = _loadSession;
        })
        .catch((error) => { throw new Error(`[-] FSA-NET Load Error: ${error}`); });
}

// [STEP6] hPose Model Load : ONNX
let _hposeModel: ort.InferenceSession | null = null;
const _hposeAssets = await Asset.loadAsync(HPOSE_MODEL);
const _hposeOnnxModelUri: string | null = _hposeAssets[0].localUri;
if (_hposeOnnxModelUri !== null) {
    await ort.InferenceSession.create(_hposeOnnxModelUri, { executionProviders: ['xnnpack'] })
        .then((_loadSession: ort.InferenceSession) => {
            console.log("[+] Load HPose Model....");
            _hposeModel = _loadSession;
        })
        .catch((error) => { throw new Error(`[-] HPose Load Error: ${error}`); });
}

However, the following error occurred.

Error: [-] hsemotion Load Error: Error: Can't load a model: Error code - ORT_RUNTIME_EXCEPTION - message: Exception during initialization: /onnxruntime_src/onnxruntime/core/framework/kernel_registry.cc:69 bool onnxruntime::(anonymous namespace)::MatchKernelDefTypes(const onnxruntime::Node &, const std::unordered_map<std::string, std::vector> &, const onnxruntime::IKernelTypeStrResolver &, std::string &) [ONNXRuntimeError] : 1 : FAIL : kernel_type_str_resolver.cc:20 ResolveKernelTypeStr Failed to find op_id: com.ms.internal.nhwc:Conv:11

at anonymous (http://localhost:8081/index.bundle//&platform=android&dev=true&minify=false&app=com.tugboat_mobile&modulesOnly=false&runModule=true:229474:32)
at tryCallOne (http://localhost:8081/index.bundle//&platform=android&dev=true&minify=false&app=com.tugboat_mobile&modulesOnly=false&runModule=true:37791:16)
at anonymous (http://localhost:8081/index.bundle//&platform=android&dev=true&minify=false&app=com.tugboat_mobile&modulesOnly=false&runModule=true:37872:27)
at apply (native)
at anonymous (http://localhost:8081/index.bundle//&platform=android&dev=true&minify=false&app=com.tugboat_mobile&modulesOnly=false&runModule=true:39114:26)
at _callTimer (http://localhost:8081/index.bundle//&platform=android&dev=true&minify=false&app=com.tugboat_mobile&modulesOnly=false&runModule=true:38993:17)
at _callReactNativeMicrotasksPass (http://localhost:8081/index.bundle//&platform=android&dev=true&minify=false&app=com.tugboat_mobile&modulesOnly=false&runModule=true:39038:17)
at callReactNativeMicrotasks (http://localhost:8081/index.bundle//&platform=android&dev=true&minify=false&app=com.tugboat_mobile&modulesOnly=false&runModule=true:39244:44)
at __callReactNativeMicrotasks (http://localhost:8081/index.bundle//&platform=android&dev=true&minify=false&app=com.tugboat_mobile&modulesOnly=false&runModule=true:3174:46)
at anonymous (http://localhost:8081/index.bundle//&platform=android&dev=true&minify=false&app=com.tugboat_mobile&modulesOnly=false&runModule=true:2948:45)
at __guard (http://localhost:8081/index.bundle//&platform=android&dev=true&minify=false&app=com.tugboat_mobile&modulesOnly=false&runModule=true:3147:15)
at flushedQueue (http://localhost:8081/index.bundle//&platform=android&dev=true&minify=false&app=com.tugboat_mobile&modulesOnly=false&runModule=true:2947:21)
at invokeCallbackAndReturnFlushedQueue (http://localhost:8081/index.bundle//&platform=android&dev=true&minify=false&app=com.tugboat_mobile&modulesOnly=false&runModule=true:2941:33)
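The load step above could fall back to another execution provider when a session cannot be created with XNNPACK. Below is a minimal sketch, with the session factory passed in as a parameter so the snippet stays self-contained. `createWithFallback` and `SessionFactory` are hypothetical names. In the app, the factory would presumably wrap `ort.InferenceSession.create`, and the 'cpu' provider name is assumed to be accepted there.

```typescript
// Hypothetical factory type: in the app this would wrap the real session creation call.
type SessionFactory<S> = (
  uri: string,
  options: { executionProviders: string[] },
) => Promise<S>;

// Try each provider list in order and return the first session that loads.
async function createWithFallback<S>(
  create: SessionFactory<S>,
  uri: string,
  providerLists: string[][],
): Promise<S> {
  let lastError: unknown = new Error("no execution providers given");
  for (const executionProviders of providerLists) {
    try {
      return await create(uri, { executionProviders });
    } catch (error) {
      console.warn(`Load failed with [${executionProviders.join(", ")}]: ${String(error)}`);
      lastError = error;
    }
  }
  throw lastError;
}

// Stand-in factory that rejects 'xnnpack', mimicking the load error above.
const fakeCreate: SessionFactory<string> = async (uri, options) => {
  if (options.executionProviders.indexOf("xnnpack") !== -1) {
    throw new Error("Failed to find op_id");
  }
  return `session(${uri}, ${options.executionProviders.join("+")})`;
};

createWithFallback(fakeCreate, "model.onnx", [["xnnpack"], ["cpu"]]).then((session) =>
  console.log(session),
);
```

This only handles load-time errors such as the one above. It would not catch a native crash inside run().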

skottmckay commented 9 months ago

Are you able to share the model that failed? The XNNPACK EP definitely has a kernel for Conv in opset 11 so it's unexpected it would fail like this.

skottmckay commented 9 months ago

@adjh54ir I would still like to look into the XNNPACK issue if you're able to share the model.

We have a repro for the problem. Are you able to run a test binary from the ORT build to validate it fails with the current version and is fixed with a new version?

The binary is onnxruntime_mlas_test which comes from https://github.com/microsoft/onnxruntime/tree/main/onnxruntime/test/mlas/unittest

There are two versions of it. One from a build of the 1.16.1 release, and one with a fix applied to revert https://github.com/microsoft/onnxruntime/pull/16082

onnxruntime_mlas_test_1.16.1 onnxruntime_mlas_test_1.16.1_patch

Build was with these parameters:

./build  --android --android_api=24 --android_sdk="d:\Android" --android_abi=arm64-v8a --parallel --android_ndk_path="D:\Android\ndk\25.1.8937393" --build_shared_lib --cmake_generator Ninja

I'm most interested in validating the failure on the Poco F1 to make sure there are not two different issues.

onnxruntime_mlas_test.zip

You'll need to use adb to copy the binaries to /data/local/tmp, make them executable with chmod +x, and run them.

adb push ./onnxruntime_mlas_test_1.16.1 /data/local/tmp
adb push ./onnxruntime_mlas_test_1.16.1_patch /data/local/tmp
adb shell
> cd /data/local/tmp
> chmod +x ./onnxruntime_mlas_test*
> ./onnxruntime_mlas_test_1.16.1        <--- should fail
> ./onnxruntime_mlas_test_1.16.1_patch

skottmckay commented 8 months ago

Will be fixed in 1.16.2