microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

[Mobile] Segmentation fault after repeated inference #21082

Open laurenspriem opened 2 weeks ago

laurenspriem commented 2 weeks ago

Describe the issue

I am getting a segmentation fault (SIGSEGV) after repeated inference runs on mobile, which crashes the app. The crash only appears after running inference more than 300 times, but it appears consistently after that. For context, I am using ORT in a Flutter app through FFI.

Error logs

06-18 16:11:46.395 30309 30624 F libc    : Fatal signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0x0 in tid 30624 (2.ui), pid 30309 (tos.independent)
06-18 16:11:47.548 31118 31118 F DEBUG   : *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
06-18 16:11:47.548 31118 31118 F DEBUG   : CalyxOS version: '5.7.2'
06-18 16:11:47.548 31118 31118 F DEBUG   : Build incremental version: '24507020'
06-18 16:11:47.549 31118 31118 F DEBUG   : Build fingerprint: 'google/redfin/redfin:14/UP1A.231105.001.B2/11260668:user/release-keys'
06-18 16:11:47.549 31118 31118 F DEBUG   : Revision: 'MP1.0'
06-18 16:11:47.549 31118 31118 F DEBUG   : ABI: 'arm64'
06-18 16:11:47.549 31118 31118 F DEBUG   : Timestamp: 2024-06-18 16:11:46.651759126+0530
06-18 16:11:47.549 31118 31118 F DEBUG   : Process uptime: 686s
06-18 16:11:47.549 31118 31118 F DEBUG   : Cmdline: io.ente.photos.independent
06-18 16:11:47.549 31118 31118 F DEBUG   : pid: 30309, tid: 30624, name: 2.ui  >>> io.ente.photos.independent <<<
06-18 16:11:47.549 31118 31118 F DEBUG   : uid: 10404
06-18 16:11:47.549 31118 31118 F DEBUG   : signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0x0000000000000000
06-18 16:11:47.549 31118 31118 F DEBUG   : Cause: null pointer dereference
06-18 16:11:47.549 31118 31118 F DEBUG   :     x0  0000000000000000  x1  0000000000000051  x2  b400007a6e60b410  x3  0000000000000010
06-18 16:11:47.549 31118 31118 F DEBUG   :     x4  0000000000000000  x5  0000000000000000  x6  0000000000000480  x7  0000000000000640
06-18 16:11:47.549 31118 31118 F DEBUG   :     x8  0000000000000000  x9  b2d3c020e746ad14  x10 0000000000000003  x11 000000007e6f1618
06-18 16:11:47.549 31118 31118 F DEBUG   :     x12 b400007a1e648978  x13 b400007a1e648950  x14 b4000077b1652fc0  x15 0000000000000000
06-18 16:11:47.549 31118 31118 F DEBUG   :     x16 0000000000000001  x17 0000007c440eb518  x18 00000077c5368000  x19 0000007908c66148
06-18 16:11:47.549 31118 31118 F DEBUG   :     x20 0000007908c662a0  x21 000000000000003c  x22 b400007a7e650190  x23 0000007908c660b0
06-18 16:11:47.549 31118 31118 F DEBUG   :     x24 0000000000000000  x25 0000000000000000  x26 0000007908c67c00  x27 000000000000012c
06-18 16:11:47.549 31118 31118 F DEBUG   :     x28 b400007a2e613388  x29 0000007908c65f30
06-18 16:11:47.549 31118 31118 F DEBUG   :     lr  00000077ca3ff09c  sp  0000007908c65ea0  pc  00000077ca3ff070  pst 0000000080000000
06-18 16:11:47.549 31118 31118 F DEBUG   : 30 total frames
06-18 16:11:47.549 31118 31118 F DEBUG   : backtrace:
06-18 16:11:47.549 31118 31118 F DEBUG   :       #00 pc 00000000009fa070  /data/app/~~5o-AHNffC8fp5tg9QROXfQ==/io.ente.photos.independent-_hK-rsqz-LT1fhX16peg4A==/lib/arm64/libonnxruntime.so (BuildId: 2e821a251292a43fb57cb005cf4be6686c138da8)
06-18 16:11:47.549 31118 31118 F DEBUG   :       #01 pc 00000000009e5bdc  /data/app/~~5o-AHNffC8fp5tg9QROXfQ==/io.ente.photos.independent-_hK-rsqz-LT1fhX16peg4A==/lib/arm64/libonnxruntime.so (BuildId: 2e821a251292a43fb57cb005cf4be6686c138da8)
06-18 16:11:47.549 31118 31118 F DEBUG   :       #02 pc 00000000009e5468  /data/app/~~5o-AHNffC8fp5tg9QROXfQ==/io.ente.photos.independent-_hK-rsqz-LT1fhX16peg4A==/lib/arm64/libonnxruntime.so (BuildId: 2e821a251292a43fb57cb005cf4be6686c138da8)
06-18 16:11:47.549 31118 31118 F DEBUG   :       #03 pc 0000000000a0dc10  /data/app/~~5o-AHNffC8fp5tg9QROXfQ==/io.ente.photos.independent-_hK-rsqz-LT1fhX16peg4A==/lib/arm64/libonnxruntime.so (BuildId: 2e821a251292a43fb57cb005cf4be6686c138da8)
06-18 16:11:47.549 31118 31118 F DEBUG   :       #04 pc 0000000000a0d78c  /data/app/~~5o-AHNffC8fp5tg9QROXfQ==/io.ente.photos.independent-_hK-rsqz-LT1fhX16peg4A==/lib/arm64/libonnxruntime.so (BuildId: 2e821a251292a43fb57cb005cf4be6686c138da8)
06-18 16:11:47.549 31118 31118 F DEBUG   :       #05 pc 0000000000a0ee5c  /data/app/~~5o-AHNffC8fp5tg9QROXfQ==/io.ente.photos.independent-_hK-rsqz-LT1fhX16peg4A==/lib/arm64/libonnxruntime.so (BuildId: 2e821a251292a43fb57cb005cf4be6686c138da8)
06-18 16:11:47.549 31118 31118 F DEBUG   :       #06 pc 00000000003c91c8  /data/app/~~5o-AHNffC8fp5tg9QROXfQ==/io.ente.photos.independent-_hK-rsqz-LT1fhX16peg4A==/lib/arm64/libonnxruntime.so (BuildId: 2e821a251292a43fb57cb005cf4be6686c138da8)
06-18 16:11:47.549 31118 31118 F DEBUG   :       #07 pc 000000000039e8b4  /data/app/~~5o-AHNffC8fp5tg9QROXfQ==/io.ente.photos.independent-_hK-rsqz-LT1fhX16peg4A==/lib/arm64/libonnxruntime.so (BuildId: 2e821a251292a43fb57cb005cf4be6686c138da8)
06-18 16:11:47.549 31118 31118 F DEBUG   :       #08 pc 0000000000b21694  /data/app/~~5o-AHNffC8fp5tg9QROXfQ==/io.ente.photos.independent-_hK-rsqz-LT1fhX16peg4A==/lib/arm64/libapp.so (BuildId: 6fc4a6dcc1628c6905ec2b43ba89d91c)
06-18 16:11:47.549 31118 31118 F DEBUG   :       #09 pc 0000000000d343f0  /data/app/~~5o-AHNffC8fp5tg9QROXfQ==/io.ente.photos.independent-_hK-rsqz-LT1fhX16peg4A==/lib/arm64/libapp.so (BuildId: 6fc4a6dcc1628c6905ec2b43ba89d91c)
06-18 16:11:47.549 31118 31118 F DEBUG   :       #10 pc 0000000000d33d04  /data/app/~~5o-AHNffC8fp5tg9QROXfQ==/io.ente.photos.independent-_hK-rsqz-LT1fhX16peg4A==/lib/arm64/libapp.so (BuildId: 6fc4a6dcc1628c6905ec2b43ba89d91c)
06-18 16:11:47.549 31118 31118 F DEBUG   :       #11 pc 000000000117d7d0  /data/app/~~5o-AHNffC8fp5tg9QROXfQ==/io.ente.photos.independent-_hK-rsqz-LT1fhX16peg4A==/lib/arm64/libapp.so (BuildId: 6fc4a6dcc1628c6905ec2b43ba89d91c)
06-18 16:11:47.549 31118 31118 F DEBUG   :       #12 pc 000000000119bf08  /data/app/~~5o-AHNffC8fp5tg9QROXfQ==/io.ente.photos.independent-_hK-rsqz-LT1fhX16peg4A==/lib/arm64/libapp.so (BuildId: 6fc4a6dcc1628c6905ec2b43ba89d91c)
06-18 16:11:47.549 31118 31118 F DEBUG   :       #13 pc 00000000016a6460  /data/app/~~5o-AHNffC8fp5tg9QROXfQ==/io.ente.photos.independent-_hK-rsqz-LT1fhX16peg4A==/lib/arm64/libapp.so (BuildId: 6fc4a6dcc1628c6905ec2b43ba89d91c)
06-18 16:11:47.549 31118 31118 F DEBUG   :       #14 pc 0000000000b33850  /data/app/~~5o-AHNffC8fp5tg9QROXfQ==/io.ente.photos.independent-_hK-rsqz-LT1fhX16peg4A==/lib/arm64/libapp.so (BuildId: 6fc4a6dcc1628c6905ec2b43ba89d91c)
06-18 16:11:47.549 31118 31118 F DEBUG   :       #15 pc 0000000000b33748  /data/app/~~5o-AHNffC8fp5tg9QROXfQ==/io.ente.photos.independent-_hK-rsqz-LT1fhX16peg4A==/lib/arm64/libapp.so (BuildId: 6fc4a6dcc1628c6905ec2b43ba89d91c)
06-18 16:11:47.550 31118 31118 F DEBUG   :       #16 pc 0000000000b3370c  /data/app/~~5o-AHNffC8fp5tg9QROXfQ==/io.ente.photos.independent-_hK-rsqz-LT1fhX16peg4A==/lib/arm64/libapp.so (BuildId: 6fc4a6dcc1628c6905ec2b43ba89d91c)
06-18 16:11:47.550 31118 31118 F DEBUG   :       #17 pc 0000000000b23f80  /data/app/~~5o-AHNffC8fp5tg9QROXfQ==/io.ente.photos.independent-_hK-rsqz-LT1fhX16peg4A==/lib/arm64/libapp.so (BuildId: 6fc4a6dcc1628c6905ec2b43ba89d91c)
06-18 16:11:47.550 31118 31118 F DEBUG   :       #18 pc 0000000000c3f034  /data/app/~~5o-AHNffC8fp5tg9QROXfQ==/io.ente.photos.independent-_hK-rsqz-LT1fhX16peg4A==/lib/arm64/libflutter.so (BuildId: 1445521dbe2e00121e10e2ebe6a1a8f1b78cf532)
06-18 16:11:47.550 31118 31118 F DEBUG   :       #19 pc 0000000000df04fc  /data/app/~~5o-AHNffC8fp5tg9QROXfQ==/io.ente.photos.independent-_hK-rsqz-LT1fhX16peg4A==/lib/arm64/libflutter.so (BuildId: 1445521dbe2e00121e10e2ebe6a1a8f1b78cf532)
06-18 16:11:47.550 31118 31118 F DEBUG   :       #20 pc 0000000000ba81fc  /data/app/~~5o-AHNffC8fp5tg9QROXfQ==/io.ente.photos.independent-_hK-rsqz-LT1fhX16peg4A==/lib/arm64/libflutter.so (BuildId: 1445521dbe2e00121e10e2ebe6a1a8f1b78cf532)
06-18 16:11:47.550 31118 31118 F DEBUG   :       #21 pc 0000000000861964  /data/app/~~5o-AHNffC8fp5tg9QROXfQ==/io.ente.photos.independent-_hK-rsqz-LT1fhX16peg4A==/lib/arm64/libflutter.so (BuildId: 1445521dbe2e00121e10e2ebe6a1a8f1b78cf532)
06-18 16:11:47.550 31118 31118 F DEBUG   :       #22 pc 00000000008654b8  /data/app/~~5o-AHNffC8fp5tg9QROXfQ==/io.ente.photos.independent-_hK-rsqz-LT1fhX16peg4A==/lib/arm64/libflutter.so (BuildId: 1445521dbe2e00121e10e2ebe6a1a8f1b78cf532)
06-18 16:11:47.550 31118 31118 F DEBUG   :       #23 pc 000000000000f63c  /system/lib64/libutils.so (android::Looper::pollOnce(int, int*, int*, void**)+856) (BuildId: 30fb9ccffaff83282118eb2597dd4631)
06-18 16:11:47.550 31118 31118 F DEBUG   :       #24 pc 0000000000019de0  /system/lib64/libandroid.so (ALooper_pollOnce+100) (BuildId: afd7c304b01296ae1a8e345f8e27fcc1)
06-18 16:11:47.550 31118 31118 F DEBUG   :       #25 pc 00000000008655c4  /data/app/~~5o-AHNffC8fp5tg9QROXfQ==/io.ente.photos.independent-_hK-rsqz-LT1fhX16peg4A==/lib/arm64/libflutter.so (BuildId: 1445521dbe2e00121e10e2ebe6a1a8f1b78cf532)
06-18 16:11:47.550 31118 31118 F DEBUG   :       #26 pc 0000000000863674  /data/app/~~5o-AHNffC8fp5tg9QROXfQ==/io.ente.photos.independent-_hK-rsqz-LT1fhX16peg4A==/lib/arm64/libflutter.so (BuildId: 1445521dbe2e00121e10e2ebe6a1a8f1b78cf532)
06-18 16:11:47.550 31118 31118 F DEBUG   :       #27 pc 0000000000863460  /data/app/~~5o-AHNffC8fp5tg9QROXfQ==/io.ente.photos.independent-_hK-rsqz-LT1fhX16peg4A==/lib/arm64/libflutter.so (BuildId: 1445521dbe2e00121e10e2ebe6a1a8f1b78cf532)
06-18 16:11:47.550 31118 31118 F DEBUG   :       #28 pc 00000000000bf1f4  /apex/com.android.runtime/lib64/bionic/libc.so (__pthread_start(void*)+208) (BuildId: 011e1f176d34c907f9e683504c06b67c)
06-18 16:11:47.550 31118 31118 F DEBUG   :       #29 pc 000000000005d984  /apex/com.android.runtime/lib64/bionic/libc.so (__start_thread+64) (BuildId: 011e1f176d34c907f9e683504c06b67c)

To reproduce

The issue is reproducible by letting the app continuously run inference. Since this is happening in the app, it's a bit hard to give a clear and easy MRE. If nothing comes up from the error logs alone I'll try to create a dummy app that reproduces the issue and share the code for it here.

Urgency

I don't know how urgent this issue is to ORT, but for our app it's quite urgent.

Platform

Android

OS Version

Android 14 (and other versions)

ONNX Runtime Installation

Built from Source

Compiler Version (if 'Built from Source')

No response

Package Name (if 'Released Package')

None

ONNX Runtime Version or Commit ID

v1.15.0

ONNX Runtime API

C++/C

Architecture

ARM64

Execution Provider

Default CPU

Execution Provider Library Version

No response

skottmckay commented 1 week ago

Hard to say without a stack trace with symbol names.

ORT will do most allocations during model initialization and the first inference. After that it's using a cache for memory so segfaults would typically be an out-of-memory scenario or bad input (e.g. input tensor is freed while ORT is using it).

If you're building from source, can you build a debug version? You may need to ensure the Android build doesn't strip the symbols from the binary though, as it typically does.

Does the issue happen if you run on the Android emulator? Would be easier to debug if it did.

Another option would be to copy onnxruntime_perf_test to the phone using adb (use /data/local/tmp), along with the model, and run it there. You can specify the number of iterations or the amount of time to run for, and it can generate dummy input data.

laurenspriem commented 1 week ago

Hi @skottmckay thanks for your response.

I have created an MRE in the form of a demo app that has the bug. Please check out this repo. The bug is reproducible on the Android emulator: it crashes anywhere in the range of 100-1000 inference runs, which should only take a few minutes to reach. Does this help in debugging?

I would like to provide a stack trace of the crash also, but I don't know how to get that on the native layer. Any pointers you can give me for that? In any case, I appreciate the help :)

Windsander commented 1 week ago

This looks like the same issue as https://github.com/microsoft/onnxruntime/issues/21097

which I solved by including the generated header files. In my case, it was caused by function mapping. Maybe you can try that. Hope it helps. 0x0

skottmckay commented 1 week ago

@laurenspriem is it reproducible by running onnxruntime_perf_test in a shell on the emulator? If so that would rule out the issue being in the flutter plugin you're using (which we don't own).

Use adb push <file> /data/local/tmp to copy onnxruntime_perf_test and your model to /data/local/tmp. In adb shell, run chmod +x /data/local/tmp/onnxruntime_perf_test to make it executable, then cd /data/local/tmp. Running ./onnxruntime_perf_test -I -r 2000 <model.onnx> will run the model 2000 times, generating random input that matches the model inputs. If that does not crash, most likely the issue is with the Flutter plugin.
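For reference, the steps above could be scripted roughly as follows. This is only a sketch: BIN, MODEL, RUNS and DRY_RUN are placeholder names introduced here, not part of ORT or adb, and the default paths are assumptions. By default the script only prints the adb commands; set DRY_RUN=0 to actually execute them against a connected device or emulator.

```shell
#!/bin/sh
# Sketch of the adb steps described above.
# BIN, MODEL, RUNS and DRY_RUN are hypothetical variables for this example.
BIN="${BIN:-./onnxruntime_perf_test}"   # locally built perf test binary (assumed path)
MODEL="${MODEL:-./model.onnx}"          # model to test (assumed path)
RUNS="${RUNS:-2000}"                    # value passed to -r
DEVICE_DIR=/data/local/tmp

# Print the command in dry-run mode (the default), otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

perf_test_on_device() {
  # Copy the binary and the model to the device.
  run adb push "$BIN" "$DEVICE_DIR/"
  run adb push "$MODEL" "$DEVICE_DIR/"
  # Make the binary executable.
  run adb shell chmod +x "$DEVICE_DIR/onnxruntime_perf_test"
  # Run from /data/local/tmp with generated input (-I) for RUNS iterations (-r).
  run adb shell "cd $DEVICE_DIR && ./onnxruntime_perf_test -I -r $RUNS $(basename "$MODEL")"
}

perf_test_on_device
```

If the run completes without a crash, that points away from ORT itself and toward the plugin layer, as noted above.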

May be possible to get symbols using ndk-stack: https://developer.android.com/ndk/guides/ndk-stack.html
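A rough sketch of how ndk-stack is typically invoked, per the linked guide. NDK_STACK and SYM_DIR are placeholder names introduced here, and the symbol directory is an assumption: it must contain the unstripped libonnxruntime.so for the crashing ABI (arm64 in the log above). The helper only prints the command line so it can be inspected before running.

```shell
#!/bin/sh
# Sketch: build the ndk-stack command line for symbolizing a native crash.
# NDK_STACK and SYM_DIR are hypothetical variables for this example.
NDK_STACK="${NDK_STACK:-ndk-stack}"          # ndk-stack from the NDK (assumed to be on PATH)
SYM_DIR="${SYM_DIR:-./obj/local/arm64-v8a}"  # assumed directory with unstripped .so files

symbolize_cmd() {
  # $1 is an optional saved crash log; with no argument, read live logcat.
  if [ -n "${1:-}" ]; then
    echo "$NDK_STACK -sym $SYM_DIR -dump $1"
  else
    echo "adb logcat | $NDK_STACK -sym $SYM_DIR"
  fi
}

symbolize_cmd            # live: pipe logcat through ndk-stack
symbolize_cmd crash.txt  # offline: symbolize a saved logcat or tombstone file
```

Symbol names only resolve if the BuildId of the unstripped library matches the one in the backtrace, so the symbols need to come from the same build that was shipped in the app.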

laurenspriem commented 1 week ago

I am trying to run onnxruntime_perf_test in the emulator as you suggested. However, it stops and gives me the following text back:

/onnxruntime/onnxruntime/test/onnx/TestCase.cc:705 OnnxTestCase::OnnxTestCase(const std::string &, std::unique_ptr<TestModelInfo>, double, double) test case dir doesn't exist

Any clue what is going wrong?