onnx / onnx-mlir

Representation and Reference Lowering of ONNX Models in MLIR Compiler Infrastructure
Apache License 2.0

Java 1.8 models segmentation fault when doing inference using a java client #2947

Closed Sunny-Anand closed 4 weeks ago

Sunny-Anand commented 1 month ago

Java model failures: steps to reproduce the issue

Issue Symptom: Some ONNX models compiled as JAR files fail with a segmentation fault when used for inference in containers running openjdk-1.8, where onnx-mlir and its JNI were also compiled with openjdk-1.8. The list of affected models is below.

Steps to reproduce on the community image:

  1. By default, the onnx-mlir container uses openjdk-11 to compile models into JAR files for inference inside a Java application (https://github.com/onnx/onnx-mlir/blob/main/docker/Dockerfile.llvm-project). This can be verified from the META-INF of the JAR file. Update openjdk to 8 and rebuild onnx-mlir with openjdk 8.

  2. Run inference inside the container with a Java client running on openjdk 1.8 (https://github.com/IBM/zDLC/blob/main/code/deep_learning_compiler_run_model_example.java):

    Compile the model with the following onnx-mlir options: --O3 --EmitJNI --mtriple=s390x-ibm-loz --mcpu=z16 --maccel=NNPA ./version-RFB-320.onnx --onnx-op-stats TXT

    Run inference using the Java client built with openjdk 1.8:

root@07f587daae00:/workdir/onnx-mlir/build# java -version
openjdk version "1.8.0_422"
OpenJDK Runtime Environment (build 1.8.0_422-8u422-b05-1~22.04-b05)
OpenJDK 64-Bit Zero VM (build 25.422-b05, interpreted mode)

root@07f587daae00:/workdir/onnx-mlir/build# java -classpath /code/client/class:/data/version-RFB-320.jar -Djava.library.path=/code/client/bin modelzoo --file /data/version-RFB-320.tests --fc-parms 0.01,0.0,1,10 --validate
Iteration 0 dataset 0: Running
Segmentation fault (core dumped)
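
Step 1 above says the JDK level of a model JAR can be checked from its META-INF. As a complementary check, here is a minimal sketch that reads the class-file major version (bytes 6 and 7 of any .class file; 52 corresponds to Java 8, 55 to Java 11). The function names and the entry path in the usage comment are illustrative assumptions, not part of onnx-mlir. Note that this reports the bytecode target level, which can be lower than the JDK that did the build if a target option was used.

```shell
# Print the class-file major version stored in bytes 6-7 (big-endian)
# of a .class file. Header layout: 4-byte magic, 2-byte minor, 2-byte major.
class_file_major() {
  od -An -j6 -N2 -tu1 "$1" | awk '{ print $1 * 256 + $2 }'
}

# Map a major version to the Java release it targets (52 -> 8, 55 -> 11).
major_to_java() {
  echo $(( $1 - 44 ))
}

# Hypothetical usage (the entry name is a placeholder; list entries with `unzip -l`):
#   unzip -p /data/version-RFB-320.jar path/to/Some.class > /tmp/Some.class
#   major_to_java "$(class_file_major /tmp/Some.class)"
```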

Models Impacted:

pointilism-8 rain-princess-8 version-RFB-320 version-RFB-640 shufflenet-6 shufflenet-7 shufflenet-8 shufflenet-9 candy-8 candy-9 mosaic-8 mosaic-9 pointilism-9 rain-princess-9 udnie-8 udnie-9

Impact : Gates 0.4.3.0 release

gongsu832 commented 1 month ago

Pure Java code cannot segfault. The segfault must be in one of the following: the JVM itself, the onnx-mlir JNI wrapper, or the native model code.

Install gdb in the container and run the model with java -XX:OnError="gdb %p" .... When the segfault happens, you should be dropped into gdb, where you can run where to get a backtrace showing where the segfault occurred.

Sunny-Anand commented 1 month ago

@gongsu832 I ran the above command, but gdb is not invoked when the segfault occurs, so I cannot get a backtrace.

root@672f82d10170:/workdir# apt-get update
Hit:1 http://ports.ubuntu.com/ubuntu-ports jammy InRelease
Get:2 http://ports.ubuntu.com/ubuntu-ports jammy-updates InRelease [128 kB]
Hit:3 http://ports.ubuntu.com/ubuntu-ports jammy-backports InRelease
Get:4 http://ports.ubuntu.com/ubuntu-ports jammy-security InRelease [129 kB]
Get:5 http://ports.ubuntu.com/ubuntu-ports jammy-updates/universe s390x Packages [1107 kB]
Get:6 http://ports.ubuntu.com/ubuntu-ports jammy-updates/main s390x Packages [1130 kB]
Get:7 http://ports.ubuntu.com/ubuntu-ports jammy-security/universe s390x Packages [837 kB]
Fetched 3331 kB in 2s (1368 kB/s)
Reading package lists... Done
root@672f82d10170:/workdir# apt-get install gdb
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
gdb is already the newest version (12.1-0ubuntu1~22.04.2).
0 upgraded, 0 newly installed, 0 to remove and 11 not upgraded.
root@672f82d10170:/workdir# java -XX:OnError="gdb - %p" -classpath /code/client/class:/data/version-RFB-320.jar -Djava.library.path=/code/client/bin modelzoo --file /data/version-RFB-320.tests --fc-parms 0.01,0.0,1,10 --validate
Iteration 0 dataset 0: Running
Segmentation fault (core dumped)

gongsu832 commented 1 month ago

gdb requires the ptrace system capability (and maybe others) to be enabled in the container. Try running your container with --privileged and see if that helps.
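
To check from inside the container whether the ptrace capability is present before involving gdb, here is a minimal sketch. It assumes a Linux /proc filesystem; CAP_SYS_PTRACE is capability number 19, and the function name is illustrative. It only inspects the capability bit, so a seccomp filter could still block ptrace even when this check passes.

```shell
# Succeed if the given hex capability mask (for example the CapEff field
# from /proc/self/status) has CAP_SYS_PTRACE (bit 19) set.
has_sys_ptrace() {
  [ $(( (0x$1 >> 19) & 1 )) -eq 1 ]
}

# Usage inside the container:
#   has_sys_ptrace "$(awk '/^CapEff/ { print $2 }' /proc/self/status)" && echo "CAP_SYS_PTRACE present"
#   cat /proc/sys/kernel/yama/ptrace_scope   # Yama restriction level, if Yama is enabled
```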

Sunny-Anand commented 1 month ago

Thanks @gongsu832. I tried the above suggestion and a few more. I ran the container with --privileged, --cap-add=SYS_PTRACE, and --security-opt=seccomp=unconfined added to the command:

podman run --privileged --security-opt seccomp=unconfined --cap-add SYS_PTRACE --entrypoint bash -v /devfield/sunny/models:/data:Z -v /devfield/sunny/zdlc-main/dlc-automation/:/code/:Z  -it --name onnx-mlir-sunny-java18-test  onnx-mlir-dev:s390x-java8-test
apt-get update
apt-get install gdb
java -XX:OnError="gdb - %p" -classpath /code/client/class:/data/version-RFB-320.jar -Djava.library.path=/code/client/bin modelzoo --file /data/version-RFB-320.tests --fc-parms 0.01,0.0,1,10 --validate

Based on another suggestion, I also set kernel.yama.ptrace_scope=0 in /etc/sysctl.d/10-ptrace.conf, changing it from its initial setting, and reran:

 java -XX:OnError="gdb - %p" -classpath /code/client/class:/data/version-RFB-320.jar -Djava.library.path=/code/client/bin modelzoo --file /data/version-RFB-320.tests --fc-parms 0.01,0.0,1,10 --validate
Iteration 0 dataset 0: Running
Segmentation fault (core dumped)

gongsu832 commented 1 month ago

--privileged should be the only option you need. It gives all capabilities to the container. Can you verify that gdb is actually working in your container? Run gdb ls in your container and then enter run at the gdb prompt. What output do you get?

Sunny-Anand commented 1 month ago

root@a29e01bde6cd:/workdir# gdb ls
GNU gdb (Ubuntu 12.1-0ubuntu1~22.04.2) 12.1
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "s390x-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ls...
(No debugging symbols found in ls)
(gdb) run
Starting program: /usr/bin/ls
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/s390x-linux-gnu/libthread_db.so.1".
llvm-project  onnx-mlir
[Inferior 1 (process 501) exited normally]
(gdb)

gongsu832 commented 1 month ago

According to the Oracle doc https://docs.oracle.com/javase/8/docs/technotes/guides/troubleshoot/clopts001.html, it appears that -XX:OnError only catches fatal errors in the JVM itself, not in user code.

You can try running the java command directly under gdb with gdb --args java -cp ... and see if it gives something useful. Just be aware that the JVM actually makes use of SIGSEGV itself, so gdb might catch a "legitimate" segfault in the JVM instead of the segfault in user code.

gongsu832 commented 1 month ago

Also, the openjdk 1.8 JVM on z is interpreter-only, which means it is not capable of doing JIT. No production system should really be using it. Instead, you should try the IBM Semeru JVM https://developer.ibm.com/languages/java/semeru-runtimes/downloads/
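
The execution mode is visible in the java -version banner, as in the "Zero VM ... interpreted mode" line earlier in this thread. A small sketch that classifies a banner follows. It only recognizes the "interpreted mode" and "mixed mode" strings printed by OpenJDK builds, anything else is reported as unknown, and the function name is illustrative.

```shell
# Classify a `java -version` banner by the execution mode it reports.
jvm_exec_mode() {
  case "$1" in
    *"interpreted mode"*) echo "interpreted" ;;
    *"mixed mode"*)       echo "mixed" ;;
    *)                    echo "unknown" ;;
  esac
}

# Usage (java -version writes its banner to stderr):
#   jvm_exec_mode "$(java -version 2>&1)"
```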

Sunny-Anand commented 4 weeks ago

https://github.com/onnx/onnx-mlir/pull/2961 fixed the reported issue.