Sunny-Anand closed 4 weeks ago
Pure Java code cannot segfault. The segfault should be in one of the following: the JVM itself, the onnx-mlir JNI wrapper, or the native model code.
Install gdb in the container and run the model with java -XX:OnError="gdb %p" ... When the segfault happens, you should be dropped into gdb, where you can run where to get a backtrace of where the segfault occurred.
@gongsu832 I ran the above command, but it does not drop into gdb or give a backtrace when the segfault occurs.
root@672f82d10170:/workdir# apt-get update
Hit:1 http://ports.ubuntu.com/ubuntu-ports jammy InRelease
Get:2 http://ports.ubuntu.com/ubuntu-ports jammy-updates InRelease [128 kB]
Hit:3 http://ports.ubuntu.com/ubuntu-ports jammy-backports InRelease
Get:4 http://ports.ubuntu.com/ubuntu-ports jammy-security InRelease [129 kB]
Get:5 http://ports.ubuntu.com/ubuntu-ports jammy-updates/universe s390x Packages [1107 kB]
Get:6 http://ports.ubuntu.com/ubuntu-ports jammy-updates/main s390x Packages [1130 kB]
Get:7 http://ports.ubuntu.com/ubuntu-ports jammy-security/universe s390x Packages [837 kB]
Fetched 3331 kB in 2s (1368 kB/s)
Reading package lists... Done
root@672f82d10170:/workdir# apt-get install gdb
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
gdb is already the newest version (12.1-0ubuntu1~22.04.2).
0 upgraded, 0 newly installed, 0 to remove and 11 not upgraded.
root@672f82d10170:/workdir# java -XX:OnError="gdb - %p" -classpath /code/client/class:/data/version-RFB-320.jar -Djava.library.path=/code/client/bin modelzoo --file /data/version-RFB-320.tests --fc-parms 0.01,0.0,1,10 --validate
Iteration 0 dataset 0: Running
Segmentation fault (core dumped)
gdb requires the ptrace (and maybe other) system capabilities to be enabled in the container. Try running your container with --privileged and see if that helps.
Thanks @gongsu832. I tried the above suggestion and also applied a few more.
Ran the container with the changed command, adding --privileged, --cap-add=SYS_PTRACE, and --security-opt=seccomp=unconfined:
podman run --privileged --security-opt seccomp=unconfined --cap-add SYS_PTRACE --entrypoint bash -v /devfield/sunny/models:/data:Z -v /devfield/sunny/zdlc-main/dlc-automation/:/code/:Z -it --name onnx-mlir-sunny-java18-test onnx-mlir-dev:s390x-java8-test
apt-get update
apt-get install gdb
java -XX:OnError="gdb - %p" -classpath /code/client/class:/data/version-RFB-320.jar -Djava.library.path=/code/client/bin modelzoo --file /data/version-RFB-320.tests --fc-parms 0.01,0.0,1,10 --validate
Based on another suggestion, also changed the setting in /etc/sysctl.d/10-ptrace.conf to kernel.yama.ptrace_scope=0, as it was set differently initially, and ran again:
java -XX:OnError="gdb - %p" -classpath /code/client/class:/data/version-RFB-320.jar -Djava.library.path=/code/client/bin modelzoo --file /data/version-RFB-320.tests --fc-parms 0.01,0.0,1,10 --validate
Iteration 0 dataset 0: Running
Segmentation fault (core dumped)
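The two settings discussed above (the ptrace capability and kernel.yama.ptrace_scope) can be checked from inside the container. The sketch below assumes a Linux container with /proc mounted. The helper name has_cap is hypothetical, and 19 is the bit number of CAP_SYS_PTRACE in the kernel's capability list.

```shell
#!/bin/sh
# has_cap MASK BIT: print "yes" if bit BIT is set in the hex capability MASK, else "no".
# (Hypothetical helper for illustration; not part of gdb, podman, or onnx-mlir.)
has_cap() {
    if [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]; then
        echo yes
    else
        echo no
    fi
}

# CapEff in /proc/self/status is the effective capability mask of this shell.
if [ -r /proc/self/status ]; then
    mask=$(awk '/^CapEff:/ {print $2}' /proc/self/status)
    if [ -n "$mask" ]; then
        echo "CAP_SYS_PTRACE: $(has_cap "$mask" 19)"
    fi
fi

# Yama ptrace_scope: 0 = classic ptrace permissions, 1 = descendants only,
# 2 = admin only, 3 = attaching disabled.
if [ -r /proc/sys/kernel/yama/ptrace_scope ]; then
    echo "ptrace_scope: $(cat /proc/sys/kernel/yama/ptrace_scope)"
fi
```

If CAP_SYS_PTRACE reports yes and gdb can run a simple program, a gdb session that never starts is more likely caused by the JVM option than by the container setup.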
--privileged should be the only option you need. It gives all capabilities to the container. Can you verify that gdb is actually working in your container? Run gdb ls in your container and then enter run at the gdb prompt. What output do you get?
root@a29e01bde6cd:/workdir# gdb ls
GNU gdb (Ubuntu 12.1-0ubuntu1~22.04.2) 12.1
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "s390x-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ls...
(No debugging symbols found in ls)
(gdb) run
Starting program: /usr/bin/ls
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/s390x-linux-gnu/libthread_db.so.1".
llvm-project onnx-mlir
[Inferior 1 (process 501) exited normally]
(gdb)
Reading the Oracle doc https://docs.oracle.com/javase/8/docs/technotes/guides/troubleshoot/clopts001.html, it appears that -XX:OnError is only for catching fatal errors in the JVM itself, not in the user code.
You can try to run the java command directly under gdb with gdb --args java -cp ... and see if it gives something useful. Just be aware that the JVM actually makes use of SIGSEGV itself, so gdb might catch a "legitimate" segfault in the JVM instead of the segfault in user code.
Also, the openjdk 1.8 JVM on z is interpreter-only, which means it is not capable of JIT compilation. No production system should really be using it. Instead, you should try the IBM Semeru JVM: https://developer.ibm.com/languages/java/semeru-runtimes/downloads/
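One way to check whether a given JVM is running interpreter-only is to look at the mode reported in the java -version banner. The sketch below is a minimal classifier: jvm_mode is a hypothetical helper name, and it only recognizes the "interpreted mode" and "mixed mode" phrases that OpenJDK builds print.

```shell
#!/bin/sh
# jvm_mode: read `java -version` output on stdin and classify the execution mode.
# (Hypothetical helper for illustration; it only matches OpenJDK-style banners.)
jvm_mode() {
    banner=$(cat)
    case "$banner" in
        *"interpreted mode"*) echo "interpreted-only" ;;
        *"mixed mode"*)       echo "jit-capable" ;;
        *)                    echo "unknown" ;;
    esac
}

# `java -version` writes its banner to stderr, so redirect it into the pipe.
if command -v java >/dev/null 2>&1; then
    java -version 2>&1 | jvm_mode
fi
```

Other JVMs may word their banner differently, in which case the helper reports unknown rather than guessing.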
https://github.com/onnx/onnx-mlir/pull/2961 fixed the reported issue.
Java models failure, Steps to reproduce the issue
Issue Symptom: Some ONNX models compiled as jar models fail with a segmentation fault when used for inference in containers with openjdk-1.8, when onnx-mlir and its JNI were also compiled with openjdk-1.8. The list of models can be found below.
Steps to reproduce on the community image:
By default, the onnx-mlir container uses openjdk-11 to compile models into jar-based model files for inferencing inside a Java application (https://github.com/onnx/onnx-mlir/blob/main/docker/Dockerfile.llvm-project). This can be verified using the META-INF of the jar file. Update openjdk to 8 and rebuild onnx-mlir with openjdk 8.
Run the inference inside the container using a Java client running on openjdk 1.8 (https://github.com/IBM/zDLC/blob/main/code/deep_learning_compiler_run_model_example.java).
Compile the model with the below onnx-mlir options
--O3 --EmitJNI --mtriple=s390x-ibm-loz --mcpu=z16 --maccel=NNPA ./version-RFB-320.onnx --onnx-op-stats TXT
Run the inference using the Java client, which is built with openjdk 1.8
root@07f587daae00:/workdir/onnx-mlir/build# java -version
openjdk version "1.8.0_422"
OpenJDK Runtime Environment (build 1.8.0_422-8u422-b05-1~22.04-b05)
OpenJDK 64-Bit Zero VM (build 25.422-b05, interpreted mode)
root@07f587daae00:/workdir/onnx-mlir/build# java -classpath /code/client/class:/data/version-RFB-320.jar -Djava.library.path=/code/client/bin modelzoo --file /data/version-RFB-320.tests --fc-parms 0.01,0.0,1,10 --validate
Iteration 0 dataset 0: Running
Segmentation fault (core dumped)
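The META-INF check mentioned in the steps above can be sketched as follows. This assumes the build JDK is recorded in one of the commonly used manifest attributes (Created-By, Build-Jdk, or Build-Jdk-Spec); which one is present, if any, depends on the tool that produced the jar. manifest_jdk is a hypothetical helper name, and unzip is assumed to be installed in the container.

```shell
#!/bin/sh
# manifest_jdk: read a MANIFEST.MF on stdin and print the attributes that
# commonly record the JDK used to build the jar.
# (Hypothetical helper for illustration; the attribute list is an assumption.)
manifest_jdk() {
    # Manifests are often written with CRLF line endings; strip the CR first.
    tr -d '\r' | grep -E '^(Created-By|Build-Jdk|Build-Jdk-Spec):'
}

# Usage, with the model jar from the run above:
#   unzip -p /data/version-RFB-320.jar META-INF/MANIFEST.MF | manifest_jdk
```

If none of these attributes is present, the helper prints nothing and the JDK version has to be determined another way.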
Models Impacted:
pointilism-8 rain-princess-8 version-RFB-320 version-RFB-640 shufflenet-6 shufflenet-7 shufflenet-8 shufflenet-9 candy-8 candy-9 mosaic-8 mosaic-9 pointilism-9 rain-princess-9 udnie-8 udnie-9
Impact: Gates the 0.4.3.0 release