ninia / jep

Embed Python in Java

Segfault when running `np.linalg.inv` in Jep #474

jonathanindig closed this issue 1 year ago

jonathanindig commented 1 year ago

Describe the bug: Running `np.linalg.inv` on a large enough matrix (10x10 is too small; 100x100 seems to be enough) in the Jep console causes a segfault.

I ran into this on a variety of systems. Here's a reproduction with Docker:

$ docker run --rm -it ubuntu:22.04

--- now we're in the docker shell --- 

root@571f8d70fd83:/# uname -a
Linux 6a70263d3edc 5.10.104-linuxkit #1 SMP PREEMPT Thu Mar 17 17:05:54 UTC 2022 aarch64 aarch64 aarch64 GNU/Linux
root@571f8d70fd83:/# apt update
root@571f8d70fd83:/# apt install -y libssl-dev zlib1g-dev gdb gcc g++ make python3 python3-pip python3-dev python3.10-venv openjdk-17-jdk libhdf5-dev
root@571f8d70fd83:/# export JAVA_HOME="/usr/lib/jvm/java-17-openjdk-arm64/"
root@571f8d70fd83:/# pip install numpy jep
root@571f8d70fd83:/# ulimit -c unlimited # for core dumps
root@571f8d70fd83:/# jep
>>> import numpy as np
>>> np.linalg.inv(np.identity(1000))
Segmentation fault (core dumped)
root@571f8d70fd83:/# gdb $JAVA_HOME/bin/java core
GNU gdb (Ubuntu 12.1-0ubuntu1~22.04) 12.1
Copyright (C) 2022 Free Software Foundation, Inc.
... truncated ... 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".
Core was generated by `java -classpath /usr/local/lib/python3.10/dist-packages/jep/jep-4.1.1.jar -Djav'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000ffff2d6494b0 in dgetrf_parallel () from /usr/local/lib/python3.10/dist-packages/numpy/core/../../numpy.libs/libopenblas64_p-r0-cecebdce.3.21.so
[Current thread is 1 (Thread 0xffff8369f120 (LWP 6802))]
(gdb)

This can be mitigated by setting OPENBLAS_NUM_THREADS=1:

root@571f8d70fd83:/# OPENBLAS_NUM_THREADS=1 jep
>>> import numpy as np
>>> np.linalg.inv(np.identity(1000))
array([[1., 0., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 0., 1.]])

But this slows things down considerably compared to plain Python (seemingly by 2x to 10x, though we haven't fully benchmarked it).
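For anyone who cannot change how the process is launched, the same mitigation can probably be applied from Python itself. This is only a sketch: `limit_openblas_threads` is a hypothetical helper, not part of Jep or NumPy, and it assumes OpenBLAS reads `OPENBLAS_NUM_THREADS` once, when NumPy first loads the library. Whether NumPy is still unloaded at this point inside a Jep interpreter has not been checked here.

```python
import os
import sys


def limit_openblas_threads(num_threads=1):
    """Set OPENBLAS_NUM_THREADS for this process (hypothetical helper).

    Returns True if NumPy had not been imported yet, i.e. the setting
    was made early enough to be seen when OpenBLAS is loaded.
    """
    numpy_already_loaded = "numpy" in sys.modules
    # os.environ assignment also updates the C-level environment (putenv),
    # which is what a native library such as OpenBLAS reads.
    os.environ["OPENBLAS_NUM_THREADS"] = str(num_threads)
    return not numpy_already_loaded


set_in_time = limit_openblas_threads(1)
if not set_in_time:
    print("numpy was already imported; the setting may have no effect")

# Import numpy only after the variable is set:
# import numpy as np
```

This carries the same cost as the shell form above: with one thread, OpenBLAS no longer parallelizes the factorization.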

Any idea why this is happening with Jep and whether there's anything else we can do? Thanks!

bsteffensmeier commented 1 year ago

I am having some trouble recreating the issue on my machine. Based on the comments on an OpenBLAS issue I found, I recommend adding -Xss4096k to the java command in the jep script.

jonathanindig commented 1 year ago

Thanks @bsteffensmeier - that does seem to work! 🎉

I had found a similar issue (actually your comment, https://github.com/ninia/jep/issues/241#issuecomment-753697396) and tried setting that on the machine I was using, but it didn't work, so I gave up on that front. However, @jeremyrsmith tried setting -Xss8192k on that machine, which did the trick. The machines we were using have 8 cores, so maybe they need more stack memory (in case anyone else stumbles upon this and -Xss4096k doesn't work for them).

Thanks for your help!
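A possible way to pick a starting value for -Xss, offered as an untested hypothesis rather than something confirmed in this thread: plain Python runs on the process's main thread, whose stack is bounded by the `RLIMIT_STACK` resource limit, while the thread running Jep gets the JVM's thread stack size. If that difference is why plain Python does not crash, the limit reported below (run in plain Python, not in Jep) would be a reasonable value to compare against. The helper name `main_thread_stack_limit_kib` is made up for this sketch.

```python
import resource


def main_thread_stack_limit_kib():
    """Return the soft RLIMIT_STACK in KiB, or None if it is unlimited."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_STACK)
    if soft == resource.RLIM_INFINITY:
        return None
    return soft // 1024


if __name__ == "__main__":
    limit = main_thread_stack_limit_kib()
    if limit is None:
        print("main-thread stack limit: unlimited")
    else:
        # Candidate value to try with the JVM, e.g. -Xss<limit>k.
        print(f"main-thread stack limit: {limit} KiB")
```

The `resource` module is only available on Unix-like systems, which matches the Linux setup in the reproduction.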