Open dlwh opened 3 years ago
also reproduces in version 2.0.0
fwiw, i get a segfault for any dimension >= 18, but not before
Also get a failure in sgetrf for dim >= 10
Hi @dlwh, let me try to reproduce that locally; it's the first time I've seen it.
I can reproduce it with OpenBLAS, but not with Intel MKL, and only if OPENBLAS_NUM_THREADS is greater than 1. I'm now looking at how dgetrf_parallel (the function in which the SIGSEGV is triggered) is invoked, and why it crashes.
Here is what I'm observing. When dgetrf_ is called from Java, we have $rsp = 0x7ffff599d0e8. It then calls dgetrf_parallel a first time, which allocates arrays on the stack and moves $rsp to 0x7ffff5918ef8 (541,168 bytes lower). dgetrf_parallel then calls itself recursively, which allocates arrays on the stack again and moves $rsp to 0x7ffff5894d80 (another 541,048 bytes). It then hits a SIGSEGV when trying to store variables on the stack [1].
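The byte counts above follow directly from subtracting consecutive $rsp values. A minimal sketch of that arithmetic, using only the addresses quoted in this comment (the class and method names are hypothetical, chosen for illustration):

```java
// Sketch: recompute the stack consumed by each dgetrf_parallel call from the
// $rsp values quoted above. FrameSizes and consumed() are hypothetical names.
final class FrameSizes {
    /** Bytes consumed between two stack pointers (the stack grows downward). */
    static long consumed(long rspBefore, long rspAfter) {
        return rspBefore - rspAfter;
    }

    public static void main(String[] args) {
        long atDgetrf    = 0x7ffff599d0e8L; // $rsp when dgetrf_ is called from Java
        long afterFirst  = 0x7ffff5918ef8L; // $rsp after the first dgetrf_parallel call
        long afterSecond = 0x7ffff5894d80L; // $rsp after the recursive call

        System.out.println(consumed(atDgetrf, afterFirst));    // 541168
        System.out.println(consumed(afterFirst, afterSecond)); // 541048
    }
}
```

Both values are slightly larger than the 0x840c0 (540,864 byte) frame shown in [1], which is consistent with the return address, saved registers, and alignment padding pushed before the `sub`.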
When accessing the current thread's stack size and stack base, we can clearly see that this is indeed a stack overflow:
(gdb) p (Thread::_thr_current)->_stack_size
$3 = 1052672
(gdb) p (Thread::_thr_current)->_stack_base
$4 = (address) 0x7ffff59a5000 "\177ELF\002\001\001\003"
(0x7ffff59a5000 - 1052672 = 0x7ffff58a4000 is the lowest valid stack address, and $rsp = 0x7ffff5894d80 on the last call to dgetrf_parallel is below it.)
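As a sanity check, the comparison can be written out explicitly. This is a sketch under the assumption that `_stack_base` is the highest address of the thread's stack and `_stack_size` its length in bytes, as the gdb output suggests; the class and method names are hypothetical.

```java
// Sketch: check whether a stack pointer has fallen below the end of the
// thread's stack. StackOverflowCheck and its methods are hypothetical names.
final class StackOverflowCheck {
    /** Lowest valid address of a downward-growing stack. */
    static long stackLimit(long stackBase, long stackSize) {
        return stackBase - stackSize;
    }

    /** True when rsp points below the lowest valid stack address. */
    static boolean overflowed(long rsp, long stackBase, long stackSize) {
        return rsp < stackLimit(stackBase, stackSize);
    }

    public static void main(String[] args) {
        long stackBase = 0x7ffff59a5000L; // (Thread::_thr_current)->_stack_base
        long stackSize = 1052672L;        // (Thread::_thr_current)->_stack_size
        long rsp       = 0x7ffff5894d80L; // $rsp on the last dgetrf_parallel call

        long limit = stackLimit(stackBase, stackSize);
        System.out.println(Long.toHexString(limit));               // 7ffff58a4000
        System.out.println(overflowed(rsp, stackBase, stackSize)); // true
        System.out.println(limit - rsp);                           // 62080 bytes past the end
    }
}
```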
Now, onto figuring out why dgetrf_parallel allocates so much on the stack, and whether it's reproducible with calls to liblapack.so straight from C.
Also, when setting -Xss10M (which sets the stack size to 10 MB), I can't reproduce the issue.
[1]
0x00007fff2a145040 <+0>: lea 0x8(%rsp),%r10
0x00007fff2a145045 <+5>: and $0xffffffffffffff80,%rsp
0x00007fff2a145049 <+9>: mov %rdi,%rax
0x00007fff2a14504c <+12>: mov %rdx,%rsi
0x00007fff2a14504f <+15>: pushq -0x8(%r10)
0x00007fff2a145053 <+19>: push %rbp
0x00007fff2a145054 <+20>: mov %rsp,%rbp // $rbp = $rsp
0x00007fff2a145057 <+23>: push %r15
0x00007fff2a145059 <+25>: push %r14
0x00007fff2a14505b <+27>: push %r13
0x00007fff2a14505d <+29>: push %r12
0x00007fff2a14505f <+31>: push %r10
0x00007fff2a145061 <+33>: push %rbx
0x00007fff2a145062 <+34>: sub $0x840c0,%rsp // allocate stack frame of 0x840c0 = 540,864 bytes
=> 0x00007fff2a145069 <+41>: mov %rdi,-0x83fd0(%rbp) // $rbp[-0x83fd0] = $rdi // stack grows down so access with negative index is normal
@dlwh this issue is a repeat of a previously encountered issue with Breeze and netlib-java (so prior to my change). I opened an issue on OpenBLAS.
In the meantime, the workarounds are the following:
- Increase the size of the stack of Java threads with -Xss10M (sets the Java threads' stack size to 10 MB).
- Make sure OpenBLAS doesn't use the parallel implementation by defining the environment variable OPENBLAS_NUM_THREADS=1.
- Compile a custom version of OpenBLAS that unconditionally defines USE_ALLOC_HEAP at https://github.com/xianyi/OpenBLAS/blob/develop/lapack/getrf/getrf_parallel.c#L49
I'm exploring the licensing implications of packaging a custom OpenBLAS in the library to avoid having to install it locally, similarly to numpy. That might also be a longer-term solution for this specific issue.
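A variation on the first workaround, not mentioned in this thread, is to run only the LAPACK call on a dedicated thread that requests a larger stack, using the standard `Thread(ThreadGroup, Runnable, String, long stackSize)` constructor. The sketch below assumes the native call runs on the calling Java thread's stack; `BigStack`, `callWithStack`, and the thread name are hypothetical, and the lambda is a placeholder for the real factorization call. The Javadoc notes that `stackSize` is only a hint and may be ignored on some platforms, so this is not guaranteed to help everywhere.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Sketch: run a computation on a thread that requests a larger stack.
// BigStack and callWithStack are hypothetical names, not part of any library.
final class BigStack {
    static <T> T callWithStack(long stackBytes, Supplier<T> body) {
        AtomicReference<T> result = new AtomicReference<>();
        AtomicReference<Throwable> failure = new AtomicReference<>();

        // The fourth constructor argument is the requested stack size in bytes.
        Thread worker = new Thread(null, () -> {
            try {
                result.set(body.get());
            } catch (Throwable t) {
                failure.set(t);
            }
        }, "lapack-worker", stackBytes);

        worker.start();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for worker", e);
        }
        if (failure.get() != null) {
            throw new RuntimeException(failure.get());
        }
        return result.get();
    }

    public static void main(String[] args) {
        // Placeholder body; in practice this would be the call that reaches dgetrf_.
        int value = callWithStack(10L * 1024 * 1024, () -> 6 * 7);
        System.out.println(value); // 42
    }
}
```

Compared with -Xss10M, this keeps the default stack size for every other thread, at the cost of an extra thread hop per call.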
Huh ok. Thanks! netlib-java stopped working on ubuntu 20.04 since they stopped shipping gfortran3 and I didn't think to try
reproduces in OpenJDK 64-Bit Server VM, Java 1.8.0_292 and OpenJDK 64-Bit Server VM, Java 16.0.1
There aren't any debug symbols and I'm no expert on assembly, but this is what I'm getting. The first instruction is where the segfault occurs.