luhenry / netlib

A high-performance, hardware-accelerated implementation of Netlib in Java

segfault on 2.2.0 in dgetrf on ubuntu x86_64 #2

Open dlwh opened 3 years ago

dlwh commented 3 years ago
import dev.ludovic.netlib.LAPACK;
import org.netlib.util.intW;

class Main {
  public static void main(String[] args) {
    double[] arr = new double[400];
    int[] piv = new int[20];
    intW info = new intW(0);
    LAPACK.getInstance().dgetrf(20, 20, arr, 20, piv, info);
  } 
}

reproduces in OpenJDK 64-Bit Server VM, Java 1.8.0_292 and OpenJDK 64-Bit Server VM, Java 16.0.1

There aren't any debug symbols and I'm no expert on assembly, but this is what I'm getting. The first instruction is where the segfault occurs.


0x7fffd85a262d  mov    (%rsi),%eax
--
0x7fffd85a262f  lea    0x30(%rbp),%rsi
0x7fffd85a2633  mov    $0x10000,%eax
0x7fffd85a2638  and    0x4(%rsi),%eax
0x7fffd85a263b  cmp    $0x10000,%eax
0x7fffd85a2641  jne    0x7fffd85a26ce
0x7fffd85a2647  mov    $0xe0,%eax
0x7fffd85a264c  and    0x100(%rbp),%eax
0x7fffd85a2652  cmp    $0xe0,%eax
0x7fffd85a2658  jne    0x7fffd85a26ce
0x7fffd85a265e  lea    0x10(%rbp),%rsi
0x7fffd85a2662  mov    (%rsi),%eax
0x7fffd85a2664  cmp    $0x50654,%eax
0x7fffd85a266a  je     0x7fffd85a26ce
0x7fffd85a2670  lea    0x188(%rbp),%rsi
0x7fffd85a2677  vmovdqu32 %zmm0,(%rsi)
0x7fffd85a267d  vmovdqu32 %zmm7,0x40(%rsi)
0x7fffd85a2684  vmovdqu32 %zmm8,0x80(%rsi)
0x7fffd85a268b  vmovdqu32 %zmm31,0xc0(%rsi)
dlwh commented 3 years ago

also reproduces in version 2.0.0

dlwh commented 3 years ago

fwiw, i get a segfault for any dimension >= 18, but not before

dlwh commented 3 years ago

Also get a failure in sgetrf for dim >= 10

luhenry commented 3 years ago

Hi @dlwh, let me try to reproduce that locally; it's the first time I've seen it.

luhenry commented 3 years ago

I can reproduce with OpenBLAS, but not with Intel MKL. I also can only reproduce if OPENBLAS_NUM_THREADS is greater than 1. I'm now looking at how dgetrf_parallel (the function in which the SIGSEGV is triggered) is invoked, and why it triggers anything.

luhenry commented 3 years ago

Here is what I'm observing. When calling dgetrf_ from Java, we have $rsp = 0x7ffff599d0e8. It then calls dgetrf_parallel a first time, which allocates arrays on the stack and moves $rsp down to 0x7ffff5918ef8 (541,168 bytes lower). It then calls dgetrf_parallel recursively a second time, which allocates arrays on the stack again and moves $rsp down to 0x7ffff5894d80 (another 541,048 bytes lower). It then hits a SIGSEGV when trying to store variables on the stack [1].
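As a quick check of the arithmetic above, here is a small Java sketch that recomputes the stack consumed by each call from the quoted $rsp values. The class and method names are hypothetical and only used for illustration; the three addresses are the ones observed in gdb.

```java
// Illustrative only: recomputes the stack deltas from the $rsp values quoted above.
class StackDeltas {
    static final long RSP_AT_ENTRY = 0x7ffff599d0e8L; // $rsp when dgetrf_ is called from Java
    static final long RSP_FIRST    = 0x7ffff5918ef8L; // $rsp after the first dgetrf_parallel call
    static final long RSP_SECOND   = 0x7ffff5894d80L; // $rsp after the recursive dgetrf_parallel call

    // The stack grows down, so consumption is the higher address minus the lower one.
    static long consumed(long before, long after) {
        return before - after;
    }

    public static void main(String[] args) {
        long first = consumed(RSP_AT_ENTRY, RSP_FIRST);
        long second = consumed(RSP_FIRST, RSP_SECOND);
        System.out.println("first call:  " + first + " bytes");            // 541168
        System.out.println("second call: " + second + " bytes");           // 541048
        System.out.println("total:       " + (first + second) + " bytes"); // 1082216
    }
}
```

The two calls together consume 1,082,216 bytes, which is already more than a 1 MiB (1,048,576-byte) stack.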

When accessing the current thread's stack size and stack base, we can clearly see that this is indeed a stack overflow:

(gdb) p (Thread::_thr_current)->_stack_size
$3 = 1052672
(gdb) p (Thread::_thr_current)->_stack_base
$4 = (address) 0x7ffff59a5000 "\177ELF\002\001\001\003"

(0x7ffff59a5000 - 1052672 = 0x7ffff58a4000 is the lowest valid stack address, and $rsp = 0x7ffff5894d80 on the last call to dgetrf_parallel is below it, i.e. outside the thread's stack)
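The bounds check can be written out as a short Java sketch. It assumes, as the gdb output suggests, that _stack_base is the high end of the stack and that the stack spans [base - size, base). The class and method names are hypothetical.

```java
// Illustrative only: checks whether a stack pointer lies below the thread's stack.
class StackBounds {
    static final long STACK_BASE = 0x7ffff59a5000L; // _stack_base from gdb (assumed to be the high end)
    static final long STACK_SIZE = 1052672L;        // _stack_size from gdb
    static final long RSP_LAST   = 0x7ffff5894d80L; // $rsp on the last call to dgetrf_parallel

    // Lowest address that still belongs to the stack.
    static long lowestValidAddress(long base, long size) {
        return base - size;
    }

    // Positive result: number of bytes by which rsp has run past the end of the stack.
    static long overflowBytes(long base, long size, long rsp) {
        return lowestValidAddress(base, size) - rsp;
    }

    public static void main(String[] args) {
        long limit = lowestValidAddress(STACK_BASE, STACK_SIZE);
        System.out.println("stack limit: 0x" + Long.toHexString(limit)); // 0x7ffff58a4000
        System.out.println("overflow:    "
                + overflowBytes(STACK_BASE, STACK_SIZE, RSP_LAST) + " bytes"); // 62080
    }
}
```

With the values above, $rsp ends up 62,080 bytes below the stack limit.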

Now, onto figuring out why dgetrf_parallel allocates so much space on the stack, and whether it's reproducible with calls to liblapack.so straight from C.

Also, when setting -Xss10M (set the stack size to 10 MB), I can't reproduce the issue.
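If changing the JVM flags is not an option, a similar effect can in principle be obtained for a single call by running it on a thread created with an explicit stack size. The sketch below uses the standard `Thread(ThreadGroup, Runnable, String, long)` constructor; the helper name `runWithStack` is hypothetical. Note that the Javadoc describes the stack size argument as a hint that the platform may ignore, and that this has not been verified against this specific crash; it is an assumption by analogy with -Xss10M.

```java
// Illustrative only: runs a task on a thread that requests a larger stack.
class BigStackRunner {
    // Runs the task on a new thread with the requested stack size and waits for it to finish.
    // The stack size is a hint; the JVM or OS may round it or ignore it.
    static void runWithStack(Runnable task, long stackBytes) {
        Thread worker = new Thread(null, task, "big-stack-worker", stackBytes);
        worker.start();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status for the caller
            throw new IllegalStateException("interrupted while waiting for worker", e);
        }
    }

    public static void main(String[] args) {
        // The task here is a placeholder; the native LAPACK call would go inside the lambda.
        runWithStack(
                () -> System.out.println("running on " + Thread.currentThread().getName()),
                10L * 1024 * 1024); // request 10 MB, matching -Xss10M
    }
}
```

In the reproduction program, the call to `LAPACK.getInstance().dgetrf(...)` would be placed inside the lambda passed to `runWithStack`.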

[1]

   0x00007fff2a145040 <+0>:     lea    0x8(%rsp),%r10
   0x00007fff2a145045 <+5>:     and    $0xffffffffffffff80,%rsp
   0x00007fff2a145049 <+9>:     mov    %rdi,%rax
   0x00007fff2a14504c <+12>:    mov    %rdx,%rsi
   0x00007fff2a14504f <+15>:    pushq  -0x8(%r10)
   0x00007fff2a145053 <+19>:    push   %rbp
   0x00007fff2a145054 <+20>:    mov    %rsp,%rbp // $rbp = $rsp
   0x00007fff2a145057 <+23>:    push   %r15
   0x00007fff2a145059 <+25>:    push   %r14
   0x00007fff2a14505b <+27>:    push   %r13
   0x00007fff2a14505d <+29>:    push   %r12
   0x00007fff2a14505f <+31>:    push   %r10
   0x00007fff2a145061 <+33>:    push   %rbx
   0x00007fff2a145062 <+34>:    sub    $0x840c0,%rsp // allocate stack frame of 0x840c0 = 540,864 bytes
=> 0x00007fff2a145069 <+41>:    mov    %rdi,-0x83fd0(%rbp) // $rbp[-0x83fd0] = $rdi // stack grows down so access with negative index is normal
luhenry commented 3 years ago

@dlwh this issue is a repeat of a previously encountered issue with Breeze and netlib-java (so prior to my change). I opened an issue on OpenBLAS.

In the meantime, the workarounds are the following:

I'm exploring the licensing implications of packaging a custom OpenBLAS in the library to avoid having to install it locally, similarly to numpy. That might also be a longer-term solution for this specific issue.

dlwh commented 3 years ago

Huh ok. Thanks! netlib-java stopped working on ubuntu 20.04 since they stopped shipping gfortran3 and I didn't think to try
