open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

Persistent communication request provokes segfault in Java bindings. #369

Closed osvegis closed 9 years ago

osvegis commented 9 years ago

MPI corrupts the memory space of Java. The following example provokes a segfault in the Java bindings. Please see the comments in the example.

import mpi.*;                                                                         
import java.nio.*;                                                                    

public class CrashTest
{
    private static final int STEPS = 1000000000,
        SIZE  = 4096;

    public static void main(String...args) throws MPIException
    {
        MPI.Init(args);
        int rank = MPI.COMM_WORLD.getRank();
        StringBuilder s = new StringBuilder();

        if(MPI.COMM_WORLD.getSize() != 2)
            throw new MPIException("I need exactly 2 processes.");

        // Only one buffer is needed,                                                 
        // but the test works ok if you only use one.                                 
        ByteBuffer sendBuf = MPI.newByteBuffer(SIZE),
            recvBuf = MPI.newByteBuffer(SIZE);

        Prequest req = MPI.COMM_WORLD.recvInit(recvBuf, SIZE, MPI.BYTE, 0, 0);

        for(int i = 1; i <= STEPS; i++)
            {
                // Allocate memory to provoke GC work and crash.                      
                // If you comment the following line, the test works ok.              
                (s = new StringBuilder(SIZE).append(i)).trimToSize();

                if(rank == 0)
                    {
                        if(i % 100000 == 0)
                            {
                                s.setLength(0);
                                System.out.println(i + s.toString());
                            }

                        MPI.COMM_WORLD.send(sendBuf, SIZE, MPI.BYTE, 1, 0);
                    }
                else
                    {
                        req.start();
                        req.waitFor();
                    }
            }

        MPI.Finalize();
    }

} // CrashTest
jsquyres commented 9 years ago

@osvegis I'm unable to get this test to fail for me.

Should I be seeing the memory usage of the process increase as the process runs?

I increased SIZE to 1048576, and after running 2 procs of this on a single 128GB server, the memory usage (as reported by "top") still only shows 5.1% memory usage for both java processes (it's been 5.1% the whole time):

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                 
19701 jsquyres  20   0 18.7g 3.2g  13m S 100.0  5.1   5:34.79 java                    
19702 jsquyres  20   0 18.7g 3.2g  13m S 100.0  5.1   5:34.47 java
jsquyres commented 9 years ago

Forgot to mention that I tried both master and the v1.8 branch -- so I'm guessing I'm not triggering the GC, and therefore not triggering the problem.

osvegis commented 9 years ago

I also tried on master and the test crashes. This problem may have existed since the very beginning of the Java bindings integration: ompi-1.7.5 also fails. I think the problem is not related to buffer size. Regarding memory usage, the process size may increase, but not necessarily.

jsquyres commented 9 years ago

Hmm. Ok. How do I get this test to reproduce, then? Is there a way that I can know for sure that the GC has fired?

shurickdaryin commented 9 years ago

@jsquyres You could insert an explicit call to System.gc() into the for loop (maybe guarded with if (i % 1024 == 0) so that it is not called on each iteration). Alternatively, you could use the -Xmx Java parameter to set the maximum heap size (see http://docs.oracle.com/cd/E19900-01/819-4742/abeik/index.html).
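For example, a minimal variant of the CrashTest loop above (the 1024 interval is arbitrary; progress printing omitted):

    for (int i = 1; i <= STEPS; i++) {
        // Allocate memory as before to create garbage.
        (s = new StringBuilder(SIZE).append(i)).trimToSize();

        // Force a collection periodically so that any GC-related
        // corruption shows up sooner.
        if (i % 1024 == 0)
            System.gc();

        if (rank == 0) {
            MPI.COMM_WORLD.send(sendBuf, SIZE, MPI.BYTE, 1, 0);
        } else {
            req.start();
            req.waitFor();
        }
    }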

osvegis commented 9 years ago

Don't worry about whether the GC fires. It fires when necessary: we allocate memory on each iteration, so if the GC never fired we would get an out-of-memory error. In my tests, sometimes Java crashes and sometimes an MPI call crashes. Maybe you can provoke the error on a more modest machine.

jsquyres commented 9 years ago

:-(

I'm still totally unable to reproduce this bug -- even on my OS X laptop (with only 16GB RAM).

I don't doubt that there is a real issue here, but I'm somewhat stymied until I can reproduce it reliably...

osvegis commented 9 years ago

My machine is 3570T, with only 8GB RAM. Linux shuttle 3.2.0-4-amd64 #1 SMP Debian 3.2.65-1+deb7u1 x86_64 GNU/Linux

jsquyres commented 9 years ago

My laptop is the only machine I have access to with only 16GB -- I don't have access to anything with less memory than that. :-(

FWIW, I even put in the call to System.gc(), but that didn't trigger the issue, either.

Any suggestions?

osvegis commented 9 years ago

What do you think about using a virtual machine?

jsquyres commented 9 years ago

Interesting idea. Let me see what I can fire up around here...

goodell commented 9 years ago

@osvegis is there any additional debugging you could do on your end? Could you send along the console output and backtrace(s) on the SEGV? I think with Hotspot there is a way to get a more verbose error log: http://www.oracle.com/technetwork/java/javase/felog-138657.html

You might also try running your test program with -Xcheck:jni (http://www.oracle.com/technetwork/java/javase/clopts-139448.html#gbmtq)

There are some other general troubleshooting suggestions here: http://www.oracle.com/technetwork/java/javase/crashes-137240.html

Jeff, you might also have trouble reproducing if you are running a different version of the JVM or are running on a different platform. It might be very difficult for you to directly reproduce this problem yourself.

osvegis commented 9 years ago

The following execution generated an error file: hs_err_pid####.log

$ mpirun -np 2 java -Xcheck:jni -cp build/classes/ CrashTest
100000
200000
300000
400000
500000
600000
700000
800000
900000
1000000
1100000
1200000
*** glibc detected *** java: free(): corrupted unsorted chunks: 
0x0000000002556420 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x76a16)[0x7fdf07437a16]
/lib/x86_64-linux-gnu/libc.so.6(cfree+0x6c)[0x7fdf0743c7bc]
/usr/lib/jvm/java-7-openjdk-amd64/jre/lib/amd64/server/libjvm.so(+0x2b5d80)[0x7fdf06671d80]
/usr/lib/jvm/java-7-openjdk-amd64/jre/lib/amd64/server/libjvm.so(+0x8b2ee0)[0x7fdf06c6eee0]
/usr/lib/jvm/java-7-openjdk-amd64/jre/lib/amd64/server/libjvm.so(+0x8dc4d8)[0x7fdf06c984d8]
/usr/lib/jvm/java-7-openjdk-amd64/jre/lib/amd64/server/libjvm.so(+0x7b04d2)[0x7fdf06b6c4d2]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x6b50)[0x7fdf07b65b50]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7fdf0749d70d]
======= Memory map: ========
00400000-00401000 r-xp 00000000 08:05 790211 
/usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java
00600000-00601000 r--p 00000000 08:05 790211 
/usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java
00601000-00602000 rw-p 00001000 08:05 790211 
/usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java
02097000-02673000 rw-p 00000000 00:00 0                                  
[heap]
77a200000-77b700000 rw-p 00000000 00:00 0
77b700000-784800000 rw-p 00000000 00:00 0
784800000-789a00000 rw-p 00000000 00:00 0
789a00000-7d6d00000 rw-p 00000000 00:00 0
7d6d00000-7fff80000 rw-p 00000000 00:00 0
7fff80000-800000000 ---p 00000000 00:00 0
7fdee75d1000-7fdee75dd000 r-xp 00000000 08:05 22154771 
/home/oscar/ompi-install/lib/openmpi/mca_dpm_orte.so
7fdee75dd000-7fdee77dd000 ---p 0000c000 08:05 22154771 
/home/oscar/ompi-install/lib/openmpi/mca_dpm_orte.so
7fdee77dd000-7fdee77de000 rw-p 0000c000 08:05 22154771 
/home/oscar/ompi-install/lib/openmpi/mca_dpm_orte.so
7fdee77de000-7fdeef7df000 rw-s 00000000 08:05 15597816 
/tmp/openmpi-sessions-oscar@shuttle_0/53059/1/shared_mem_pool.shuttle 
(deleted)
7fdeef7df000-7fdeefbe0000 rw-s 00000000 08:05 15597823 
/tmp/openmpi-sessions-oscar@shuttle_0/53059/1/0/vader_segment.shuttle.0
7fdeefbe0000-7fdeeffe1000 rw-s 00000000 08:05 15597819 
/tmp/openmpi-sessions-oscar@shuttle_0/53059/1/1/vader_segment.shuttle.1
7fdeeffe1000-7fdeeffe8000 r-xp 00000000 08:05 22154789 
/home/oscar/ompi-install/lib/openmpi/mca_osc_sm.so
7fdeeffe8000-7fdef01e8000 ---p 00007000 08:05 22154789 
/home/oscar/ompi-install/lib/openmpi/mca_osc_sm.so
7fdef01e8000-7fdef01ea000 rw-p 00007000 08:05 22154789 
/home/oscar/ompi-install/lib/openmpi/mca_osc_sm.so
7fdef01ea000-7fdef0207000 r-xp 00000000 08:05 22154791 
/home/oscar/ompi-install/lib/openmpi/mca_osc_pt2pt.so
7fdef0207000-7fdef0407000 ---p 0001d000 08:05 22154791 
/home/oscar/ompi-install/lib/openmpi/mca_osc_pt2pt.so
7fdef0407000-7fdef0409000 rw-p 0001d000 08:05 22154791 
/home/oscar/ompi-install/lib/openmpi/mca_osc_pt2pt.so
7fdef0409000-7fdef0434000 r-xp 00000000 08:05 22154757 
/home/oscar/ompi-install/lib/openmpi/mca_coll_tuned.so
7fdef0434000-7fdef0634000 ---p 0002b000 08:05 22154757 
/home/oscar/ompi-install/lib/openmpi/mca_coll_tuned.so
7fdef0634000-7fdef0635000 rw-p 0002b000 08:05 22154757 
/home/oscar/ompi-install/lib/openmpi/mca_coll_tuned.so
7fdef0635000-7fdef0636000 rw-p 00000000 00:00 0
7fdef0636000-7fdef063e000 r-xp 00000000 08:05 22154767 
/home/oscar/ompi-install/lib/openmpi/mca_coll_sm.so
7fdef063e000-7fdef083d000 ---p 00008000 08:05 22154767 
/home/oscar/ompi-install/lib/openmpi/mca_coll_sm.so
7fdef083d000-7fdef083e000 rw-p 00007000 08:05 22154767 
/home/oscar/ompi-install/lib/openmpi/mca_coll_sm.so
7fdef083e000-7fdef0841000 r-xp 00000000 08:05 22154769 
/home/oscar/ompi-install/lib/openmpi/mca_coll_self.so
7fdef0841000-7fdef0a40000 ---p 00003000 08:05 22154769 
/home/oscar/ompi-install/lib/openmpi/mca_coll_self.so
7fdef0a40000-7fdef0a41000 rw-p 00002000 08:05 22154769 
/home/oscar/ompi-install/lib/openmpi/mca_coll_self.so
7fdef0a41000-7fdef0a63000 r-xp 00000000 08:05 22154761 
/home/oscar/ompi-install/lib/openmpi/mca_coll_libnbc.so
7fdef0a63000-7fdef0c63000 ---p 00022000 08:05 22154761 
/home/oscar/ompi-install/lib/openmpi/mca_coll_libnbc.so
7fdef0c63000-7fdef0c64000 rw-p 00022000 08:05 22154761 
/home/oscar/ompi-install/lib/openmpi/mca_coll_libnbc.so
7fdef0c64000-7fdef0c75000 r-xp 00000000 08:05 22154759 
/home/oscar/ompi-install/lib/openmpi/mca_coll_basic.so
7fdef0c75000-7fdef0e75000 ---p 00011000 08:05 22154759 
/home/oscar/ompi-install/lib/openmpi/mca_coll_basic.so
7fdef0e75000-7fdef0e76000 rw-p 00011000 08:05 22154759 
/home/oscar/ompi-install/lib/openmpi/mca_coll_basic.so
7fdef1082000-7fdef10a7000 r-xp 00000000 08:05 22154795 
/home/oscar/ompi-install/lib/openmpi/mca_pml_ob1.so
7fdef10a7000-7fdef12a7000 ---p 00025000 08:05 22154795 
/home/oscar/ompi-install/lib/openmpi/mca_pml_ob1.so
7fdef12a7000-7fdef12a9000 rw-p 00025000 08:05 22154795 
/home/oscar/ompi-install/lib/openmpi/mca_pml_ob1.so
7fdef12b1000-7fdef12b6000 r-xp 00000000 08:05 22154763 
/home/oscar/ompi-install/lib/openmpi/mca_coll_inter.so
7fdef12b6000-7fdef14b5000 ---p 00005000 08:05 22154763 
/home/oscar/ompi-install/lib/openmpi/mca_coll_inter.so
7fdef14b5000-7fdef14b6000 rw-p 00004000 08:05 22154763 
/home/oscar/ompi-install/lib/openmpi/mca_coll_inter.so
7fdef14e0000-7fdef14e4000 r-xp 00000000 08:05 22154799 
/home/oscar/ompi-install/lib/openmpi/mca_pubsub_orte.so
7fdef14e4000-7fdef16e4000 ---p 00004000 08:05 22154799 
/home/oscar/ompi-install/lib/openmpi/mca_pubsub_orte.so
7fdef16e4000-7fdef16e5000 rw-p 00004000 08:05 22154799 
/home/oscar/ompi-install/lib/openmpi/mca_pubsub_orte.so
7fdef16e5000-7fdef16f1000 r-xp 00000000 08:05 22151992 
/home/oscar/ompi-install/lib/openmpi/mca_btl_vader.so
7fdef16f1000-7fdef18f0000 ---p 0000c000 08:05 22151992 
/home/oscar/ompi-install/lib/openmpi/mca_btl_vader.so
7fdef18f0000-7fdef18f3000 rw-p 0000b000 08:05 22151992 
/home/oscar/ompi-install/lib/openmpi/mca_btl_vader.so
[shuttle:06363] *** Process received signal ***
[shuttle:06363] Signal: Abortado (6)
[shuttle:06363] Signal code:  (-6)
[shuttle:06363] [ 0] 
/lib/x86_64-linux-gnu/libpthread.so.0(+0xf0a0)[0x7fdf07b6e0a0]
[shuttle:06363] [ 1] 
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x35)[0x7fdf073f3165]
[shuttle:06363] [ 2] 
/lib/x86_64-linux-gnu/libc.so.6(abort+0x180)[0x7fdf073f63e0]
[shuttle:06363] [ 3] 
/lib/x86_64-linux-gnu/libc.so.6(+0x6d1cb)[0x7fdf0742e1cb]
[shuttle:06363] [ 4] 
/lib/x86_64-linux-gnu/libc.so.6(+0x76a16)[0x7fdf07437a16]
[shuttle:06363] [ 5] 
/lib/x86_64-linux-gnu/libc.so.6(cfree+0x6c)[0x7fdf0743c7bc]
[shuttle:06363] [ 6] 
/usr/lib/jvm/java-7-openjdk-amd64/jre/lib/amd64/server/libjvm.so(+0x2b5d80)[0x7fdf06671d80]
[shuttle:06363] [ 7] 
/usr/lib/jvm/java-7-openjdk-amd64/jre/lib/amd64/server/libjvm.so(+0x8b2ee0)[0x7fdf06c6eee0]
[shuttle:06363] [ 8] 
/usr/lib/jvm/java-7-openjdk-amd64/jre/lib/amd64/server/libjvm.so(+0x8dc4d8)[0x7fdf06c984d8]
[shuttle:06363] [ 9] 
/usr/lib/jvm/java-7-openjdk-amd64/jre/lib/amd64/server/libjvm.so(+0x7b04d2)[0x7fdf06b6c4d2]
[shuttle:06363] [10] 
/lib/x86_64-linux-gnu/libpthread.so.0(+0x6b50)[0x7fdf07b65b50]
[shuttle:06363] [11] 
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7fdf0749d70d]
[shuttle:06363] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node shuttle exited on 
signal 6 (Aborted).
--------------------------------------------------------------------------

Sometimes the crash is different. A SIGSEGV error appears, but the test continues!

$ mpirun -np 2 java -Xcheck:jni -cp build/classes/ CrashTest
100000
200000
300000
400000
500000
600000
700000
800000
900000
1000000
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f0986a8424e, pid=6789, tid=139678682806016
#
# JRE version: OpenJDK Runtime Environment (7.0_75-b13) (build 1.7.0_75-b13)
# Java VM: OpenJDK 64-Bit Server VM (24.75-b04 mixed mode linux-amd64 
compressed oops)
# Derivative: IcedTea 2.5.4
# Distribution: Debian GNU/Linux 7.6 (wheezy), package 7u75-2.5.4-1~deb7u1
# Problematic frame:
# C  [libc.so.6+0x7924e]
#
# Failed to write core dump. Core dumps have been disabled. To enable 
core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /home/oscar/NetBeansProjects/mpi-pruebas/hs_err_pid6789.log
1100000
1200000
1300000
1400000
1500000
1600000
...
goodell commented 9 years ago

You might be able to find the problem with a malloc debugger (like DUMA, a fork of Electric Fence: http://duma.sourceforge.net/).

osvegis commented 9 years ago

I get the following error:

$ export LD_PRELOAD=libduma.so.0.0.0

$ mpirun -np 2 duma java -cp ~/ompi-install/lib/mpi.jar:. CrashTest

DUMA 2.5.15 (shared library, NO_LEAKDETECTION) Copyright (C) 2006 Michael Eddington meddington@gmail.com Copyright (C) 2002-2008 Hayati Ayguen h_ayguen@web.de, Procitec GmbH Copyright (C) 1987-1999 Bruce Perens bruce@perens.com

DUMA 2.5.15 (shared library, NO_LEAKDETECTION) Copyright (C) 2006 Michael Eddington meddington@gmail.com Copyright (C) 2002-2008 Hayati Ayguen h_ayguen@web.de, Procitec GmbH Copyright (C) 1987-1999 Bruce Perens bruce@perens.com

DUMA Aborting: malloc() is not bound to duma. DUMA Aborting: Preload lib with 'LD_PRELOAD=libduma.so '.

DUMA 2.5.15 (shared library, NO_LEAKDETECTION) Copyright (C) 2006 Michael Eddington meddington@gmail.com Copyright (C) 2002-2008 Hayati Ayguen h_ayguen@web.de, Procitec GmbH Copyright (C) 1987-1999 Bruce Perens bruce@perens.com

DUMA Aborting: malloc() is not bound to duma. DUMA Aborting: Preload lib with 'LD_PRELOAD=libduma.so '.

goodell commented 9 years ago

I think OMPI's malloc hooks for memory registration are getting in the way. Try also setting export OMPI_MCA_memory_linux_disable=1 in your environment. This MCA parameter must be set as an environment variable; it will not work if set by some other mechanism.

osvegis commented 9 years ago

I get the same error. Maybe Java uses its own malloc, so DUMA cannot intercept it.

jsquyres commented 9 years ago

Hmm. How about trying the same thing with a pure Java program (i.e., non-MPI)? That would tell us if Java is interposing its own malloc.

osvegis commented 9 years ago

The same error.

jsquyres commented 9 years ago

Is there a way to tell if this same error happens with multiple different versions of the JVM?

osvegis commented 9 years ago

I have 1.7.0_75. I remember that Alexander had 1.8.0_25.

jsquyres commented 9 years ago

On my OSX machine:

$ java -version
java version "1.7.0_45"
Java(TM) SE Runtime Environment (build 1.7.0_45-b18)
Java HotSpot(TM) 64-Bit Server VM (build 24.45-b08, mixed mode)

On my Linux machine:

$ java -version
java version "1.6.0_32"
OpenJDK Runtime Environment (IcedTea6 1.13.4) (rhel-6.1.13.4.el6_5-x86_64)
OpenJDK 64-Bit Server VM (build 23.25-b01, mixed mode)
jsquyres commented 9 years ago

Just curious: does the same problem happen if you only use the TCP BTL (i.e., not the shared memory BTL)?

$ mpirun --mca btl tcp,self ...
osvegis commented 9 years ago

I thought it worked ok, but after 4 attempts:

$ mpirun -np 2 --mca btl tcp,self java -cp build/classes/ CrashTest
100000
200000
300000
400000
*** glibc detected *** java: corrupted double-linked list: 
0x0000000002585b00 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x76a16)[0x7fa7a55a1a16]
/lib/x86_64-linux-gnu/libc.so.6(+0x76e8d)[0x7fa7a55a1e8d]
/lib/x86_64-linux-gnu/libc.so.6(+0x79174)[0x7fa7a55a4174]
/lib/x86_64-linux-gnu/libc.so.6(__libc_malloc+0x70)[0x7fa7a55a68a0]
/home/oscar/ompi-install/lib/libopen-pal.so.0(opal_malloc+0x5e)[0x7fa79835ec79]
/home/oscar/ompi-install/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_seg_alloc+0x2f)[0x7fa7937208da]
/home/oscar/ompi-install/lib/openmpi/mca_allocator_bucket.so(mca_allocator_bucket_alloc+0x114)[0x7fa794c2b5b3]
/home/oscar/ompi-install/lib/openmpi/mca_allocator_bucket.so(mca_allocator_bucket_alloc_wrapper+0x36)[0x7fa794c2b1aa]
/home/oscar/ompi-install/lib/openmpi/mca_pml_ob1.so(+0x11ea9)[0x7fa793729ea9]
/home/oscar/ompi-install/lib/openmpi/mca_pml_ob1.so(+0x12c33)[0x7fa79372ac33]
/home/oscar/ompi-install/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_recv_frag_callback_match+0x1b8)[0x7fa79372a11b]
/home/oscar/ompi-install/lib/openmpi/mca_btl_tcp.so(+0xa4e6)[0x7fa793d854e6]
/home/oscar/ompi-install/lib/libopen-pal.so.0(+0x9cd3e)[0x7fa798372d3e]
/home/oscar/ompi-install/lib/libopen-pal.so.0(+0x9ce4d)[0x7fa798372e4d]
/home/oscar/ompi-install/lib/libopen-pal.so.0(+0x9d11c)[0x7fa79837311c]
/home/oscar/ompi-install/lib/libopen-pal.so.0(opal_libevent2022_event_base_loop+0x2ab)[0x7fa798373783]
/home/oscar/ompi-install/lib/libopen-pal.so.0(opal_progress+0x88)[0x7fa79830cea1]
/home/oscar/ompi-install/lib/libmpi.so.0(+0x50b5b)[0x7fa79894db5b]
/home/oscar/ompi-install/lib/libmpi.so.0(+0x50c00)[0x7fa79894dc00]
/home/oscar/ompi-install/lib/libmpi.so.0(ompi_request_default_wait+0x27)[0x7fa79894dc50]
/home/oscar/ompi-install/lib/libmpi.so.0(PMPI_Wait+0x130)[0x7fa7989a4627]
/home/oscar/ompi-install/lib/libmpi_java.so.0.0.0(Java_mpi_Request_waitFor+0x2d)[0x7fa798c6f367]
[0x7fa79fd6d088]
======= Memory map: ========
00400000-00401000 r-xp 00000000 08:05 790211 
/usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java
00600000-00601000 r--p 00000000 08:05 790211 
/usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java
00601000-00602000 rw-p 00001000 08:05 790211 
/usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java
02101000-026f3000 rw-p 00000000 00:00 0                                  
[heap]
77a200000-77b700000 rw-p 00000000 00:00 0
77b700000-784800000 rw-p 00000000 00:00 0
784800000-789a00000 rw-p 00000000 00:00 0
789a00000-7d6d00000 rw-p 00000000 00:00 0
7d6d00000-7f0100000 rw-p 00000000 00:00 0
7f0100000-7f0280000 ---p 00000000 00:00 0
7f0280000-800000000 rw-p 00000000 00:00 0
7fa78c000000-7fa78c021000 rw-p 00000000 00:00 0
7fa78c021000-7fa790000000 ---p 00000000 00:00 0
7fa79246a000-7fa792476000 r-xp 00000000 08:05 22154771 
/home/oscar/ompi-install/lib/openmpi/mca_dpm_orte.so
7fa792476000-7fa792676000 ---p 0000c000 08:05 22154771 
/home/oscar/ompi-install/lib/openmpi/mca_dpm_orte.so
7fa792676000-7fa792677000 rw-p 0000c000 08:05 22154771 
/home/oscar/ompi-install/lib/openmpi/mca_dpm_orte.so
7fa792677000-7fa79267e000 r-xp 00000000 08:05 22154789 
/home/oscar/ompi-install/lib/openmpi/mca_osc_sm.so
7fa79267e000-7fa79287e000 ---p 00007000 08:05 22154789 
/home/oscar/ompi-install/lib/openmpi/mca_osc_sm.so
7fa79287e000-7fa792880000 rw-p 00007000 08:05 22154789 
/home/oscar/ompi-install/lib/openmpi/mca_osc_sm.so
7fa792880000-7fa79289d000 r-xp 00000000 08:05 22154791 
/home/oscar/ompi-install/lib/openmpi/mca_osc_pt2pt.so
7fa79289d000-7fa792a9d000 ---p 0001d000 08:05 22154791 
/home/oscar/ompi-install/lib/openmpi/mca_osc_pt2pt.so
7fa792a9d000-7fa792a9f000 rw-p 0001d000 08:05 22154791 
/home/oscar/ompi-install/lib/openmpi/mca_osc_pt2pt.so
7fa792a9f000-7fa792aca000 r-xp 00000000 08:05 22154757 
/home/oscar/ompi-install/lib/openmpi/mca_coll_tuned.so
7fa792aca000-7fa792cca000 ---p 0002b000 08:05 22154757 
/home/oscar/ompi-install/lib/openmpi/mca_coll_tuned.so
7fa792cca000-7fa792ccb000 rw-p 0002b000 08:05 22154757 
/home/oscar/ompi-install/lib/openmpi/mca_coll_tuned.so
7fa792ccb000-7fa792ccc000 rw-p 00000000 00:00 0
7fa792ccc000-7fa792cd4000 r-xp 00000000 08:05 22154767 
/home/oscar/ompi-install/lib/openmpi/mca_coll_sm.so
7fa792cd4000-7fa792ed3000 ---p 00008000 08:05 22154767 
/home/oscar/ompi-install/lib/openmpi/mca_coll_sm.so
7fa792ed3000-7fa792ed4000 rw-p 00007000 08:05 22154767 
/home/oscar/ompi-install/lib/openmpi/mca_coll_sm.so
7fa792ed4000-7fa792ed7000 r-xp 00000000 08:05 22154769 
/home/oscar/ompi-install/lib/openmpi/mca_coll_self.so
7fa792ed7000-7fa7930d6000 ---p 00003000 08:05 22154769 
/home/oscar/ompi-install/lib/openmpi/mca_coll_self.so
7fa7930d6000-7fa7930d7000 rw-p 00002000 08:05 22154769 
/home/oscar/ompi-install/lib/openmpi/mca_coll_self.so
7fa7930d7000-7fa7930f9000 r-xp 00000000 08:05 22154761 
/home/oscar/ompi-install/lib/openmpi/mca_coll_libnbc.so
7fa7930f9000-7fa7932f9000 ---p 00022000 08:05 22154761 
/home/oscar/ompi-install/lib/openmpi/mca_coll_libnbc.so
7fa7932f9000-7fa7932fa000 rw-p 00022000 08:05 22154761 
/home/oscar/ompi-install/lib/openmpi/mca_coll_libnbc.so
7fa7932fa000-7fa79330b000 r-xp 00000000 08:05 22154759 
/home/oscar/ompi-install/lib/openmpi/mca_coll_basic.so
7fa79330b000-7fa79350b000 ---p 00011000 08:05 22154759 
/home/oscar/ompi-install/lib/openmpi/mca_coll_basic.so
7fa79350b000-7fa79350c000 rw-p 00011000 08:05 22154759 
/home/oscar/ompi-install/lib/openmpi/mca_coll_basic.so
7fa793718000-7fa79373d000 r-xp 00000000 08:05 22154795 
/home/oscar/ompi-install/lib/openmpi/mca_pml_ob1.so
7fa79373d000-7fa79393d000 ---p 00025000 08:05 22154795 
/home/oscar/ompi-install/lib/openmpi/mca_pml_ob1.so
7fa79393d000-7fa79393f000 rw-p 00025000 08:05 22154795 
/home/oscar/ompi-install/lib/openmpi/mca_pml_ob1.so
7fa793947000-7fa79394c000 r-xp 00000000 08:05 22154763 
/home/oscar/ompi-install/lib/openmpi/mca_coll_inter.so
7fa79394c000-7fa793b4b000 ---p 00005000 08:05 22154763 
/home/oscar/ompi-install/lib/openmpi/mca_coll_inter.so
7fa793b4b000-7fa793b4c000 rw-p 00004000 08:05 22154763 
/home/oscar/ompi-install/lib/openmpi/mca_coll_inter.so
7fa793b76000-7fa793b7a000 r-xp 00000000 08:05 22154799 
/home/oscar/ompi-install/lib/openmpi/mca_pubsub_orte.so
7fa793b7a000-7fa793d7a000 ---p 00004000 08:05 22154799 
/home/oscar/ompi-install/lib/openmpi/mca_pubsub_orte.so
7fa793d7a000-7fa793d7b000 rw-p 00004000 08:05 22154799 
/home/oscar/ompi-install/lib/openmpi/mca_pubsub_orte.so
7fa793d7b000-7fa793d8d000 r-xp 00000000 08:05 22152437 
/home/oscar/ompi-install/lib/openmpi/mca_btl_tcp.so
7fa793d8d000-7fa793f8c000 ---p 00012000 08:05 22152437 
/home/oscar/ompi-install/lib/openmpi/mca_btl_tcp.so
7fa793f8c000-7fa793f8e000 rw-p 00011000 08:05 22152437 
/home/oscar/ompi-install/lib/openmpi/mca_btl_tcp.so
7fa793f8e000-7fa79400e000 rw-p 00000000 00:00 0
7fa79400e000-7fa794012000 r-xp 00000000 08:05 22151949 
/home/oscar/ompi-install/lib/openmpi/mca_btl_self.so
7fa794012000-7fa794212000 ---p 00004000 08:05 22151949 
/home/oscar/ompi-install/lib/openmpi/mca_btl_self.so
7fa794212000-7fa794213000 rw-p 00004000 08:05 22151949 
/home/oscar/ompi-install/lib/openmpi/mca_btl_self.so
7fa794213000-7fa794218000 r-xp 00000000 08:05 22154755 
/home/oscar/ompi-install/lib/openmpi/mca_bml_r2.so
7fa794218000-7fa794418000 ---p 00005000 08:05 22154755 
/home/oscar/ompi-install/lib/openmpi/mca_bml_r2.so
7fa794418000-7fa794419000 rw-p 00005000 08:05 22154755 
/home/oscar/ompi-install/lib/openmpi/mca_bml_r2.so
7fa794419000-7fa79441b000 r-xp 00000000 08:05 21634332 
/home/oscar/ompi-install/lib/libmca_common_sm.so.0.0.0
7fa79441b000-7fa79461a000 ---p 00002000 08:05 21634332 
/home/oscar/ompi-install/lib/libmca_common_sm.so.0.0.0
7fa79461a000-7fa79461b000 rw-p 00001000 08:05 21634332 
/home/oscar/ompi-install/lib/libmca_common_sm.so.0.0.0
7fa79461b000-7fa79461d000 r-xp 00000000 08:05 22153277 
/home/oscar/ompi-install/lib/openmpi/mca_mpool_sm.so
7fa79461d000-7fa79481d000 ---p 00002000 08:05 22153277 
/home/oscar/ompi-install/lib/openmpi/mca_mpool_sm.so
7fa79481d000-7fa79481e000 rw-p 00002000 08:05 22153277 
/home/oscar/ompi-install/lib/openmpi/mca_mpool_sm.so
7fa79481e000-7fa794823000 r-xp 00000000 08:05 22153263 
/home/oscar/ompi-install/lib/openmpi/mca_mpool_grdma.so
7fa794823000-7fa794a23000 ---p 00005000 08:05 22153263 
/home/oscar/ompi-install/lib/openmpi/mca_mpool_grdma.so
7fa794a23000-7fa794a24000 rw-p 00005000 08:05 22153263 
/home/oscar/ompi-install/lib/openmpi/mca_mpool_grdma.so
7fa794a24000-7fa794a2a000 r-xp 00000000 08:05 22153303 
/home/oscar/ompi-install/lib/openmpi/mca_rcache_vma.so
7fa794a2a000-7fa794c29000 ---p 00006000 08:05 22153303 
/home/oscar/ompi-install/lib/openmpi/mca_rcache_vma.so
7fa794c29000-7fa794c2a000 rw-p 00005000 08:05 22153303 
/home/oscar/ompi-install/lib/openmpi/mca_rcache_vma.so
7fa794c2a000-7fa794c2d000 r-xp 00000000 08:05 22151834 
/home/oscar/ompi-install/lib/openmpi/mca_allocator_bucket.so
7fa794c2d000-7fa794e2c000 ---p 00003000 08:05 22151834 
/home/oscar/ompi-install/lib/openmpi/mca_allocator_bucket.so
7fa794e2c000-7fa794e2d000 rw-p 00002000 08:05 22151834 
/home/oscar/ompi-install/lib/openmpi/mca_allocator_bucket.so
7fa794e2d000-7fa794e33000 r-xp 00000000 08:05 22154224 
/home/oscar/ompi-install/lib/openmpi/mca_routed_radix.so
7fa794e33000-7fa795032000 ---p 00006000 08:05 22154224 
/home/oscar/ompi-install/lib/openmpi/mca_routed_radix.so
7fa795032000-7fa795033000 rw-p 00005000 08:05 22154224 
/home/oscar/ompi-install/lib/openmpi/mca_routed_radix.so
7fa795033000-7fa795037000 r-xp 00000000 08:05 22154129 
/home/oscar/ompi-install/lib/openmpi/mca_grpcomm_rcd.so
7fa795037000-7fa795237000 ---p 00004000 08:05 22154129 
/home/oscar/ompi-install/lib/openmpi/mca_grpcomm_rcd.so
7fa795237000-7fa795238000 rw-p 00004000 08:05 22154129 
/home/oscar/ompi-install/lib/openmpi/mca_grpcomm_rcd.so
7fa795238000-7fa79523d000 r-xp 00000000 08:05 22154205 
/home/oscar/ompi-install/lib/openmpi/mca_rml_oob.so
7fa79523d000-7fa79543d000 ---p 00005000 08:05 22154205 
/home/oscar/ompi-install/lib/openmpi/mca_rml_oob.so
7fa79543d000-7fa79543e000 rw-p 00005000 08:05 22154205 
/home/oscar/ompi-install/lib/openmpi/mca_rml_oob.so
7fa79543e000-7fa79544f000 r-xp 00000000 08:05 22154150 
/home/oscar/ompi-install/lib/openmpi/mca_oob_usock.so
7fa79544f000-7fa79564f000 ---p 00011000 08:05 22154150 
/home/oscar/ompi-install/lib/openmpi/mca_oob_usock.so
7fa79564f000-7fa795650000 rw-p 00011000 08:05 22154150 
/home/oscar/ompi-install/lib/openmpi/mca_oob_usock.so
7fa795650000-7fa795669000 r-xp 00000000 08:05 22154148 
/home/oscar/ompi-install/lib/openmpi/mca_oob_tcp.so
7fa795669000-7fa795869000 ---p 00019000 08:05 22154148 
/home/oscar/ompi-install/lib/openmpi/mca_oob_tcp.so
7fa795869000-7fa79586a000 rw-p 00019000 08:05 22154148 
/home/oscar/ompi-install/lib/openmpi/mca_oob_tcp.so
7fa79586b000-7fa79586e000 r-xp 00000000 08:05 22151640 
/home/oscar/ompi-install/lib/openmpi/mca_allocator_basic.so
7fa79586e000-7fa795a6d000 ---p 00003000 08:05 22151640 
/home/oscar/ompi-install/lib/openmpi/mca_allocator_basic.so
7fa795a6d000-7fa795a6e000 rw-p 00002000 08:05 22151640 
/home/oscar/ompi-install/lib/openmpi/mca_allocator_basic.so
7fa795a6e000-7fa795a74000 r-xp 00000000 08:05 22154127 
/home/oscar/ompi-install/lib/openmpi/mca_grpcomm_direct.so
7fa795a74000-7fa795c73000 ---p 00006000 08:05 22154127 
/home/oscar/ompi-install/lib/openmpi/mca_grpcomm_direct.so
7fa795c73000-7fa795c74000 rw-p 00005000 08:05 22154127 
/home/oscar/ompi-install/lib/openmpi/mca_grpcomm_direct.so
7fa795c74000-7fa795c78000 r-xp 00000000 08:05 22154125 
/home/oscar/ompi-install/lib/openmpi/mca_grpcomm_brks.so
7fa795c78000-7fa795e77000 ---p 00004000 08:05 22154125 
/home/oscar/ompi-install/lib/openmpi/mca_grpcomm_brks.so
7fa795e77000-7fa795e78000 rw-p 00003000 08:05 22154125 
/home/oscar/ompi-install/lib/openmpi/mca_grpcomm_brks.so
7fa795e78000-7fa795e7a000 r-xp 00000000 08:05 22154094 
/home/oscar/ompi-install/lib/openmpi/mca_errmgr_default_app.so
7fa795e7a000-7fa79607a000 ---p 00002000 08:05 22154094 
/home/oscar/ompi-install/lib/openmpi/mca_errmgr_default_app.so
7fa79607a000-7fa79607b000 rw-p 00002000 08:05 22154094 
/home/oscar/ompi-install/lib/openmpi/mca_errmgr_default_app.so
7fa79607b000-7fa79607d000 r-xp 00000000 08:05 22154279 
/home/oscar/ompi-install/lib/openmpi/mca_state_app.so
7fa79607d000-7fa79627c000 ---p 00002000 08:05 22154279 
/home/oscar/ompi-install/lib/openmpi/mca_state_app.so
7fa79627c000-7fa79627d000 rw-p 00001000 08:05 22154279 
/home/oscar/ompi-install/lib/openmpi/mca_state_app.so
7fa79627d000-7fa79627e000 ---p 00000000 00:00 0
7fa79627e000-7fa796a7e000 rw-p 00000000 00:00 0
7fa796a7e000-7fa796a7f000 ---p 00000000 00:00 0
7fa796a7f000-7fa79727f000 rw-p 00000000 00:00 0
7fa79727f000-7fa797294000 r-xp 00000000 08:05 22153279 
/home/oscar/ompi-install/lib/openmpi/mca_pmix_native.so
7fa797294000-7fa797494000 ---p 00015000 08:05 22153279 
/home/oscar/ompi-install/lib/openmpi/mca_pmix_native.so
7fa797494000-7fa797495000 rw-p 00015000 08:05 22153279 
/home/oscar/ompi-install/lib/openmpi/mca_pmix_native.so[shuttle:04178] 
*** Process received signal ***
[shuttle:04178] Signal: Abortado (6)
[shuttle:04178] Signal code:  (-6)
[shuttle:04178] [ 0] 
/lib/x86_64-linux-gnu/libpthread.so.0(+0xf0a0)[0x7fa7a5cd80a0]
[shuttle:04178] [ 1] 
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x35)[0x7fa7a555d165]
[shuttle:04178] [ 2] 
/lib/x86_64-linux-gnu/libc.so.6(abort+0x180)[0x7fa7a55603e0]
[shuttle:04178] [ 3] 
/lib/x86_64-linux-gnu/libc.so.6(+0x6d1cb)[0x7fa7a55981cb]
[shuttle:04178] [ 4] 
/lib/x86_64-linux-gnu/libc.so.6(+0x76a16)[0x7fa7a55a1a16]
[shuttle:04178] [ 5] 
/lib/x86_64-linux-gnu/libc.so.6(+0x76e8d)[0x7fa7a55a1e8d]
[shuttle:04178] [ 6] 
/lib/x86_64-linux-gnu/libc.so.6(+0x79174)[0x7fa7a55a4174]
[shuttle:04178] [ 7] 
/lib/x86_64-linux-gnu/libc.so.6(__libc_malloc+0x70)[0x7fa7a55a68a0]
[shuttle:04178] [ 8] 
/home/oscar/ompi-install/lib/libopen-pal.so.0(opal_malloc+0x5e)[0x7fa79835ec79]
[shuttle:04178] [ 9] 
/home/oscar/ompi-install/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_seg_alloc+0x2f)[0x7fa7937208da]
[shuttle:04178] [10] 
/home/oscar/ompi-install/lib/openmpi/mca_allocator_bucket.so(mca_allocator_bucket_alloc+0x114)[0x7fa794c2b5b3]
[shuttle:04178] [11] 
/home/oscar/ompi-install/lib/openmpi/mca_allocator_bucket.so(mca_allocator_bucket_alloc_wrapper+0x36)[0x7fa794c2b1aa]
[shuttle:04178] [12] 
/home/oscar/ompi-install/lib/openmpi/mca_pml_ob1.so(+0x11ea9)[0x7fa793729ea9]
[shuttle:04178] [13] 
/home/oscar/ompi-install/lib/openmpi/mca_pml_ob1.so(+0x12c33)[0x7fa79372ac33]
[shuttle:04178] [14] 
/home/oscar/ompi-install/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_recv_frag_callback_match+0x1b8)[0x7fa79372a11b]
[shuttle:04178] [15] 
/home/oscar/ompi-install/lib/openmpi/mca_btl_tcp.so(+0xa4e6)[0x7fa793d854e6]
[shuttle:04178] [16] 
/home/oscar/ompi-install/lib/libopen-pal.so.0(+0x9cd3e)[0x7fa798372d3e]
[shuttle:04178] [17] 
/home/oscar/ompi-install/lib/libopen-pal.so.0(+0x9ce4d)[0x7fa798372e4d]
[shuttle:04178] [18] 
/home/oscar/ompi-install/lib/libopen-pal.so.0(+0x9d11c)[0x7fa79837311c]
[shuttle:04178] [19] 
/home/oscar/ompi-install/lib/libopen-pal.so.0(opal_libevent2022_event_base_loop+0x2ab)[0x7fa798373783]
[shuttle:04178] [20] 
/home/oscar/ompi-install/lib/libopen-pal.so.0(opal_progress+0x88)[0x7fa79830cea1]
[shuttle:04178] [21] 
/home/oscar/ompi-install/lib/libmpi.so.0(+0x50b5b)[0x7fa79894db5b]
[shuttle:04178] [22] 
/home/oscar/ompi-install/lib/libmpi.so.0(+0x50c00)[0x7fa79894dc00]
[shuttle:04178] [23] 
/home/oscar/ompi-install/lib/libmpi.so.0(ompi_request_default_wait+0x27)[0x7fa79894dc50]
[shuttle:04178] [24] 
/home/oscar/ompi-install/lib/libmpi.so.0(PMPI_Wait+0x130)[0x7fa7989a4627]
[shuttle:04178] [25] 
/home/oscar/ompi-install/lib/libmpi_java.so.0.0.0(Java_mpi_Request_waitFor+0x2d)[0x7fa798c6f367]
[shuttle:04178] [26] [0x7fa79fd6d088]
[shuttle:04178] *** End of error message ***
[shuttle][[51156,1],0][btl_tcp_frag.c:228:mca_btl_tcp_frag_recv] 
mca_btl_tcp_frag_recv: readv failed: Conexión reinicializada por la 
máquina remota (104)
[shuttle:04177] pml_ob1_sendreq.c:187 FATAL
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node shuttle exited on 
signal 6 (Aborted).
--------------------------------------------------------------------------
jsquyres commented 9 years ago

I'm kinda running out of ideas here -- how do we track this down?

goodell commented 9 years ago

I'd still proceed by trying to get some memory debugger like DUMA, Electric Fence, NJAMD, etc. that will add red zones to all allocations and trap accesses with mprotect() or similar. I'd probably poke at DUMA first, using LD_DEBUG to figure out why the LD_PRELOAD isn't having the expected effect. Something like LD_DEBUG=bindings,reloc,symbols,libs will give you a lot of output, but it is likely to show what is happening.

You could also try running Valgrind on the JVM, but it will probably be tricky unless you have a bit of experience with Valgrind. There's a SO post with some suggestions here: http://stackoverflow.com/questions/9216815/valgrind-and-java

osvegis commented 9 years ago

I tested Valgrind and it only runs when the HotSpot compiler is disabled (-Xint). Valgrind shows a lot of errors until the program begins to operate, but the program doesn't crash and no more errors appear. In fact, the example reported here:

http://www.open-mpi.org/community/lists/users/2015/01/26215.php

also works OK when the HotSpot compiler is disabled (java -Xint ...). So the problem should be related to signals and exception handling:

http://www.open-mpi.org/community/lists/users/2015/02/26318.php

But if I preload libjsig.so, the program still crashes.

goodell commented 9 years ago

The behavior under Valgrind is likely to be different than the regular behavior (e.g., the program may not crash). Valgrind changes quite a bit about how the program operates in some cases. Can you post the full Valgrind output in a gist and we can quickly check to see if anything suspicious is in there?

osvegis commented 9 years ago

Here you have:

https://gist.github.com/osvegis/2a55bb4f78b0bf5b678a

I've used Java 8 because Valgrind works OK with the Java 8 HotSpot. The CrashTest example doesn't crash on Java 8, but the original test that was reported by the user crashes:

http://www.open-mpi.org/community/lists/users/2015/01/26215.php

goodell commented 9 years ago

Sadly, neither of them shows Valgrind detecting any problems. The aborts are still coming from glibc's internal consistency checks, which usually means that Valgrind isn't properly intercepting malloc() and friends. You could go down the path of turning up Valgrind's debug logging that shows symbol interception and dynamic loading, but this may still not work in the end.

I haven't written Java in years now, and I have essentially zero experience getting it to work with Valgrind. Sorry that I don't have any other suggestions at this point.

rhc54 commented 9 years ago

I won't block a release for the Java bindings, as they are optional. Still, I'd like to see it resolved if possible, though I have no brilliant suggestions.

osvegis commented 9 years ago

I think the problem is in signal chaining. Open MPI uses an internal libevent 2.0.22. Maybe this version interferes with the Java signal-chaining facility, so it ends up replacing the Java HotSpot VM's signal handlers. http://docs.oracle.com/javase/8/docs/technotes/guides/vm/signal-chaining.html I tried to compile Open MPI using an external libevent (2.0-5) without success. What do you think?

goodell commented 9 years ago

I'm not sure I understand how libevent would be interfering, exactly, nor do I see how the signal chaining mechanism would cause a SEGV like this. Can you explain your theory of how signal chaining is involved with a bit more detail?

griznog commented 9 years ago

We are seeing similar behavior with a simple test program that just does a multiply and add in a loop. Testing with different JVMs and OpenMPI 1.8.4 and 1.8.5rc3 we see:

Oracle 1.8.0: segfault
Oracle 1.7.0: segfault
OpenJDK 1.8.0: segfault
OpenJDK 1.6.0: works (or hasn't segfaulted yet).

I can post our code and segfault messages, but they have less info than what I have seen here.

hppritcha commented 9 years ago

Hi John

Does the segfault seem to occur after MPI.Finalize() is called in the app? If you would not mind posting the app and the segfault message, that would help.

Howard

griznog commented 9 years ago

Howard,

Where the segfault occurs is JVM-dependent. The IBM 1.8.0 JDK segfaults before the program starts, while the Oracle JDK segfaults after MPI.Init and into the loop of busywork. The test code I am using is:

import mpi.*;

public class MPIHelloWorldLoad {
    public static void main(String[] args) throws MPIException, Exception {
        MPI.Init(args);
        MPI.COMM_WORLD.setErrhandler(MPI.ERRORS_RETURN);
        int size = MPI.COMM_WORLD.getSize();
        int myRank = MPI.COMM_WORLD.getRank();

        System.out.print("Hello world from MPI process " + myRank + " out " +
                "of " + size + " processors.");

        int procTime = 300;
        int count = 0;
        if (args.length > 0) {
            try {
                count = Integer.parseInt(args[0]);
            } catch (NumberFormatException e) {
                System.err.println("Argument" + args[0] + " must be an int.");
                System.exit(1);
            }
        }

        double a = 1.293881;
        long start = 0;
        for (int i = 0; i < count; i++) {
            try {
                start = System.currentTimeMillis();
            } catch (Exception ex) {
                System.out.println(ex);
            }
            a *= i;
            a /= 5;
            if (i % 100000 == 0) {
                System.out.print(".");
            }
        }
        System.out.println("and let's MPI.Finalize this.");
        MPI.Finalize();
    }
}

Running this with different values for the loop's max index:

[hanksj@db814-02-6:lee_chuck]$ mpirun java MPIHelloWorldLoad
Hello world from MPI process 0 out of 1 processors.and let's MPI.Finalize this.
[hanksj@db814-02-6:lee_chuck]$ mpirun java MPIHelloWorldLoad 10
Hello world from MPI process 0 out of 1 processors..and let's MPI.Finalize this.
[hanksj@db814-02-6:lee_chuck]$ mpirun java MPIHelloWorldLoad 100
Hello world from MPI process 0 out of 1 processors..and let's MPI.Finalize this.
[hanksj@db814-02-6:lee_chuck]$ mpirun java MPIHelloWorldLoad 1000
Hello world from MPI process 0 out of 1 processors..and let's MPI.Finalize this.
[hanksj@db814-02-6:lee_chuck]$ mpirun java MPIHelloWorldLoad 10000
Hello world from MPI process 0 out of 1 processors..and let's MPI.Finalize this.
[hanksj@db814-02-6:lee_chuck]$ mpirun java MPIHelloWorldLoad 100000
Hello world from MPI process 0 out of 1 processors..and let's MPI.Finalize this.
[hanksj@db814-02-6:lee_chuck]$ mpirun java MPIHelloWorldLoad 1000000
Hello world from MPI process 0 out of 1 processors...........and let's MPI.Finalize this.
[hanksj@db814-02-6:lee_chuck]$ mpirun java MPIHelloWorldLoad 10000000
Hello world from MPI process 0 out of 1 processors......................................................--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 28071 on node db814-02-6 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

However, if I repeat this many times it doesn't always segfault for any particular value. All the length of the loop seems to do is keep things busy long enough for a more or less random segfault to occur.

The backtrace for the core file from the above segfault is:

Core was generated by `java -cp /home/hanksj/lee_chuck:/home/hanksj/Applications/software/openmpi/v1.8'.
Program terminated with signal 11, Segmentation fault.
#0  0x00002b2f19e9b1d0 in ?? ()
(gdb) bt
#0  0x00002b2f19e9b1d0 in ?? ()
#1  <signal handler called>
#2  0x00002b2df86177f3 in ?? ()
#3  0x0000000000000000 in ?? ()
(gdb)

jbh

jsquyres commented 9 years ago

@hppritcha Do you think you'll be able to have this fixed by v1.10.0?

hppritcha commented 9 years ago

I'll give it a shot.


Howard

ggouaillardet commented 9 years ago

Per http://www.open-mpi.org/community/lists/users/2015/08/27465.php, I could get rid of the problem by not using the PSM MTL: mpirun --mca mtl ^psm -np 2 java MPITestBroke

ggouaillardet commented 9 years ago

The root cause is that recvBuf is freed by the garbage collector in the middle of the main loop. I will fix that tomorrow by having the Java persistent request keep a reference to the Java buffer. As a workaround, simply pass recvBuf to a dummy subroutine before MPI.Finalize().
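For example (the helper name is made up; any use of recvBuf after the loop keeps the reference alive):

    // Dummy subroutine added to CrashTest: its only purpose is to keep a
    // live reference to the buffer, so the garbage collector cannot
    // reclaim the direct buffer while the persistent request still uses it.
    private static void touch(ByteBuffer buf)
    {
        // intentionally empty
    }

and at the end of main(), before shutting down:

    touch(recvBuf);
    MPI.Finalize();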

hppritcha commented 9 years ago

Hi Gilles,

I hope this is not true, or that I don't understand what you mean. The recvBuf in the example I'm working with is allocated with the MPI newIntBuffer method. This is supposed to allocate a direct byte buffer.

My understanding of the way a direct byte buffer is supposed to work is that it allocates memory out of a heap that is not eligible for cleanup by the GC the way normal allocations are.

Almost all of the Open MPI Java bindings are predicated on this assumption. Except for blocking send/recv, one is required to use these direct byte buffers.

But maybe I'm misunderstanding what you mean?

Howard

ggouaillardet commented 9 years ago

@hppritcha I think you understood correctly what I meant.

Per the Java doc:

The contents of direct buffers may reside outside of the normal garbage-collected heap

so unless I am misreading the English (which happens quite often...), this is equivalent to:

The contents of direct buffers may reside inside of the normal garbage-collected heap,
and might be freed by the garbage collector when there are no more references to the
direct buffer

I made PR #815 and it solves the issue for me, as long as I run with --mca mtl ^psm, but that is a different story. Strictly speaking, we could consider this a bug on the user side, but since it is very tricky to debug, I'd rather have OMPI handle this case transparently from the end user's point of view.

Generally speaking, I am wondering whether this fix is enough. For example, could the garbage collector free a buffer after a call to MPI_Isend and before the message is sent? The same goes for MPI_Irecv (even if it is generally dumb to receive a message and never check its content...).
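A pattern like this could in principle fail (hypothetical sketch, using SIZE as in the CrashTest above; whether it actually does depends on when the garbage collector runs):

    // No Java reference to the send buffer survives the iSend() call, so
    // the GC is free to reclaim the direct buffer before the message has
    // actually been transmitted.
    Request r = MPI.COMM_WORLD.iSend(MPI.newByteBuffer(SIZE), SIZE, MPI.BYTE, 1, 0);
    r.waitFor();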

Bottom line: I think we might have to keep a reference to the buffer in the Request class instead of the Prequest class.

Any ideas?

In the meantime, can you confirm the test runs just fine on hopper with this PR?

osvegis commented 9 years ago

Thanks Gilles. I tested PR #815 and it also solves the issue for me. I think it is better to hold a reference to the buffer in the Request class.

jsquyres commented 9 years ago

@ggouaillardet your explanation makes sense to me, but as @osvegis can attest, I am far from a Java expert. :smile:

hppritcha commented 9 years ago

I think what we will probably do is rewrite the direct buffer allocator methods in the MPI class to actually use a native method to create the buffer object. That, plus using the NewGlobalRef method, should prevent the GC from moving/deleting any of the buffers allocated for MPI calls. That hopefully will also solve some of the problems we have been seeing trying to reproduce @osvegis's results in the OMPI Java paper.

osvegis commented 9 years ago

Direct buffers are never moved. They reside outside the Java heap, but when there are no references left to them they are destroyed by the GC. If we keep a reference in the Request class, the buffer won't be destroyed. I think @ggouaillardet's suggestion is better.

nrgraham23 commented 9 years ago

I am going to work on a solution that solves the problem by allocating the buffers on the C side, as @hppritcha suggested, to see if this also fixes the poor performance we have been seeing. I'll create a PR with the changes when I complete them.

osvegis commented 9 years ago

@hppritcha and @nrgraham23, you are wrong. The only difference you'll get is that users will have to deallocate direct buffers manually. Java already allocates the buffers natively, as you want.

hppritcha commented 9 years ago

Okay, I don't want to use the patch from Gilles as is. It is only a band-aid for one test. There are many more places, basically all the non-blocking pt2pt and collectives, where the same problem effectively exists. Also, for some of the functions, like iGatherv, there are two buffers associated with the request. I think we've just been getting lucky with the way the Java bindings have been used that we haven't seen this problem elsewhere.

nrgraham23 commented 9 years ago

Howard and I discussed a more robust solution that uses the idea Gilles suggested. We plan to use an array list to store the buffers so we can store a variable number of buffers, and we will move it to the Request class so the non-blocking pt2pt and collective operations can store the buffers as well. Additionally, instead of modifying or making new constructors, we will add a single method that will need to be called to add the buffers to the array list.
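Something along these lines (a rough sketch only; the names are placeholders, not the final PR):

    // Inside mpi.Request (sketch): buffers referenced here stay reachable
    // until the request completes, so the GC cannot free a direct buffer
    // that the native layer is still using.
    private final java.util.ArrayList<java.nio.Buffer> buffers =
            new java.util.ArrayList<java.nio.Buffer>();

    // Single method the non-blocking pt2pt and collective bindings call to
    // register however many buffers the operation uses (e.g. the two
    // buffers of an iGatherv).
    protected void addBuffers(java.nio.Buffer... bufs)
    {
        for (java.nio.Buffer b : bufs) {
            if (b != null)
                buffers.add(b);
        }
    }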

I'll try to get this PR up tonight so it can be discussed.