open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

COMM_SPAWN broken on Solaris/v1.10 #1569

Closed jsquyres closed 7 years ago

jsquyres commented 8 years ago

@siegmargross posted that COMM_SPAWN tests on master on Solaris/SPARC are broken in http://www.open-mpi.org/community/lists/users/2016/04/28984.php.

Here are the test programs (I had to add .txt to the filenames to upload them to this GitHub issue, sorry):
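The attached programs themselves are not inlined in this issue. Based on the output below ("Parent process 0 ... I create 4 slave processes"), spawn_master presumably looks roughly like the following sketch; this is a hypothetical reconstruction, not the actual attachment, and the command name "spawn_slave" is taken from the backtraces below.

```c
/* Hypothetical reconstruction of spawn_master (the actual attached
 * program was not inlined in this issue). */
#include <stdio.h>
#include <mpi.h>

#define NUM_SLAVES 4  /* matches "I create 4 slave processes" in the output */

int main(int argc, char *argv[])
{
    int rank, len;
    char name[MPI_MAX_PROCESSOR_NAME];
    MPI_Comm intercomm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(name, &len);
    printf("Parent process %d running on %s\n I create %d slave processes\n",
           rank, name, NUM_SLAVES);

    /* Spawn the slaves; per the backtraces below, the assertion in
     * ompi_group_increment_proc_count fires inside the slaves' MPI_Init. */
    MPI_Comm_spawn("spawn_slave", MPI_ARGV_NULL, NUM_SLAVES,
                   MPI_INFO_NULL, 0, MPI_COMM_WORLD, &intercomm,
                   MPI_ERRCODES_IGNORE);

    MPI_Comm_free(&intercomm);
    MPI_Finalize();
    return 0;
}
```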


Hi,

I have built openmpi-v1.10.2-142-g5cd9490 on my machines (Solaris 10 Sparc, Solaris 10 x86_64, and openSUSE Linux 12.1 x86_64) with gcc-5.1.0 and Sun C 5.13. Unfortunately I get runtime errors for some programs.

With Sun C 5.13:

tyr spawn 116 mpiexec -np 1 --host tyr,sunpc1,linpc1,linpc1,ruester spawn_master

Parent process 0 running on tyr.informatik.hs-fulda.de
 I create 4 slave processes

Assertion failed: OPAL_OBJ_MAGIC_ID == ((opal_object_t *) (proc_pointer))->obj_magic_id, file ../../openmpi-v1.10.2-142-g5cd9490/ompi/group/group_init.c, line 215, function ompi_group_increment_proc_count
[ruester:10077] *** Process received signal ***
[ruester:10077] Signal: Abort (6)
[ruester:10077] Signal code:  (-1)
/usr/local/openmpi-1.10.3_64_cc/lib64/libopen-pal.so.13.0.2:opal_backtrace_print+0x1c
/usr/local/openmpi-1.10.3_64_cc/lib64/libopen-pal.so.13.0.2:0x1b10f0
/lib/sparcv9/libc.so.1:0xd8c28
/lib/sparcv9/libc.so.1:0xcc79c
/lib/sparcv9/libc.so.1:0xcc9a8
/lib/sparcv9/libc.so.1:__lwp_kill+0x8 [ Signal 2091943080 (?)]
/lib/sparcv9/libc.so.1:abort+0xd0
/lib/sparcv9/libc.so.1:_assert_c99+0x78
/usr/local/openmpi-1.10.3_64_cc/lib64/libmpi.so.12.0.3:ompi_group_increment_proc_count+0x10c
/usr/local/openmpi-1.10.3_64_cc/lib64/openmpi/mca_dpm_orte.so:0xe758
/usr/local/openmpi-1.10.3_64_cc/lib64/openmpi/mca_dpm_orte.so:0x113d4
/usr/local/openmpi-1.10.3_64_cc/lib64/libmpi.so.12.0.3:ompi_mpi_init+0x188c
/usr/local/openmpi-1.10.3_64_cc/lib64/libmpi.so.12.0.3:MPI_Init+0x26c
/home/fd1026/SunOS/sparc/bin/spawn_slave:main+0x18
/home/fd1026/SunOS/sparc/bin/spawn_slave:_start+0x108
[ruester:10077] *** End of error message ***
--------------------------------------------------------------------------
mpiexec noticed that process rank 3 with PID 0 on node ruester exited on signal 6 (Abort).
--------------------------------------------------------------------------

With GCC-5.1.0:

tyr spawn 129 mpiexec -np 1 --host ruester,ruester,sunpc1,linpc1,linpc1 spawn_master

Parent process 0 running on ruester.informatik.hs-fulda.de
 I create 4 slave processes

[ruester.informatik.hs-fulda.de:09823] [[60617,1],0] ORTE_ERROR_LOG: Unreachable in file ../../../../../openmpi-v1.10.2-142-g5cd9490/ompi/mca/dpm/orte/dpm_orte.c at line 523
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

 Process 1 ([[60617,1],0]) is on host: ruester
 Process 2 ([[0,0],0]) is on host: unknown!
 BTLs attempted: tcp self

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
[ruester:9823] *** An error occurred in MPI_Comm_spawn
[ruester:9823] *** reported by process [3972595713,0]
[ruester:9823] *** on communicator MPI_COMM_WORLD
[ruester:9823] *** MPI_ERR_INTERN: internal error
[ruester:9823] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[ruester:9823] ***    and potentially your MPI job)
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

 Process name: [[60617,1],0]
 Exit code:    17
--------------------------------------------------------------------------
tyr spawn 130
tyr spawn 133 mpiexec -np 1 --host tyr,sunpc1,linpc1,ruester spawn_multiple_master

Parent process 0 running on tyr.informatik.hs-fulda.de
 I create 3 slave processes.

Assertion failed: OPAL_OBJ_MAGIC_ID == ((opal_object_t *) (proc_pointer))->obj_magic_id, file ../../openmpi-v1.10.2-142-g5cd9490/ompi/group/group_init.c, line 215, function ompi_group_increment_proc_count
[ruester:09954] *** Process received signal ***
[ruester:09954] Signal: Abort (6)
[ruester:09954] Signal code:  (-1)
/usr/local/openmpi-1.10.3_64_gcc/lib64/libopen-pal.so.13.0.2:opal_backtrace_print+0x2c
/usr/local/openmpi-1.10.3_64_gcc/lib64/libopen-pal.so.13.0.2:0xc2c0c
/lib/sparcv9/libc.so.1:0xd8c28
/lib/sparcv9/libc.so.1:0xcc79c
/lib/sparcv9/libc.so.1:0xcc9a8
/lib/sparcv9/libc.so.1:__lwp_kill+0x8 [ Signal 6 (ABRT)]
/lib/sparcv9/libc.so.1:abort+0xd0
/lib/sparcv9/libc.so.1:_assert_c99+0x78
/usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12.0.3:ompi_group_increment_proc_count+0xf0
/usr/local/openmpi-1.10.3_64_gcc/lib64/openmpi/mca_dpm_orte.so:0x6638
/usr/local/openmpi-1.10.3_64_gcc/lib64/openmpi/mca_dpm_orte.so:0x948c
/usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12.0.3:ompi_mpi_init+0x1978
/usr/local/openmpi-1.10.3_64_gcc/lib64/libmpi.so.12.0.3:MPI_Init+0x2a8
/home/fd1026/SunOS/sparc/bin/spawn_slave:main+0x10
/home/fd1026/SunOS/sparc/bin/spawn_slave:_start+0x7c
[ruester:09954] *** End of error message ***
--------------------------------------------------------------------------
mpiexec noticed that process rank 2 with PID 0 on node ruester exited on signal 6 (Abort).
--------------------------------------------------------------------------
tyr spawn 134
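For the spawn_multiple_master run above, the program presumably uses MPI_Comm_spawn_multiple; again a hypothetical sketch (the attachment was not inlined), with the 1+2 split of the 3 slaves an assumption:

```c
/* Hypothetical sketch of spawn_multiple_master (not the actual attached
 * program), spawning 3 slaves via MPI_Comm_spawn_multiple. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_Comm intercomm;
    char    *cmds[2]  = { "spawn_slave", "spawn_slave" };
    int      procs[2] = { 1, 2 };  /* 3 slave processes total (split assumed) */
    MPI_Info infos[2] = { MPI_INFO_NULL, MPI_INFO_NULL };

    MPI_Init(&argc, &argv);
    /* The backtrace shows the same ompi_group_increment_proc_count
     * assertion in the slaves' MPI_Init, so both spawn and
     * spawn_multiple appear to hit the same group-init code path. */
    MPI_Comm_spawn_multiple(2, cmds, MPI_ARGVS_NULL, procs, infos,
                            0, MPI_COMM_WORLD, &intercomm,
                            MPI_ERRCODES_IGNORE);
    MPI_Comm_free(&intercomm);
    MPI_Finalize();
    return 0;
}
```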
rhc54 commented 8 years ago

@ggouaillardet Can you try to replicate? I have no access to this environment.

ggouaillardet commented 8 years ago

Will do. FWIW, ompi is configure'd with --enable-heterogeneous; this is very lightly tested on the v1.10 branch, especially on a heterogeneous cluster. I have my own branch that fixes it all (so far), but for master only. I hope I can find the time to clean it up, push it, and PR it to v2.x.
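For reference, the configuration reported to trigger this is along these lines (the prefix is taken from the install paths in the backtraces; other flags may have been used as well):

```shell
# Heterogeneous support is what exercises the lightly-tested code path
./configure --enable-heterogeneous --prefix=/usr/local/openmpi-1.10.3_64_gcc
```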

rhc54 commented 8 years ago

@ggouaillardet Where does this stand?

jsquyres commented 7 years ago

@ggouaillardet ping

ggouaillardet commented 7 years ago

I will try to find some time for it. Note that I do not have Solaris/SPARC (but I do have Solaris/x86_64 and Linux/SPARC).

rhc54 commented 7 years ago

@ggouaillardet Did your fix ever get into master? @siegmargross is reporting a similar problem on v2.0, so I'm guessing there is a common problem across all the releases that relates to hetero clusters.

rhc54 commented 7 years ago

This has either been fixed (and the issue not closed), or it will not be fixed for the 1.10 series.