open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

shmem_alltoall32() segfaulted when test run on two nodes #8120

Open thanh-lam opened 3 years ago

thanh-lam commented 3 years ago

Thank you for taking the time to submit an issue!

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

[/usr/mpi/gcc]> rpm -qa | grep -i openmpi
openmpi-4.0.3rc4-1.49017.ppc64le
mpitests_openmpi-3.2.20-e1a0676.49017.ppc64le

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Open MPI was installed as part of the Mellanox software distribution.

[/usr/mpi/gcc/openmpi-4.0.3rc4]> ls
bin  doc  etc  include  lib64  share  tests

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

Please describe the system on which you are running


Details of the problem

This is a simple test program that calls shmem_alltoall32() for data communication. The test can be varied with different numbers of tasks and nodes.

Program source is available if needed. It essentially does the following:

    /* Number of participating PEs and the per-PE element count. */
    pe_size = n_pes;
    elms_per_task = BUF_LEN / sizeof(int) / pe_size;

    /* Initialize the symmetric pSync work array before the collective. */
    for (size_t i = 0; i < SHMEM_ALLTOALL_SYNC_SIZE; i++)
        psync[i] = SHMEM_SYNC_VALUE;

    shmem_barrier_all();

    if (rank < pe_size) {
        shmem_alltoall32((void *)x32bit_buffer_dest, (void *)x32bit_buffer_source,
                         elms_per_task, 0, 0, pe_size, psync);
    }

    shmem_barrier_all();
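
For reference, shmem_alltoall32() requires that the data buffers and the psync array be symmetric objects. A minimal sketch of the declarations the fragment above assumes (the BUF_LEN value here is illustrative, not taken from the actual program):

#include <shmem.h>
#include <stdint.h>

#define BUF_LEN (1 * 1024 * 1024)   /* illustrative size only */

/* psync must be a symmetric array of at least SHMEM_ALLTOALL_SYNC_SIZE longs. */
long psync[SHMEM_ALLTOALL_SYNC_SIZE];

/* The source/destination buffers must also be symmetric, e.g. allocated
 * with shmem_malloc() after shmem_init():
 *   x32bit_buffer_source = shmem_malloc(BUF_LEN);
 *   x32bit_buffer_dest   = shmem_malloc(BUF_LEN);
 */
uint32_t *x32bit_buffer_source, *x32bit_buffer_dest;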

The oshrun command is as follows (to run all 32 tasks on one node):

oshrun -verbose -host <host>:32  -np 32 -N 32 shmem_alltoall -s 131072

That run completed without issue. However, if I added a second node and ran 31 tasks on the first node and 1 task on the second, the test segfaulted.

oshrun -verbose -host <host1>:32,<host2>:32  -np 32 -N 31 shmem_alltoall -s 131072
...
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
oshrun noticed that process rank 31 with PID 2045122 on node f2n06 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
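
(Equivalently, the 31-plus-1 placement above can be written with explicit per-host slot counts; this is just another way to express the same layout, assuming Open MPI's usual host:slots syntax:)

oshrun -verbose -host <host1>:31,<host2>:1 -np 32 shmem_alltoall -s 131072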

That produced the following gdb stack trace on the second node:

GNU gdb (GDB) Red Hat Enterprise Linux 8.2-11.el8
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "ppc64le-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from shmem_alltoall...done.
[New LWP 2046595]
[New LWP 2046597]
[New LWP 2046604]
[New LWP 2046596]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/power9/libthread_db.so.1".
Core was generated by `shmem_alltoall -s 131072 '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x000020001fc565dc in mca_spml_ucx_get_mkey (module=0x20001fc703c8 <mca_spml_ucx>, rva=<synthetic pointer>, 
    va=0x4af2cd40, pe=<optimized out>, ctx=0x20001fc70598 <mca_spml_ucx_ctx_default>)
    at ../../../../oshmem/mca/spml/ucx/spml_ucx.h:235
235 ../../../../oshmem/mca/spml/ucx/spml_ucx.h: No such file or directory.
[Current thread is 1 (Thread 0x200000046c00 (LWP 2046595))]
Missing separate debuginfos, use: yum debuginfo-install glibc-2.28-101.el8.ppc64le libibcm-41mlnx1-OFED.4.1.0.1.0.49017.ppc64le libibumad-43.1.1.MLNX20200211.078947f-0.1.49017.ppc64le libibverbs-41mlnx1-OFED.4.9.0.0.7.49017.ppc64le libmlx4-41mlnx1-OFED.4.7.3.0.3.49017.ppc64le libmlx5-41mlnx1-OFED.4.9.0.1.2.49017.ppc64le libnl3-3.5.0-1.el8.ppc64le librdmacm-41mlnx1-OFED.4.7.3.0.6.49017.ppc64le librxe-41mlnx1-OFED.4.4.2.4.6.49017.ppc64le numactl-libs-2.0.12-9.el8.ppc64le ucx-1.8.0-1.49017.ppc64le ucx-cma-1.8.0-1.49017.ppc64le ucx-ib-1.8.0-1.49017.ppc64le ucx-ib-cm-1.8.0-1.49017.ppc64le ucx-knem-1.8.0-1.49017.ppc64le ucx-rdmacm-1.8.0-1.49017.ppc64le zlib-1.2.11-16.el8_2.ppc64le
(gdb) bt
#0  0x000020001fc565dc in mca_spml_ucx_get_mkey (module=0x20001fc703c8 <mca_spml_ucx>, rva=<synthetic pointer>, 
    va=0x4af2cd40, pe=<optimized out>, ctx=0x20001fc70598 <mca_spml_ucx_ctx_default>)
    at ../../../../oshmem/mca/spml/ucx/spml_ucx.h:235
#1  mca_spml_ucx_get (ctx=0x20001fc70598 <mca_spml_ucx_ctx_default>, src_addr=0x4af2cd40, size=8, 
    dst_addr=0x7fffc32c0c80, src=<optimized out>) at spml_ucx.c:824
#2  0x000020001fcb1f74 in _algorithm_recursive_doubling (group=group@entry=0x4aa15b30, 
    pSync=pSync@entry=0x4af2cd40) at scoll_basic_barrier.c:387
#3  0x000020001fcb227c in _algorithm_adaptive (pSync=<optimized out>, group=<optimized out>)
    at scoll_basic_barrier.c:581
#4  mca_scoll_basic_barrier (group=0x4aa15b30, pSync=0x4af2cd40, alg=5) at scoll_basic_barrier.c:75
#5  0x000020001fcb7e04 in mca_scoll_basic_alltoall (group=0x4aa15b30, target=<optimized out>, 
    source=0xff0000d8, dst=<optimized out>, sst=<optimized out>, nelems=40960, element_size=<optimized out>, 
    pSync=0x4af2cd40, alg=-1) at scoll_basic_alltoall.c:87
#6  0x00002000000e684c in _shmem_alltoall (pSync=0x4af2cd40, PE_size=32, logPE_stride=<optimized out>, 
    PE_start=0, element_size=4, nelems=40960, sst=1, dst=1, source=0xff0000d8, target=0xff5000e0)
    at pshmem_alltoall.c:86
#7  pshmem_alltoall32 (target=0xff5000e0, source=0xff0000d8, nelems=40960, PE_start=<optimized out>, 
    logPE_stride=<optimized out>, PE_size=<optimized out>, pSync=0x4af2cd40) at pshmem_alltoall.c:108
#8  0x0000000010001038 in main (argc=3, argv=0x7fffc32c14e8) at shmem_alltoall.c:84
(gdb) 
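
To summarize the backtrace: the barrier used internally by the alltoall (frames #2 to #4) performs an 8-byte get of a peer's pSync word (frame #1), and the segfault happens while resolving the UCX memory key for that symmetric address (frame #0). Conceptually, the faulting step is equivalent to the following user-level operation (illustrative only; peer_pe is a placeholder, and the real call path is inside the spml/scoll components):

/* Illustrative equivalent of the get in frame #1: an 8-byte blocking
 * get of the peer's pSync slot into a local temporary. */
long remote_sync;
shmem_long_get(&remote_sync, &pSync[0], 1, peer_pe);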
thanh-lam commented 3 years ago

Hello @jladd-mlnx , could you take a look? Thanks!

jladd-mlnx commented 3 years ago

@thanh-lam - hi, we will take a look at this. @janjust - please take a look.

janjust commented 2 years ago

@thanh-lam Is this still an issue? I tried reproducing with the code below, but on an x86 system.

FWIW I tried v4.0.x, v4.1.x, and v5.x, but saw no segfault.

$oshrun --version
oshrun (OpenRTE) 4.0.7rc1

Report bugs to http://www.open-mpi.org/community/help/
tomislavj@helios010:/global/scratch/users/tomislavj/oshmem/tests
$ucx_info -v
# UCT version=1.12.1 revision dc92435
# configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check --enable-mt --prefix=/global/scratch/users/tomislavj/oshmem/build-ucx/install --without-java
tomislavj@helios010:/global/scratch/users/tomislavj/oshmem/tests
$cat shmem_barrier.c
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <shmem.h>

#define BUF_LEN (1 * 1024 * 1024)

long pSync[SHMEM_ALLTOALL_SYNC_SIZE];

int main(void)
{
        int me, pe_size, elms_per_task;
        uint32_t *dst, *src;
        size_t i;

        shmem_init();

        pe_size = num_pes();
        me = my_pe();

        /* Symmetric source/destination buffers. */
        dst = shmem_malloc(BUF_LEN);
        src = shmem_malloc(BUF_LEN);

        elms_per_task = BUF_LEN / sizeof(int) / pe_size;

        /* Initialize the pSync work array before the collective. */
        for (i = 0; i < SHMEM_ALLTOALL_SYNC_SIZE; i++) {
                pSync[i] = SHMEM_SYNC_VALUE;
        }

        shmem_barrier_all();

        if (me < pe_size) {
                shmem_alltoall32((void *)dst, (void *)src,
                                 elms_per_task, 0, 0, pe_size, pSync);
        }

        shmem_barrier_all();

        shmem_free(dst);
        shmem_free(src);
        shmem_finalize();

        return 0;
}
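
For completeness, a sketch of how this reproducer could be built and launched to mirror the failing layout (hostnames are placeholders; assumes Open MPI's standard oshcc/oshrun wrappers):

# Build with the OpenSHMEM compiler wrapper shipped with Open MPI.
oshcc -O2 -o shmem_barrier shmem_barrier.c

# Single-node run (passes for the reporter).
oshrun -np 32 -N 32 -host <host1>:32 ./shmem_barrier

# Two-node run that triggered the segfault (31 ranks on host1, 1 on host2).
oshrun -np 32 -N 31 -host <host1>:32,<host2>:32 ./shmem_barrier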
thanh-lam commented 2 years ago

Hello @janjust ,

I just checked the version of oshrun on my systems and it is:

[/usr/mpi/gcc/openmpi-4.0.3rc4/bin]> ./oshrun --version
oshrun (OpenRTE) 4.0.3rc4

Report bugs to http://www.open-mpi.org/community/help/

Can I try to verify with this level, or should I wait for 4.0.7rc1?

Thanks!

janjust commented 2 years ago

Can you try with v4.0.7? I'll try with 4.0.3rc3 and see if it reproduces for me.