open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

IMB-RMA failure on POWER #8102

loveshack opened this issue 3 years ago (status: Open)

loveshack commented 3 years ago

Thank you for taking the time to submit an issue!

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

4.0.5

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

spack install openmpi@4.0.5 +cuda +cxx +legacylaunchers +lustre fabrics=cma,knem,ucx schedulers=slurm

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

Please describe the system on which you are running


Details of the problem

IMB-RMA crashes at the start like this. A similar build on x86_64 runs, as does spectrum-mpi 10.3 on this system.

Is this known to work on Summit?

# Truly_passive_put
#     The benchmark measures execution time of MPI_Put for 2 cases:
#     1) The target is waiting in MPI_Barrier call (t_pure value)
#     2) The target performs computation and then enters MPI_Barrier routine (t_ovrl value)
[gpu027:91080:0:91080] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x8)
==== backtrace (tid:  91080) ====
=================================
[gpu027:91080] *** Process received signal ***
[gpu027:91080] Signal: Segmentation fault (11)
[gpu027:91080] Signal code:  (-6)
[gpu027:91080] Failing at address: 0x262292ca000163c8
[gpu027:91080] [ 0] [0x2000000504d8]
[gpu027:91080] [ 1] /users/***/spack/opt/spack/linux-rhel7-power9le/gcc-8.4.0/openmpi-4.0.5-6sqv24vyrwc5nerb7y5fslqnf5jrnjv6/lib/openmpi/mca_osc_rdma.so(ompi_osc_rdma_lock_atomic+0x94)[0x200014d19fb4]
[gpu027:91080] [ 2] /users/***/spack/opt/spack/linux-rhel7-power9le/gcc-8.4.0/openmpi-4.0.5-6sqv24vyrwc5nerb7y5fslqnf5jrnjv6/lib/libmpi.so.40(MPI_Win_lock+0x138)[0x2000001835a8]
[gpu027:91080] [ 3] IMB-RMA(IMB_rma_single_put+0x17c)[0x100d3d18]
[gpu027:91080] [ 4] IMB-RMA(_ZN11Bmark_descr21IMB_init_buffers_iterEP9comm_infoP13iter_scheduleP5BenchP5cmodeii+0xce0)[0x100a9ac8]
[gpu027:91080] [ 5] IMB-RMA(_ZN17OriginalBenchmarkI14BenchmarkSuiteIL17benchmark_suite_t3EEXadL_Z18IMB_rma_single_putEEE3runERK10scope_item+0x398)[0x100ab1a0]
[gpu027:91080] [ 6] IMB-RMA(main+0x19b0)[0x10060aa4]
[gpu027:91080] [ 7] /lib64/libc.so.6(+0x25200)[0x200000645200]
[gpu027:91080] [ 8] /lib64/libc.so.6(__libc_start_main+0xc4)[0x2000006453f4]
[gpu027:91080] *** End of error message ***
--------------------------------------------------------------------------
loveshack commented 3 years ago

I realize I didn't say that this is with IMB v2019.6, in case it makes a difference. And in case the backtrace from the C version (rather than the C++ version built at the top level) is more helpful, here it is:

[gpu026:26730] Signal: Segmentation fault (11)
[gpu026:26730] Signal code:  (-6)
[gpu026:26730] Failing at address: 0x262292ca0000686a
[gpu026:26730] [ 0] [0x2000000504d8]
[gpu026:26730] [ 1] /users/***/spack/opt/spack/linux-rhel7-power9le/gcc-8.4.0/openmpi-4.0.5-6sqv24vyrwc5nerb7y5fslqnf5jrnjv6/lib/openmpi/mca_osc_rdma.so(ompi_osc_rdma_lock_atomic+0x94)[0x200014d59fb4]
[gpu026:26730] [ 2] /users/***/spack/opt/spack/linux-rhel7-power9le/gcc-8.4.0/openmpi-4.0.5-6sqv24vyrwc5nerb7y5fslqnf5jrnjv6/lib/libmpi.so.40(MPI_Win_lock+0x138)[0x2000001435a8]
[gpu026:26730] [ 3] IMB-RMA(IMB_rma_single_put+0x17c)[0x1000f4e8]
[gpu026:26730] [ 4] IMB-RMA(IMB_init_buffers_iter+0xa18)[0x1000a0bc]
[gpu026:26730] [ 5] IMB-RMA(main+0x5b8)[0x100047d4]
[gpu026:26730] [ 6] /lib64/libc.so.6(+0x25200)[0x2000002a5200]
[gpu026:26730] [ 7] /lib64/libc.so.6(__libc_start_main+0xc4)[0x2000002a53f4]
devreal commented 3 years ago

Unfortunately, I don't have access to a POWER system. Does that happen when running on a single node? If you are willing to dig a little further, you could configure OMPI with --enable-debug to enable some more sanity checks internally. If you set CFLAGS="-O0 -g" during configure you can get complete backtraces from gdb/ddt and inspect some of the surrounding variables at the time the segfault happens, maybe you can spot something interesting/suspicious. A full backtrace from a debugger (with line numbers) may already give a hint to what is happening.
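For example, a debug build could be configured along these lines (the prefix is a placeholder; keep whatever other configure options you already use):

./configure --prefix=$HOME/ompi-4.0.5-debug --enable-debug CFLAGS="-O0 -g" [your existing configure options]

A line-numbered backtrace from gdb on the resulting core file (or from ddt) would then show exactly where in mca_osc_rdma the bad pointer comes from.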

wlepera commented 3 years ago

@loveshack I've built OMPI 4.0.5 for ppc64le and have tried recreating on a single P9 node and across two nodes (running RHEL 7.6), using IMB-RMA from the 2019.6 Intel benchmarks, but have not seen the problem so far. Could you please update the issue with the following:

Thanks

loveshack commented 3 years ago

> Unfortunately, I don't have access to a POWER system. Does that happen when running on a single node? If you are willing to dig a little further, you could configure OMPI with --enable-debug to enable some more sanity checks internally. If you set CFLAGS="-O0 -g" during configure you can get complete backtraces from gdb/ddt and inspect some of the surrounding variables at the time the segfault happens, maybe you can spot something interesting/suspicious. A full backtrace from a debugger (with line numbers) may already give a hint to what is happening.

That doesn't help, but I found out how to get a core dump on that system, and it fails at line 245

OBJ_RETAIN(peer);

with peer == 0. I haven't figured out why yet, which will probably be hard work as I'm not familiar with the code and I'm unable to use an interactive session.

loveshack commented 3 years ago
> • configure command line used to build OMPI
$ ompi_info|grep Configure\ command
  Configure command line: '--prefix=/users/***/spack/opt/spack/linux-rhel7-power9le/gcc-8.4.0/openmpi-4.0.5-6sqv24vyrwc5nerb7y5fslqnf5jrnjv6' '--enable-shared' '--disable-silent-rules' '--disable-builtin-atomics' '--with-pmi=/users/***/spack/opt/spack/linux-rhel7-power9le/gcc-8.4.0/slurm-19-05-6-1-v7o77d3qbebykh43rzwamk5bql5rynzo' '--with-zlib=/usr' '--enable-mpi1-compatibility' '--without-verbs' '--without-fca' '--without-psm2' '--with-cma' '--with-knem=/users/***/spack/opt/spack/linux-rhel7-power9le/gcc-8.4.0/knem-1.1.4-yymzr2ix6qhpzczsky3bkkw7si64mtxf' '--without-ofi' '--without-mxm' '--without-hcoll' '--without-xpmem' '--without-psm' '--with-ucx=/users/***/spack/opt/spack/linux-rhel7-power9le/gcc-8.4.0/ucx-1.9.0-3ovv5gnvh7bygwetkm7peiwrvtv2euxo' '--without-loadleveler' '--with-slurm' '--without-tm' '--without-sge' '--without-lsf' '--without-alps' '--disable-memchecker' '--with-lustre=/usr' '--with-hwloc=/users/***/spack/opt/spack/linux-rhel7-power9le/gcc-8.4.0/hwloc-2.2.0-l6whjl5phucord2rphfa22eh7yg764ry' '--disable-java' '--disable-mpi-java' '--enable-dlopen' '--with-cuda=/opt/software/apps/libs/CUDA/10.2.89' '--enable-wrapper-rpath' '--disable-wrapper-runpath' '--enable-mpi-cxx' '--disable-cxx-exceptions' '--with-wrapper-ldflags=-rdynamic -Wl,-rpath,/users/***/spack/opt/spack/linux-rhel7-power8le/gcc-4.8.5/gcc-8.4.0-jih3wzjgydz4muoy7zvxa2brcbwxmcbe/lib/gcc/powerpc64le-unknown-linux-gnu/8.4.0 -Wl,-rpath,/users/***/spack/opt/spack/linux-rhel7-power8le/gcc-4.8.5/gcc-8.4.0-jih3wzjgydz4muoy7zvxa2brcbwxmcbe/lib64'

[paths redacted]

> • launch command used to start the test
mpirun --map-by node --bind-to core -n 64 IMB-RMA
> • task geometry (number of nodes, ranks per node)
>
> Thanks

That's two full nodes, of 32 cores each. It runs with only two ranks and fails with more. It also runs with four ranks on a single node.

Thanks for any insight.

hjelmn commented 3 years ago

This is unlikely to be machine specific. I can try to reproduce this on a Cray system tomorrow.

wlepera commented 3 years ago

Recreated with np = 4 across two P9 nodes. Verified the issue will not recreate with 2 processes.

#0  0x00002000145bbb9c in ompi_osc_rdma_lock_atomic () from /smpi_dev/lepera/ompi_8102/local/lib/openmpi/mca_osc_rdma.so
#1  0x00002000001389a4 in PMPI_Win_lock () from /smpi_dev/lepera/ompi_8102/local/lib/libmpi.so.40
#2  0x000000001000e288 in IMB_rma_single_put ()
#3  0x0000000010008944 in IMB_init_buffers_iter ()
#4  0x0000000010002a54 in main ()

The stack trace looks similar to yours. Building a debug version of OMPI now.
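Since the failing path is just a passive-target MPI_Win_lock followed by MPI_Put, a minimal standalone sketch of the same pattern (my own code, not extracted from IMB) looks roughly like this:

/* rma_lock_repro.c -- hypothetical minimal sketch of the
 * MPI_Win_lock / MPI_Put / MPI_Win_unlock pattern IMB-RMA exercises. */
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, buf, value = 42;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    buf = rank;
    MPI_Win_create(&buf, sizeof(int), sizeof(int), MPI_INFO_NULL,
                   MPI_COMM_WORLD, &win);

    if (0 == rank) {
        for (int target = 1; target < size; ++target) {
            /* this is the call that ends up in ompi_osc_rdma_lock_atomic */
            MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);
            MPI_Put(&value, 1, MPI_INT, target, 0, 1, MPI_INT, win);
            MPI_Win_unlock(target, win);
        }
    }

    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}

e.g. mpicc rma_lock_repro.c -o rma_lock_repro && mpirun --map-by node --bind-to core -n 4 ./rma_lock_repro to match the failing configuration.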

wlepera commented 3 years ago

This looks like an incompatibility between the UCX PML and the RDMA OSC components, which are selected by default. As a workaround, you can force the UCX OSC component by adding "--mca osc ucx" to the mpirun command line.
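For example, applied to the launch line reported above, the workaround looks like:

mpirun --mca osc ucx --map-by node --bind-to core -n 64 IMB-RMA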

As previously noted, the segfault occurs in the OBJ_RETAIN macro because the "peer" variable passed to it is NULL. The NULL originates in the ompi_osc_rdma_peer_btl_endpoint function, which loops through the available BTL endpoints trying to find the one whose BTL matches module->selected_btl. From osc_rdma_peer.c:

struct mca_btl_base_endpoint_t *ompi_osc_rdma_peer_btl_endpoint (struct ompi_osc_rdma_module_t *module, int peer_id)
{
    ompi_proc_t *proc = ompi_comm_peer_lookup (module->comm, peer_id);
    mca_bml_base_endpoint_t *bml_endpoint;
    int num_btls;

    /* for now just use the bml to get the btl endpoint */
    bml_endpoint = mca_bml_base_get_endpoint (proc);

    num_btls = mca_bml_base_btl_array_get_size (&bml_endpoint->btl_rdma);

    for (int btl_index = 0 ; btl_index < num_btls ; ++btl_index) {
        if (bml_endpoint->btl_rdma.bml_btls[btl_index].btl == module->selected_btl) {
            return bml_endpoint->btl_rdma.bml_btls[btl_index].btl_endpoint;
        }
    }

    /* very unlikely. if this happened the btl selection process is broken */
    return NULL;
}

I added tracing to this function, which shows that the for loop iterates through all the BTLs before exiting and returning NULL. I also added an assert(0) just before the return NULL statement (a sketch of that instrumentation follows the gdb output below), and observed the following from the resulting core dump:

(gdb) up 4
#4  0x00002000148c888c in ompi_osc_rdma_peer_btl_endpoint (module=0x50084900, peer_id=0) at osc_rdma_peer.c:56
56  assert(0);
(gdb) p num_btls
$1 = 6
(gdb) p bml_endpoint->btl_rdma.bml_btls[0].btl
$2 = (struct mca_btl_base_module_t *) 0x4fa8f600
(gdb) p bml_endpoint->btl_rdma.bml_btls[1].btl
$3 = (struct mca_btl_base_module_t *) 0x4fa30430
(gdb) p bml_endpoint->btl_rdma.bml_btls[2].btl
$4 = (struct mca_btl_base_module_t *) 0x4fa2fdf0
(gdb) p bml_endpoint->btl_rdma.bml_btls[3].btl
$5 = (struct mca_btl_base_module_t *) 0x4fa7f520
(gdb) p bml_endpoint->btl_rdma.bml_btls[4].btl
$6 = (struct mca_btl_base_module_t *) 0x4fa33c80
(gdb) p bml_endpoint->btl_rdma.bml_btls[5].btl
$7 = (struct mca_btl_base_module_t *) 0x4fa7b310
(gdb) p module->selected_btl
$8 = (struct mca_btl_base_module_t *) 0x200002680210 <mca_btl_vader>

The BTLs in the array don't map to any named BTL, and their pointers appear to be in a different address range than the selected BTL (mca_btl_vader). Not sure if this is significant, though.
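For reference, the instrumentation described above amounts to roughly the following fragment inside ompi_osc_rdma_peer_btl_endpoint (a sketch only; the exact tracing is in the modified osc_rdma_peer.c attached below, and assert() needs <assert.h>):

    for (int btl_index = 0 ; btl_index < num_btls ; ++btl_index) {
        /* dump each candidate BTL pointer so it can be compared against
           module->selected_btl in the output and in the core dump */
        opal_output(0, "btl_rdma[%d].btl = %p, selected_btl = %p", btl_index,
                    (void *) bml_endpoint->btl_rdma.bml_btls[btl_index].btl,
                    (void *) module->selected_btl);
        if (bml_endpoint->btl_rdma.bml_btls[btl_index].btl == module->selected_btl) {
            return bml_endpoint->btl_rdma.bml_btls[btl_index].btl_endpoint;
        }
    }

    /* abort here instead of silently returning NULL so the core dump
       captures the state (this is the osc_rdma_peer.c:56 frame in gdb) */
    assert(0);
    return NULL;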

@jjhursey examined the selection code, and the RDMA OSC component's priority is higher than the UCX OSC component's priority, which is why the RDMA component is used by default.
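(For reference, the priorities driving that selection can be inspected with something like ompi_info --param osc rdma --level 9 | grep priority and ompi_info --param osc ucx --level 9 | grep priority; forcing --mca osc ucx as above simply bypasses the comparison.)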

Combinations that worked:

  1. UCX PML and UCX OSC
  2. OB1 PML and RDMA OSC

OMPI was built with UCX version 1.7.0.1

@jladd-mlnx, @artpol84, are either of you aware of any incompatibility between the UCX PML and the RDMA OSC components? Do you expect them to work together?

hjelmn commented 3 years ago

That is odd. Not sure why the bml endpoints are getting into an inconsistent state. Will have to take a look. This is a situation where osc/ucx should win, because no btl is available (btl/uct is not enabled by default). Can you run a debug build (configure with --enable-debug) and add --mca btl_base_verbose max --mca bml_base_verbose max --mca osc_base_verbose max? That will give us some idea of what is going wrong.

wlepera commented 3 years ago

@hjelmn:

I am including output files from two debug runs:

mpirun --prefix /smpi_dev/lepera/ompi_8102/local -np 4 -H f10n17:2,f10n18:2 --map-by node --bind-to core --tag-output --mca btl_base_verbose max --mca bml_base_verbose max --mca osc_base_verbose max ../../../../../mpi-benchmarks-IMB-v2019.6/src_c/IMB-RMA

This run fails; refer to the output file 8102_fail.out. Note that this failing run was made with my debug code that asserts right before the ompi_osc_rdma_peer_btl_endpoint() function erroneously returns NULL.

Note also that if we remove the "--map-by node" option, the test will run to completion (this is the second run):

mpirun --prefix /smpi_dev/lepera/ompi_8102/local -np 4 -H f10n17:2,f10n18:2 --bind-to core --tag-output --mca btl_base_verbose max --mca bml_base_verbose max --mca osc_base_verbose max ../../../../../mpi-benchmarks-IMB-v2019.6/src_c/IMB-RMA

Refer to the output file 8102_pass.out

Finally, I am including my modified copy of osc_rdma_peer.c, so you can match up the debug tracing I added to the source.

8102_11042020_tar.gz

loveshack commented 3 years ago

I found I can avoid this with --mca btl ^uct, following the UCX documentation, at least with OMPI 4.1, which still builds and uses it by default.

bosilca commented 3 years ago

Something is weird with your install. The configure command posted here specifically points to UCX 1.9 as the underlying UCX stack, but the UCT BTL prohibits anything higher than 1.8 from being used. I wonder how the UCT BTL got compiled in.

loveshack commented 3 years ago

I'm increasingly confused.

It does indeed seem the UCT BTL doesn't get built against UCX 1.9. I must have been going by the fact that the UCX doc says you have to disable it and makes no mention of any version dependence; that, and the fact that the failure looked consistent with memory corruption and --mca btl ^uct apparently cured it. Apologies for not checking properly. Perhaps someone can get the UCX doc fixed.

I've been able to work on it again and now find I can't reproduce ^uct solving the problem. It just fails, seemingly reproducibly, in ompi_osc_rdma_lock_atomic. (Perhaps it isn't really reproducible and it was a fluke when it worked.)

I also tried with ucx 1.8.1, with the same result.

Sorry about the general confusion.

jsquyres commented 3 years ago

@loveshack Just to be sure: are you installing one build of Open MPI over the prior build of Open MPI? Or are you always installing into a clean, new, empty directory tree?

I.e., I'm wondering if you built with an older UCX at one point, and therefore the Open MPI UCT BTL was built and installed. But then you re-built Open MPI with a newer UCX, and the Open MPI UCT BTL was not built... but you installed to the same tree, and therefore the previously-installed UCT BTL was still there.

loveshack commented 3 years ago

> @loveshack Just to be sure: are you installing one build of Open MPI over the prior build of Open MPI? Or are you always installing into a clean, new, empty directory tree?

I was using spack (with packaging from an openmpi maintainer, I think), which should mean the latter.

> I.e., I'm wondering if you built with an older UCX at one point, and therefore the Open MPI UCT BTL was built and installed.

No, I didn't. After several rebuilds I don't have the installation I was using, and I have no evidence the BTL was ever built. I just wanted to say that what I reported seems to have been wrong and should be disregarded, even though I can't account for it. I'm sorry for not being able to get things like this straight.

jsquyres commented 3 years ago

@loveshack Should I interpret your last comment to mean that we can close this issue (and #8379)?

loveshack commented 3 years ago

> @loveshack Should I interpret your last comment to mean that we can close this issue (and #8379)?

No. The failure is still there, as originally reported by me and on Summit. What I can't reproduce is the fix I thought I had by following what the UCX doc says (which turns out to be misleading). The problem appears to be in UCX, but the openmpi issue should stay open until UCX is fixed, particularly as it contains the original workaround for the failure.

yosefe commented 3 years ago

@loveshack does setting "-mca osc ucx" still work around the original failure?

loveshack commented 3 years ago

> @loveshack does setting "-mca osc ucx" still work around the original failure?

I don't know what "still" means -- with a different ucx or openmpi? I don't think there's any difference between openmpi 4.0 and 4.1, at least.

ggouaillardet commented 3 years ago

@loveshack let me try it this way:

are you able to reproduce the issue with -mca osc ucx on at least one combination of Open MPI and UCX? (and if yes, state which one and include a new stack trace)

loveshack commented 3 years ago

Just ignore everything I said after btl ^uct.

The workaround works in that case, as reported, but I don't remember what it changed in the MPICH test suite failures. I assume IBM have sorted this out in Spectrum MPI, but that doesn't help us.

wlepera commented 3 years ago

In Spectrum MPI, the default PML and OSC components are PAMI. Using SMPI 10.3.1, if I set -mca pml ucx and -mca osc rdma (simulating the OMPI defaults), I see the same failure.

awlauria commented 3 years ago

Is there any progress on this issue?