Closed wzamazon closed 3 years ago
I got the following backtrace from the core dump:
#0 0x00007fd81e3bccc3 in mca_btl_ofi_put (btl=0x1e340b0, endpoint=0x21010f0, local_address=0x234c000,
remote_address=33637600, local_handle=0x0, remote_handle=0x0, size=1, flags=0, order=255,
cbfunc=0x7fd80e94a9fb <ompi_osc_rdma_put_complete_flush>, cbcontext=0x685e2a0, cbdata=0x1e25e00)
at btl_ofi_rdma.c:116
#1 0x00007fd80e94ac19 in ompi_osc_rdma_put_real (sync=0x685fc10, peer=0x23574e0, target_address=33637600,
target_handle=0x0, ptr=0x234c000, local_handle=0x0, size=1,
cb=0x7fd80e94a9fb <ompi_osc_rdma_put_complete_flush>, context=0x685e2a0, cbdata=0x1e25e00)
at osc_rdma_comm.c:457
#2 0x00007fd80e94aed6 in ompi_osc_rdma_put_contig (sync=0x685fc10, peer=0x23574e0, target_address=33637600,
target_handle=0x0, source_buffer=0x681b4e0, size=1, request=0x0) at osc_rdma_comm.c:529
#3 0x00007fd80e94a7ac in ompi_osc_rdma_master (sync=0x685fc10, local_address=0x681b4e0, local_count=1,
local_datatype=0x674860 <ompi_mpi_byte>, peer=0x23574e0, remote_address=33637600, remote_handle=0x0,
remote_count=1, remote_datatype=0x674860 <ompi_mpi_byte>, request=0x0, max_rdma_len=8388608,
rdma_fn=0x7fd80e94acd2 <ompi_osc_rdma_put_contig>, alloc_reqs=false) at osc_rdma_comm.c:350
#4 0x00007fd80e94ba4b in ompi_osc_rdma_put_w_req (sync=0x685fc10, origin_addr=0x681b4e0, origin_count=1,
origin_datatype=0x674860 <ompi_mpi_byte>, peer=0x23574e0, target_disp=0, target_count=1,
target_datatype=0x674860 <ompi_mpi_byte>, request=0x0) at osc_rdma_comm.c:772
#5 0x00007fd80e94bceb in ompi_osc_rdma_put (origin_addr=0x681b4e0, origin_count=1,
origin_datatype=0x674860 <ompi_mpi_byte>, target_rank=1, target_disp=0, target_count=1,
target_datatype=0x674860 <ompi_mpi_byte>, win=0x6819fb0) at osc_rdma_comm.c:834
#6 0x00007fd8264fe596 in PMPI_Put (origin_addr=0x681b4e0, origin_count=1,
origin_datatype=0x674860 <ompi_mpi_byte>, target_rank=1, target_disp=0, target_count=1,
target_datatype=0x674860 <ompi_mpi_byte>, win=0x6819fb0) at pput.c:82
#7 0x000000000044d209 in IMB_rma_single_put ()
#8 0x000000000042d66f in Bmark_descr::IMB_init_buffers_iter(comm_info*, iter_schedule*, Bench*, cmode*, int, int) ()
#9 0x000000000042e87f in OriginalBenchmark<BenchmarkSuite<(benchmark_suite_t)3>, &IMB_rma_single_put>::run(scope_item const&) ()
#10 0x0000000000405a4b in main ()
So the btl/ofi component was trying to access a NULL pointer.
Because this error does not happen with ompi 4.1.x, it must be caused by a commit unique to the master branch.
I did some bisecting and was able to locate the commit that caused the error:
commit a2b5bdefc7e094a9b989f8c84885dfd3b4cdf382
Author: Nathan Hjelm <hjelmn@google.com>
Date: Mon Feb 8 17:10:10 2021 -0700
osc/rdma: add support for "alternate" btls
This commit updates osc/rdma to support using alternate BTLs when
a primary BTL is not available. There may be at most two
alternate BTLs in use at any time. The default is selected to
cover shared memory (sm) and off-node (tcp).
The priority of osc/rdma is a bit lower when using a set of
alternate btls. This will allow another osc component to win if
there is an alternative.
Signed-off-by: Nathan Hjelm <hjelmn@google.com>
With some further debugging, I was able to narrow the problem down to the following 3 lines of code in allocate_state_shared:
if (local_size == global_size) {
    module->use_memory_registration = false;
}
If I understand correctly, the intention of these 3 lines is that when all MPI ranks are on the same machine, btl/sm will be used, so there is no need to do memory registration (hence module->use_memory_registration is set to false).
However, this is only correct when alternate btls are used; it is not right when an original btl (such as btl/ofi) is used. When btl/ofi is selected, it needs memory registration even when all ranks are on the same instance.
Meanwhile, if alternate btls are being used, module->use_memory_registration should already be false, because the function ompi_osc_rdma_query_alternate_btls skips any btl that requires memory registration.
All in all, I believe these 3 lines are unnecessary and should be removed.
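For illustration, here is a minimal, self-contained sketch of the selection behaviour described above (the types and field names are simplified stand-ins for illustration, not the actual Open MPI structures): any btl that advertises a registration requirement is skipped, so a module built only from alternate btls never needs memory registration in the first place.
/* Hypothetical sketch only -- NOT the actual Open MPI implementation.
 * The struct and field names below are simplified stand-ins. */
#include <stddef.h>
#include <stdbool.h>

typedef struct example_btl {
    const char *name;
    /* non-NULL means this btl requires memory registration */
    void *(*register_mem)(void *base, size_t len);
} example_btl_t;

typedef struct example_module {
    const example_btl_t *alternate_btls[2]; /* at most two alternates */
    int                  num_alternate;
    bool                 use_memory_registration;
} example_module_t;

static int query_alternate_btls_sketch(const example_btl_t *candidates, int n,
                                       example_module_t *module)
{
    module->num_alternate = 0;

    for (int i = 0; i < n && module->num_alternate < 2; ++i) {
        /* btls that require memory registration are skipped outright */
        if (NULL != candidates[i].register_mem) {
            continue;
        }
        module->alternate_btls[module->num_alternate++] = &candidates[i];
    }

    /* only registration-free btls were accepted, so this flag is already
     * false and never needs to be forced to false again later */
    module->use_memory_registration = false;

    return module->num_alternate;
}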
After removing these 3 lines, the 2-process test passes.
However, the 16-node test still fails. The error is caused by a NULL endpoint in the btl/ofi component. I believe this is a separate bug.
@wzamazon your last comment -
However, the 16-node test still fails. The error is caused by a NULL endpoint in the btl/ofi component. I believe this is a separate bug.
sounds similar to what I see with osc/rdma + btl/tcp. Can you post a stack trace?
Sure. The stack trace is:
Missing separate debuginfos, use: debuginfo-install glibc-2.26-48.amzn2.x86_64 hwloc-libs-1.11.8-4.amzn2.x86_64 libatomic-7.3.1-13.amzn2.x86_64 libevent-2.0.21-4.amzn2.0.3.x86_64 libgcc-7.3.1-13.amzn2.x86_64 libibverbs-core-32.1-2.amzn2.0.2.x86_64 libnl3-3.2.28-4.amzn2.0.1.x86_64 libstdc++-7.3.1-13.amzn2.x86_64 libtool-ltdl-2.4.2-22.2.amzn2.0.2.x86_64 lustre-client-2.10.8-5.amzn2.x86_64 numactl-libs-2.0.9-7.amzn2.x86_64 zlib-1.2.7-18.amzn2.x86_64
(gdb) bt
#0 0x00007f30ec1b5c20 in raise () from /lib64/libc.so.6
#1 0x00007f30ec1b70c8 in abort () from /lib64/libc.so.6
#2 0x00007f30ec1ae9ca in __assert_fail_base () from /lib64/libc.so.6
#3 0x00007f30ec1aea42 in __assert_fail () from /lib64/libc.so.6
#4 0x00007f30ebf04892 in mca_btl_ofi_rdma_completion_alloc (btl=0xd6e660, endpoint=0x0, ofi_context=0xe14050, local_address=0x64b9000,
local_handle=0x6138dd8, cbfunc=0x7f30ed36f594 <ompi_osc_rdma_atomic_complete>, cbcontext=0xab69a10, cbdata=0x0, type=3) at btl_ofi_rdma.c:29
#5 0x00007f30ebf0530e in mca_btl_ofi_afop (btl=0xd6e660, endpoint=0x0, local_address=0x64b9000, remote_address=139848121143624,
local_handle=0x6138dd8, remote_handle=0x7f30ed955130, op=MCA_BTL_ATOMIC_ADD, operand=4294967296, flags=0, order=255,
cbfunc=0x7f30ed36f594 <ompi_osc_rdma_atomic_complete>, cbcontext=0xab69a10, cbdata=0x0) at btl_ofi_atomics.c:57
#6 0x00007f30ed37595c in ompi_osc_rdma_btl_fop (cbcontext=0x0, cbdata=0x0, cbfunc=0x0, wait_for_completion=true, result=0x7ffdd6d38a78, flags=0,
operand=4294967296, op=1, address_handle=0x7f30ed955130, address=139848121143624, endpoint=0x0, btl_index=0 '\000', module=0xab5e460)
at osc_rdma_lock.h:75
#7 ompi_osc_rdma_lock_btl_fop (wait_for_completion=<optimized out>, result=0x7ffdd6d38a78, operand=4294967296, op=1, address=139848121143624,
peer=0x6130fe0, module=0xab5e460) at osc_rdma_lock.h:113
#8 ompi_osc_rdma_lock_acquire_shared (module=0xab5e460, peer=0x6130fe0, value=4294967296, offset=0, check=4294967295) at osc_rdma_lock.h:311
#9 0x00007f30ed3785a6 in ompi_osc_rdma_lock_all_atomic (mpi_assert=0, win=0x62909c0) at osc_rdma_passive_target.c:355
#10 0x00007f30ed16fbc8 in PMPI_Win_lock_all (mpi_assert=0, win=0x62909c0) at pwin_lock_all.c:56
#11 0x000000000044db4f in IMB_rma_put_all_local (c_info=0x15b6bd0, size=0, iterations=0x15b6d08, run_mode=0x15b6d94, time=0x7ffdd6d38d00)
at ../src_c/IMB_rma_put.c:291
#12 0x000000000042d657 in Bmark_descr::IMB_init_buffers_iter (this=0xbfbd20, c_info=0x15b6bd0, ITERATIONS=0x15b6d08, Bmark=0x15b6d78,
BMODE=0x15b6d94, iter=0, size=0) at helpers/helper_IMB_functions.h:607
#13 0x0000000000434123 in OriginalBenchmark<BenchmarkSuite<(benchmark_suite_t)3>, &IMB_rma_put_all_local>::run (this=0x15b6ba0, item=...)
at helpers/original_benchmark.h:209
#14 0x0000000000405a9e in main (argc=7, argv=0x7ffdd6d39968) at imb.cpp:347
I believe this error is caused by the following lines of code in the same function:
if (0 == i) {
    local_leader = peer;
}
ex_peer = (ompi_osc_rdma_peer_extended_t *) peer;
/* set up peer state */
if (module->use_cpu_atomics) {
    ...
} else {
    ...
    peer->state = (osc_rdma_counter_t) ((uintptr_t) state_region->base + state_base + module->state_size * i);
    if (i > 0) {
        peer->state_endpoint = local_leader->state_endpoint;
        peer->state_btl_index = local_leader->state_btl_index;
    }
}
Here, local_leader is the first peer, so basically all other peers get their state_endpoint from the 1st peer's state_endpoint. However, the state_endpoint of the 1st peer was never set!
So, IMO, the correct code should be:
if (i == 0) {
    peer->state_endpoint = peer->data_endpoint;
    peer->state_btl_index = peer->data_btl_index;
} else {
    peer->state_endpoint = local_leader->state_endpoint;
    peer->state_btl_index = local_leader->state_btl_index;
}
This sets the 1st peer's state_endpoint from its data_endpoint.
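Put back into the else branch quoted earlier, the fixed section would look roughly like this (a sketch assembled from the snippets above, with my own comments, not a verbatim patch):
} else {
    ...
    peer->state = (osc_rdma_counter_t) ((uintptr_t) state_region->base + state_base + module->state_size * i);
    if (i == 0) {
        /* the local leader's state_endpoint was never initialized;
         * fall back to its own data endpoint */
        peer->state_endpoint = peer->data_endpoint;
        peer->state_btl_index = peer->data_btl_index;
    } else {
        /* all other peers inherit the local leader's state endpoint */
        peer->state_endpoint = local_leader->state_endpoint;
        peer->state_btl_index = local_leader->state_btl_index;
    }
}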
@awlauria @hjelmn Does the above make sense? I will open a PR to address the two issues.
@wzamazon thanks. Unfortunately it does not seem to be related to the rdma/tcp issues that I see.
PR has been merged
Thank you for taking the time to submit an issue!
Background information
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
master branch
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
from git clone, then configured with
If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.
e85b814db68e46d1a9daa30a89d9e00f964fbd05 3rd-party/openpmix (v1.1.3-3095-ge85b814d)
fe0cc05e9cf7ff4b49565afcc334937d7e0b995b 3rd-party/prrte (psrvr-v2.0.0rc1-3983-gfe0cc05e9c)
Please describe the system on which you are running
Details of the problem
When running IMB-RMA with ompi master branch, the application crashed: