Closed wzamazon closed 3 years ago
I got the following backtrace from the core dump:
#0 0x00007fd81e3bccc3 in mca_btl_ofi_put (btl=0x1e340b0, endpoint=0x21010f0, local_address=0x234c000,
remote_address=33637600, local_handle=0x0, remote_handle=0x0, size=1, flags=0, order=255,
cbfunc=0x7fd80e94a9fb <ompi_osc_rdma_put_complete_flush>, cbcontext=0x685e2a0, cbdata=0x1e25e00)
at btl_ofi_rdma.c:116
#1 0x00007fd80e94ac19 in ompi_osc_rdma_put_real (sync=0x685fc10, peer=0x23574e0, target_address=33637600,
target_handle=0x0, ptr=0x234c000, local_handle=0x0, size=1,
cb=0x7fd80e94a9fb <ompi_osc_rdma_put_complete_flush>, context=0x685e2a0, cbdata=0x1e25e00)
at osc_rdma_comm.c:457
#2 0x00007fd80e94aed6 in ompi_osc_rdma_put_contig (sync=0x685fc10, peer=0x23574e0, target_address=33637600,
target_handle=0x0, source_buffer=0x681b4e0, size=1, request=0x0) at osc_rdma_comm.c:529
#3 0x00007fd80e94a7ac in ompi_osc_rdma_master (sync=0x685fc10, local_address=0x681b4e0, local_count=1,
local_datatype=0x674860 <ompi_mpi_byte>, peer=0x23574e0, remote_address=33637600, remote_handle=0x0,
remote_count=1, remote_datatype=0x674860 <ompi_mpi_byte>, request=0x0, max_rdma_len=8388608,
rdma_fn=0x7fd80e94acd2 <ompi_osc_rdma_put_contig>, alloc_reqs=false) at osc_rdma_comm.c:350
#4 0x00007fd80e94ba4b in ompi_osc_rdma_put_w_req (sync=0x685fc10, origin_addr=0x681b4e0, origin_count=1,
origin_datatype=0x674860 <ompi_mpi_byte>, peer=0x23574e0, target_disp=0, target_count=1,
target_datatype=0x674860 <ompi_mpi_byte>, request=0x0) at osc_rdma_comm.c:772
#5 0x00007fd80e94bceb in ompi_osc_rdma_put (origin_addr=0x681b4e0, origin_count=1,
origin_datatype=0x674860 <ompi_mpi_byte>, target_rank=1, target_disp=0, target_count=1,
target_datatype=0x674860 <ompi_mpi_byte>, win=0x6819fb0) at osc_rdma_comm.c:834
#6 0x00007fd8264fe596 in PMPI_Put (origin_addr=0x681b4e0, origin_count=1,
origin_datatype=0x674860 <ompi_mpi_byte>, target_rank=1, target_disp=0, target_count=1,
target_datatype=0x674860 <ompi_mpi_byte>, win=0x6819fb0) at pput.c:82
#7 0x000000000044d209 in IMB_rma_single_put ()
#8 0x000000000042d66f in Bmark_descr::IMB_init_buffers_iter(comm_info*, iter_schedule*, Bench*, cmode*, int, int) ()
#9 0x000000000042e87f in OriginalBenchmark<BenchmarkSuite<(benchmark_suite_t)3>, &IMB_rma_single_put>::run(scope_item const&) ()
#10 0x0000000000405a4b in main ()
So the btl/ofi component was trying to access a NULL pointer.
Because this error does not happen with ompi 4.1.x, it must be caused by a commit unique to the master branch.
I did some bisecting and was able to locate the commit that caused the error:
commit a2b5bdefc7e094a9b989f8c84885dfd3b4cdf382
Author: Nathan Hjelm <hjelmn@google.com>
Date: Mon Feb 8 17:10:10 2021 -0700
osc/rdma: add support for "alternate" btls
This commit updates osc/rdma to support using alternate BTLs when
a primary BTL is not available. There may be at most two
alternate BTLs in use at any time. The default is selected to
cover shared memory (sm) and off-node (tcp).
The priority of osc/rdma is a bit lower when using a set of
alternate btls. This will allow another osc component to win if
there is an alternative.
Signed-off-by: Nathan Hjelm <hjelmn@google.com>
With some further debugging, I was able to narrow the problem down to the following 3 lines of code in allocate_state_shared:
if (local_size == global_size) {
    module->use_memory_registration = false;
}
If I understand correctly, the intention of these 3 lines is that when all MPI ranks are on the same machine, btl/sm will be used, so there is no need to do memory registration (hence module->use_memory_registration is set to false).
However, this is only correct when alternate btls are used; it is not right when an original btl (such as btl/ofi) is used. When btl/ofi is selected, it needs memory registration even when all ranks are on the same instance.
Meanwhile, if alternate btls are being used, module->use_memory_registration should already be false, because the function ompi_osc_rdma_query_alternate_btls skips any btl that requires memory registration.
All in all, I believe these 3 lines are unnecessary and should be removed.
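For illustration, here is a minimal, self-contained sketch of the selection behaviour described above (the types and field names are simplified stand-ins for illustration, not the actual Open MPI structures): any btl that advertises a registration requirement is skipped, so a module built only from alternate btls never needs memory registration in the first place.
/* Hypothetical sketch only -- NOT the actual Open MPI implementation.
 * The struct and field names below are simplified stand-ins. */
#include <stddef.h>
#include <stdbool.h>

typedef struct example_btl {
    const char *name;
    /* non-NULL means this btl requires memory registration */
    void *(*register_mem)(void *base, size_t len);
} example_btl_t;

typedef struct example_module {
    const example_btl_t *alternate_btls[2]; /* at most two alternates */
    int                  num_alternate;
    bool                 use_memory_registration;
} example_module_t;

static int query_alternate_btls_sketch(const example_btl_t *candidates, int n,
                                       example_module_t *module)
{
    module->num_alternate = 0;

    for (int i = 0; i < n && module->num_alternate < 2; ++i) {
        /* btls that require memory registration are skipped outright */
        if (NULL != candidates[i].register_mem) {
            continue;
        }
        module->alternate_btls[module->num_alternate++] = &candidates[i];
    }

    /* only registration-free btls were accepted, so this flag is already
     * false and never needs to be forced to false again later */
    module->use_memory_registration = false;

    return module->num_alternate;
}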
After removing these 3 lines, the 2-process test passes.
However, the 16-node test still fails. The error is caused by a NULL endpoint in the btl/ofi component. I believe this is a separate bug.
@wzamazon your last comment -
However, the 16-node test still fails. The error is caused by a NULL endpoint in the btl/ofi component. I believe this is a separate bug.
sounds similar to what I see with osc/rdma + btl/tcp. Can you post a stack trace?
Sure. The stack trace is:
Missing separate debuginfos, use: debuginfo-install glibc-2.26-48.amzn2.x86_64 hwloc-libs-1.11.8-4.amzn2.x86_64 libatomic-7.3.1-13.amzn2.x86_64 libevent-2.0.21-4.amzn2.0.3.x86_64 libgcc-7.3.1-13.amzn2.x86_64 libibverbs-core-32.1-2.amzn2.0.2.x86_64 libnl3-3.2.28-4.amzn2.0.1.x86_64 libstdc++-7.3.1-13.amzn2.x86_64 libtool-ltdl-2.4.2-22.2.amzn2.0.2.x86_64 lustre-client-2.10.8-5.amzn2.x86_64 numactl-libs-2.0.9-7.amzn2.x86_64 zlib-1.2.7-18.amzn2.x86_64
(gdb) bt
#0 0x00007f30ec1b5c20 in raise () from /lib64/libc.so.6
#1 0x00007f30ec1b70c8 in abort () from /lib64/libc.so.6
#2 0x00007f30ec1ae9ca in __assert_fail_base () from /lib64/libc.so.6
#3 0x00007f30ec1aea42 in __assert_fail () from /lib64/libc.so.6
#4 0x00007f30ebf04892 in mca_btl_ofi_rdma_completion_alloc (btl=0xd6e660, endpoint=0x0, ofi_context=0xe14050, local_address=0x64b9000,
local_handle=0x6138dd8, cbfunc=0x7f30ed36f594 <ompi_osc_rdma_atomic_complete>, cbcontext=0xab69a10, cbdata=0x0, type=3) at btl_ofi_rdma.c:29
#5 0x00007f30ebf0530e in mca_btl_ofi_afop (btl=0xd6e660, endpoint=0x0, local_address=0x64b9000, remote_address=139848121143624,
local_handle=0x6138dd8, remote_handle=0x7f30ed955130, op=MCA_BTL_ATOMIC_ADD, operand=4294967296, flags=0, order=255,
cbfunc=0x7f30ed36f594 <ompi_osc_rdma_atomic_complete>, cbcontext=0xab69a10, cbdata=0x0) at btl_ofi_atomics.c:57
#6 0x00007f30ed37595c in ompi_osc_rdma_btl_fop (cbcontext=0x0, cbdata=0x0, cbfunc=0x0, wait_for_completion=true, result=0x7ffdd6d38a78, flags=0,
operand=4294967296, op=1, address_handle=0x7f30ed955130, address=139848121143624, endpoint=0x0, btl_index=0 '\000', module=0xab5e460)
at osc_rdma_lock.h:75
#7 ompi_osc_rdma_lock_btl_fop (wait_for_completion=<optimized out>, result=0x7ffdd6d38a78, operand=4294967296, op=1, address=139848121143624,
peer=0x6130fe0, module=0xab5e460) at osc_rdma_lock.h:113
#8 ompi_osc_rdma_lock_acquire_shared (module=0xab5e460, peer=0x6130fe0, value=4294967296, offset=0, check=4294967295) at osc_rdma_lock.h:311
#9 0x00007f30ed3785a6 in ompi_osc_rdma_lock_all_atomic (mpi_assert=0, win=0x62909c0) at osc_rdma_passive_target.c:355
#10 0x00007f30ed16fbc8 in PMPI_Win_lock_all (mpi_assert=0, win=0x62909c0) at pwin_lock_all.c:56
#11 0x000000000044db4f in IMB_rma_put_all_local (c_info=0x15b6bd0, size=0, iterations=0x15b6d08, run_mode=0x15b6d94, time=0x7ffdd6d38d00)
at ../src_c/IMB_rma_put.c:291
#12 0x000000000042d657 in Bmark_descr::IMB_init_buffers_iter (this=0xbfbd20, c_info=0x15b6bd0, ITERATIONS=0x15b6d08, Bmark=0x15b6d78,
BMODE=0x15b6d94, iter=0, size=0) at helpers/helper_IMB_functions.h:607
#13 0x0000000000434123 in OriginalBenchmark<BenchmarkSuite<(benchmark_suite_t)3>, &IMB_rma_put_all_local>::run (this=0x15b6ba0, item=...)
at helpers/original_benchmark.h:209
#14 0x0000000000405a9e in main (argc=7, argv=0x7ffdd6d39968) at imb.cpp:347
I believe this error is caused by the following lines of code in the same function:
if (0 == i) {
    local_leader = peer;
}
ex_peer = (ompi_osc_rdma_peer_extended_t *) peer;
/* set up peer state */
if (module->use_cpu_atomics) {
    ...
} else {
    ...
    peer->state = (osc_rdma_counter_t) ((uintptr_t) state_region->base + state_base + module->state_size * i);
    if (i > 0) {
        peer->state_endpoint = local_leader->state_endpoint;
        peer->state_btl_index = local_leader->state_btl_index;
    }
}
Here, local_leader is the first peer, so basically all other peers get their state_endpoint from the 1st peer's state_endpoint. However, the state_endpoint of the 1st peer was never set!
So, IMO, the correct code should be:
if (i == 0) {
    peer->state_endpoint = peer->data_endpoint;
    peer->state_btl_index = peer->data_btl_index;
} else {
    peer->state_endpoint = local_leader->state_endpoint;
    peer->state_btl_index = local_leader->state_btl_index;
}
This sets the 1st peer's state_endpoint from its data_endpoint.
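Put back into the else branch quoted earlier, the fixed section would look roughly like this (a sketch assembled from the snippets above, with my own comments, not a verbatim patch):
} else {
    ...
    peer->state = (osc_rdma_counter_t) ((uintptr_t) state_region->base + state_base + module->state_size * i);
    if (i == 0) {
        /* the local leader's state_endpoint was never initialized;
         * fall back to its own data endpoint */
        peer->state_endpoint = peer->data_endpoint;
        peer->state_btl_index = peer->data_btl_index;
    } else {
        /* all other peers inherit the local leader's state endpoint */
        peer->state_endpoint = local_leader->state_endpoint;
        peer->state_btl_index = local_leader->state_btl_index;
    }
}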
@awlauria @hjelmn Does the above make sense? I will open a PR to address the two issues.
@wzamazon thanks. Unfortunately it does not seem to be related to the rdma/tcp issues that I see.
PR has been merged
Thank you for taking the time to submit an issue!
Background information
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
master branch
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
from git clone, then configured with
If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.
e85b814db68e46d1a9daa30a89d9e00f964fbd05 3rd-party/openpmix (v1.1.3-3095-ge85b814d)
fe0cc05e9cf7ff4b49565afcc334937d7e0b995b 3rd-party/prrte (psrvr-v2.0.0rc1-3983-gfe0cc05e9c)
Please describe the system on which you are running
Details of the problem
When running IMB-RMA with ompi master branch, the application crashed: