openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org

UCX ibv_fork_init() causes libfabric to fail in Open MPI #6420

Open SeyedMir opened 3 years ago

SeyedMir commented 3 years ago

Describe the bug

When Open MPI is built with both UCX and OFI, it tries to load the UCX PML. If UCX ends up not being used (because it did not load successfully, or because of priority), libfabric (and hence the CM PML) fails too. This is because UCX makes a call to ibv_fork_init() (https://github.com/openucx/ucx/blob/9c0104ec9a9863b863aa441690a8a935d7abb8a1/src/uct/ib/base/ib_md.c#L1529). On EC2 instances with EFA, this issue results in Open MPI failure unless the user excludes the UCX PML.

[ip-172-31-71-148:38143] select: initializing pml component cm

libibverbs fork support is not supported by the EFA Libfabric
provider when memory registrations are handled by the provider.

Fork support may currently be enabled via the RDMAV_FORK_SAFE
or IBV_FORK_SAFE environment variable or another library in your
application may be calling ibv_fork_init().

Please refer to https://github.com/ofiwg/libfabric/issues/6332
for more information. Your job will now abort.
[ip-172-31-76-181:04758] *** Process received signal ***
[ip-172-31-76-181:04758] Signal: Aborted (6)
[ip-172-31-76-181:04758] Signal code:  (-6)
[ip-172-31-76-181:04758] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x3f040)[0x7f7edc802040]
[ip-172-31-76-181:04758] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xc7)[0x7f7edc801fb7]
[ip-172-31-76-181:04758] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x141)[0x7f7edc803921]
[ip-172-31-76-181:04758] [ 3] /opt/amazon/efa/lib/libfabric.so.1(+0x5b8ec)[0x7f7ed47d28ec]
[ip-172-31-76-181:04758] [ 4] /opt/amazon/efa/lib/libfabric.so.1(+0x63937)[0x7f7ed47da937]
[ip-172-31-76-181:04758] [ 5] /home/ubuntu/repos/ompi/_build/lib/openmpi/mca_mtl_ofi.so(+0x7b62)[0x7f7eccde5b62]
[ip-172-31-76-181:04758] [ 6] /home/ubuntu/repos/ompi/_build/lib/openmpi/mca_mtl_ofi.so(+0xbdee)[0x7f7eccde9dee]
[ip-172-31-76-181:04758] [ 7] /home/ubuntu/repos/ompi/_build/lib/libmpi.so.0(ompi_mtl_base_select+0xe2)[0x7f7edccfd0ec]
[ip-172-31-76-181:04758] [ 8] /home/ubuntu/repos/ompi/_build/lib/openmpi/mca_pml_cm.so(+0x6f29)[0x7f7ece04af29]
[ip-172-31-76-181:04758] [ 9] /home/ubuntu/repos/ompi/_build/lib/libmpi.so.0(mca_pml_base_select+0x384)[0x7f7edcd0f7a2]
[ip-172-31-76-181:04758] [10] /home/ubuntu/repos/ompi/_build/lib/libmpi.so.0(ompi_mpi_init+0x9d2)[0x7f7edcd28aa3]
[ip-172-31-76-181:04758] [11] /home/ubuntu/repos/ompi/_build/lib/libmpi.so.0(MPI_Init+0x8e)[0x7f7edcc7e6ce]
[ip-172-31-76-181:04758] [12] /home/ubuntu/packages/osu-micro-benchmarks-5.6.3/_build/libexec/osu-micro-benchmarks/mpi/pt2pt//osu_latency(+0x17a2)[0x556398d907a2]
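The error message above names two environment variables that can enable fork support. A minimal sketch of a check an application could run to rule that cause out is shown here. The helper name `fork_safe_requested_by_env` is hypothetical, and the sketch assumes that the mere presence of either variable counts as a request, without inspecting its value.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical helper: returns true if either variable named in the error
 * message is present in the environment.
 * Assumption: presence alone is treated as a request for fork support;
 * the value of the variable is not inspected. */
static bool fork_safe_requested_by_env(void)
{
    return getenv("RDMAV_FORK_SAFE") != NULL ||
           getenv("IBV_FORK_SAFE") != NULL;
}
```

If this returns false and the abort still occurs, the remaining cause listed in the message is another library in the process calling ibv_fork_init(), which is the case reported in this issue.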

Steps to Reproduce

yosefe commented 3 years ago

Why is ibv_fork_init() bad for EFA?

yosefe commented 3 years ago

According to https://github.com/ofiwg/libfabric/issues/6332, this seems more of a limitation of the libfabric EFA provider than incorrect behavior in UCX.

SeyedMir commented 3 years ago

@rajachan

rajachan commented 3 years ago

The limitation with the EFA provider is that it cannot support MPI applications that also require fork support today. As for this issue itself, the UCX PML is initializing fork support even when it is not the chosen PML, so it is leaving state around that impacts another PML. Is there any reason why ibv_fork_init() cannot be called after a particular transport has been chosen successfully? In the OMPI case, this would mean that fork safety would be initialized only if the UCX PML has a device and a transport it is going to successfully open and use.

rajachan commented 3 years ago

From a quick read of the code, uct_ib_md_open() gets a list of RDMA devices with ibv_get_device_list(). It goes on to call ibv_fork_init() unless the UCX_IB_FORK_INIT=n configuration is set. Then, for each device, it tries to open the device through one of the many underlying transports in UCX. uct_ib_verbs_md_open() does think it can talk to EFA with RC semantics and opens the device. Further down the init path, UCT fails to create a shared receive queue, which fails the UCX PML init. pml/cm is then selected while fork support has already been initialized.
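The sequence described above can be sketched as a small model. This is not UCX or Open MPI code: every function name below is a hypothetical stand-in, and a single boolean stands in for the process-wide state that ibv_fork_init() changes. It only illustrates why a failed first component can still affect the second one.

```c
#include <stdbool.h>

/* Process-wide flag, standing in for the state ibv_fork_init() changes. */
static bool fork_support_enabled = false;

/* Hypothetical stand-in for the memory-domain open step: fork support is
 * enabled before it is known whether any transport will actually work. */
static void model_md_open(bool fork_init_configured)
{
    if (fork_init_configured) {
        fork_support_enabled = true;
    }
}

/* Hypothetical stand-in for interface creation. In the thread, this is
 * the step that fails on EFA (shared receive queue creation). */
static bool model_iface_open(bool device_supports_srq)
{
    return device_supports_srq;
}

/* Hypothetical stand-in for the first component's selection attempt.
 * Returns true if the component can be used. */
static bool model_try_first_component(bool fork_init_configured,
                                      bool device_supports_srq)
{
    model_md_open(fork_init_configured);
    return model_iface_open(device_supports_srq);
}

/* Hypothetical stand-in for the second component, which refuses to run
 * once fork support has been enabled in the process. */
static bool model_second_component_can_run(void)
{
    return !fork_support_enabled;
}
```

In this model, calling `model_try_first_component(true, false)` returns false, yet `model_second_component_can_run()` also returns false afterwards, because the flag set during the failed attempt is never cleared. Passing `false` for `fork_init_configured` corresponds to the UCX_IB_FORK_INIT=n setting mentioned above.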

yosefe commented 3 years ago

ibv_fork_init() has to be called before any other Verbs call which may trigger memory registration (https://github.com/linux-rdma/rdma-core/blob/master/libibverbs/memory.c#L711). So, in order to check for UCX support, Open MPI first needs to initialize UCX, which includes calls to QP create, which also calls ibv_dontfork_range().

Adding @jgunthorpe. From the rdma-core Verbs API perspective, it should be legitimate to call ibv_fork_init() and expect that either the relevant providers will support it or ibv_fork_init() will return an error. It's weird that ibv_fork_init() is successful, but later on one of the providers calls abort() because it can't support it.

SeyedMir commented 3 years ago

This is what we use to decide whether RC is supported or not: https://github.com/openucx/ucx/blob/ad463435c014f18e98be04e2f4d54f27d15cf0f8/src/uct/ib/rc/verbs/rc_verbs_iface.c#L494. If ib_md->config.eth_pause is set, then we don't check the port link layer type and assume RC is supported. eth_pause is enabled by default through UCX_IB_ETH_PAUSE_ON=y:

#
# Whether or not 'Pause Frame' is enabled on an Ethernet network.
# Pause frame is a mechanism for temporarily stopping the transmission of data to
# ensure zero loss under congestion on Ethernet family computer networks.
# This parameter, if set to 'no', will disqualify IB transports that may not perform
# well on a lossy fabric when working with RoCE.
#
# syntax:    <y|n>
#
UCX_IB_ETH_PAUSE_ON=y

yosefe commented 3 years ago

UCX_IB_ETH_PAUSE_ON is no longer relevant and should be removed.

On the other hand, we check whether DC is supported by creating a DCT QP, which requires creating a CQ, which allocates a doorbell record page, which calls ibv_dontfork_range() (in uct_ib_mlx5dv_check_dc()).

AFAIK the safe way is to call ibv_fork_init() before any other verbs function.
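The ordering constraint can be illustrated with a simplified model. This is not the rdma-core implementation: the struct and function names are hypothetical, and the sketch assumes the behavior suggested by the linked memory.c, namely that a dont-fork request issued before fork initialization marks initialization as too late, after which the init call fails with EINVAL.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of the fork-support state kept by the Verbs library. */
struct fork_model {
    bool initialized; /* fork support has been set up */
    bool too_late;    /* a range was requested before initialization */
};

/* Model of the init call: succeeds if already initialized, fails with
 * EINVAL if memory was already handled without fork tracking. */
static int model_fork_init(struct fork_model *m)
{
    if (m->initialized) {
        return 0;
    }
    if (m->too_late) {
        return EINVAL;
    }
    m->initialized = true;
    return 0;
}

/* Model of a dont-fork request made on behalf of a QP or CQ creation.
 * Before initialization it only records that init can no longer succeed. */
static void model_dontfork_range(struct fork_model *m)
{
    if (!m->initialized) {
        m->too_late = true;
    }
}
```

Under these assumptions, calling `model_dontfork_range()` on a fresh model and then `model_fork_init()` yields EINVAL, while the reverse order succeeds. That is why probing a device first (which creates QPs and CQs) and enabling fork support afterwards is not a safe sequence.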

jgunthorpe commented 3 years ago

@yosefe This doesn't look like an rdma-core issue. The EFA provider looks like it will work with fork support, or at least it doesn't abort. The abort is coming from libfabric, and Gal once told me that is because the libfabric MR cache is incompatible with EFA and fork support, or something.

rajachan commented 3 years ago

Jason is mostly right, in that it has to do with EFA's use of a registration cache in libfabric. However, to prevent silent data corruption, the libfabric EFA provider detects whether ibv_fork_init() has been called and, under certain conditions, aborts with the error @SeyedMir posted. This will all change for the better in the not-too-distant future (see discussions in https://github.com/linux-rdma/rdma-core/pull/883). We are also working on some changes to the libfabric provider to handle this better until that kernel version becomes more mainstream.

This becomes a non-issue with Open MPI in light of https://github.com/open-mpi/ompi/pull/8496, but we should still make sure the RC QP detection is handled better in UCX. A simple ucx_info -d shows (incorrectly) that the EFA device can be used with the rc_verbs transport:

#      Transport: rc_verbs
#         Device: efa_0:1
#  System device: 0000:10:1b.0 (0)
[1614659259.376522] [p4d-st-p4d24xlarge-1:2340 :0]        rc_iface.c:499  UCX  ERROR ibv_create_srq() failed: Operation not supported
#   < failed to open interface >

rajachan commented 3 years ago

Looks like https://github.com/openucx/ucx/pull/6883 fixes the RC QP detection issue I mentioned in the previous comment.

SeyedMir commented 3 years ago

That's right. I created a separate issue https://github.com/openucx/ucx/issues/6882 for it because this issue is mainly about fork_init, so it should not be closed when https://github.com/openucx/ucx/pull/6883 is merged.