Open SeyedMir opened 3 years ago
Why is ibv_fork_init() bad for EFA?
According to https://github.com/ofiwg/libfabric/issues/6332, this seems to be more a limitation of the libfabric EFA provider than incorrect behavior in UCX.
@rajachan
The limitation with the EFA provider is that it cannot support MPI applications that also require fork support today. As for this issue itself, the UCX PML is initializing fork support even when it is not the chosen PML, so it is leaving state around that impacts another PML. Is there any reason why ibv_fork_init() cannot be called after a particular transport has been chosen successfully? In the OMPI case, this would mean that fork safety would be initialized only if the UCX PML has a device and a transport it is going to successfully open and use.
From a quick read of the code, uct_ib_md_open() gets a list of RDMA devices with ibv_get_device_list(). It goes on to call ibv_fork_init() unless the UCX_IB_FORK_INIT=n configuration is set. Then, for each device, it tries to open the device through one of the many underlying transports in UCX. uct_ib_verbs_md_open() does think it can talk to EFA with RC semantics and opens the device. Further down the init path, UCT fails to create a shared receive queue, which fails the UCX PML init and lets pml/cm be selected while fork support has already been initialized.
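To make the leftover-state problem concrete, here is a minimal sketch of the sequence described above. It is a toy model, not Open MPI, UCX, or libfabric code: every function name and the global flag are hypothetical stand-ins, and the real EFA provider aborts rather than returning a failure.

```c
#include <stdbool.h>
#include <stddef.h>

/* Process-wide flag standing in for the state that ibv_fork_init()
 * leaves behind (hypothetical; the real state lives inside libibverbs). */
static bool fork_support_enabled = false;

/* Hypothetical stand-in for the UCX PML probe: fork support is enabled
 * before any transport is known to work, then the probe fails on EFA. */
static bool probe_ucx_pml(bool device_is_efa)
{
    fork_support_enabled = true;   /* models the early ibv_fork_init() call */
    if (device_is_efa) {
        return false;              /* models the later ibv_create_srq() failure */
    }
    return true;
}

/* Hypothetical stand-in for the libfabric EFA provider, which refuses to
 * run once fork support has been enabled in the process. */
static bool init_ofi_efa(void)
{
    return !fork_support_enabled;
}

/* Try UCX first, then fall back to pml/cm over OFI.
 * Returns the selected PML name, or NULL if nothing could be initialized. */
static const char *select_pml(bool device_is_efa)
{
    if (probe_ucx_pml(device_is_efa)) {
        return "ucx";
    }
    if (init_ofi_efa()) {
        return "cm";
    }
    return NULL;
}
```

In this model, select_pml(true) returns NULL: the UCX probe fails, but the flag it set is still there when the fallback runs, which is the failure mode reported in this issue.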
ibv_fork_init() has to be called before any other Verbs call which may trigger memory registration https://github.com/linux-rdma/rdma-core/blob/master/libibverbs/memory.c#L711
So, in order to check for UCX support, Open MPI first needs to initialize UCX, which includes calls to create a QP, which in turn calls ibv_dontfork_range().
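The ordering constraint can be sketched as a small state machine. This is a toy model based on my reading of the linked memory.c, not the rdma-core implementation: the struct and the toy_* names are hypothetical, and the EINVAL return for a late call is an assumption about how the library reports the problem.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of the fork-related state kept by the verbs library. */
struct toy_verbs {
    bool fork_tracking_on;  /* set once fork init has succeeded */
    bool too_late;          /* set if memory was touched before fork init */
};

/* Models ibv_fork_init(): succeeds only if no untracked range exists yet. */
static int toy_fork_init(struct toy_verbs *v)
{
    if (v->fork_tracking_on) {
        return 0;           /* already initialized, nothing to do */
    }
    if (v->too_late) {
        return EINVAL;      /* assumption: a late call is rejected */
    }
    v->fork_tracking_on = true;
    return 0;
}

/* Models ibv_dontfork_range(): without tracking, the range is lost. */
static void toy_dontfork_range(struct toy_verbs *v)
{
    if (!v->fork_tracking_on) {
        v->too_late = true;
    }
}

/* Models QP/CQ creation, which ends up marking a range as dontfork. */
static void toy_create_qp(struct toy_verbs *v)
{
    toy_dontfork_range(v);
}
```

Calling toy_create_qp() before toy_fork_init() leaves the model in the too_late state, which is why probing a transport first and enabling fork support afterwards does not work.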
adding @jgunthorpe
From the rdma-core Verbs API perspective, it should be legitimate to call ibv_fork_init() and expect that the relevant providers will support it, or that ibv_fork_init() will return an error. It's weird that ibv_fork_init() is successful, but later on, one of the providers calls abort() because it can't support it.
This is what we use to decide whether RC is supported or not:
https://github.com/openucx/ucx/blob/ad463435c014f18e98be04e2f4d54f27d15cf0f8/src/uct/ib/rc/verbs/rc_verbs_iface.c#L494
If ib_md->config.eth_pause is set, then we don't check the port link layer type and assume RC is supported. eth_pause is enabled by default through UCX_IB_ETH_PAUSE_ON=y:
#
# Whether or not 'Pause Frame' is enabled on an Ethernet network.
# Pause frame is a mechanism for temporarily stopping the transmission of data to
# ensure zero loss under congestion on Ethernet family computer networks.
# This parameter, if set to 'no', will disqualify IB transports that may not perform
# well on a lossy fabric when working with RoCE.
#
# syntax: <y|n>
#
UCX_IB_ETH_PAUSE_ON=y
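A minimal sketch of the check as described above, to show why EFA slips through. This is not the code at the linked line: the types and the function name are hypothetical, and the behavior with eth_pause disabled (only an InfiniBand link layer qualifies) is an assumption drawn from the parameter description.

```c
#include <stdbool.h>

/* Hypothetical link layer values; the real ones come from the port attributes. */
enum toy_link_layer { TOY_LINK_INFINIBAND, TOY_LINK_ETHERNET };

/* Hypothetical subset of the MD configuration. */
struct toy_md_config {
    bool eth_pause;  /* models UCX_IB_ETH_PAUSE_ON */
};

/* Returns true if RC would be assumed to be supported on this port. */
static bool toy_rc_assumed_supported(const struct toy_md_config *cfg,
                                     enum toy_link_layer link_layer)
{
    if (cfg->eth_pause) {
        return true;  /* link layer is not consulted at all */
    }
    /* Assumption: with pause frames off, Ethernet (RoCE) is disqualified. */
    return link_layer == TOY_LINK_INFINIBAND;
}
```

Neither branch asks whether the device can actually create an RC QP, so with the default setting any device is treated as RC-capable.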
UCX_IB_ETH_PAUSE_ON is no longer relevant and should be removed.
On the other hand, we check if DC is supported by creating a DCT QP, which requires creating a CQ, which allocates a doorbell record page, which calls ibv_dontfork_range() (in uct_ib_mlx5dv_check_dc()).
AFAIK the safe way is to call ibv_fork_init() before any other verbs function.
@yosefe This doesn't look like an rdma-core issue. The EFA provider looks like it will work with fork support, or at least it doesn't abort. The abort is coming from libfabric, and Gal once told me that is because the libfabric MR cache is incompatible with EFA and fork support, or something.
Jason is mostly right, in that it has to do with EFA's use of a registration cache in libfabric. However, to prevent silent data corruption, the libfabric EFA provider detects if ibv_fork_init() has been called and, under certain conditions, aborts with the error @SeyedMir posted. This will all change for the better in the not-too-distant future (see discussions in https://github.com/linux-rdma/rdma-core/pull/883). We are also working on some changes to the libfabric provider to handle this better until that kernel version becomes more mainstream.
This becomes a non-issue with Open MPI in light of https://github.com/open-mpi/ompi/pull/8496, but we should still make sure the RC QP detection is handled better in UCX. A simple ucx_info -d shows (incorrectly) that the EFA device can be used with the rc_verbs transport:
# Transport: rc_verbs
# Device: efa_0:1
# System device: 0000:10:1b.0 (0)
[1614659259.376522] [p4d-st-p4d24xlarge-1:2340 :0] rc_iface.c:499 UCX ERROR ibv_create_srq() failed: Operation not supported
# < failed to open interface >
Looks like https://github.com/openucx/ucx/pull/6883 fixes the RC QP detection issue I mentioned in the previous comment.
That's right. I created a separate issue https://github.com/openucx/ucx/issues/6882 for it because this issue is mainly about fork_init, so it should not be closed when https://github.com/openucx/ucx/pull/6883 is merged.
Describe the bug
When Open MPI is built with both UCX and OFI, it tries to load the UCX PML, and if UCX ends up not being used (because it did not load successfully, or because of priority), it causes libfabric (and hence the CM PML) to fail too. This is because UCX makes a call to ibv_fork_init(): https://github.com/openucx/ucx/blob/9c0104ec9a9863b863aa441690a8a935d7abb8a1/src/uct/ib/base/ib_md.c#L1529 On EC2 instances with EFA, this issue results in an Open MPI failure unless the user excludes the UCX PML.
Steps to Reproduce