
UCX issues with ibv_fork_init() and Open MPI 4.1.x #10436


mkre commented 2 years ago

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

v4.1.2

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Built from a source tarball.

Please describe the system on which you are running


Details of the problem

We are seeing problems with Open MPI 4.1.2 and UCX trying to call ibv_fork_init(). Originally, we were seeing occasional UCX warnings

ib_md.c:1161 UCX  WARN  IB: ibv_fork_init() was disabled or failed, yet a fork() has been issued.
ib_md.c:1162 UCX  WARN  IB: data corruption might occur when using registered memory.

which under some circumstances led to aborts/hangs. As expected, setting UCX_IB_FORK_INIT=yes instead of the default value of "try" leads to reproducible aborts. Interestingly, the issue does not appear when using Open MPI 4.0.3 instead of 4.1.2. A bit of digging around brought me to https://github.com/openucx/ucx/issues/686, where it says that UCX always uses ibv_fork_init(), but it can happen that

OpenMPI is [...] using verbs without calling ibv_fork_init(), and then UCX fails when it calls ibv_fork_init()

So, it seems like there can be issues when verbs functions are called before UCX tries to initialize with ibv_fork_init(). Using a debugger with a breakpoint at ibv_open_device() showed me that this function is indeed called before UCX calls ibv_fork_init(). In fact, it is called by BTL/ofi, which is new in Open MPI 4.1.x, which in turn explains why we are not seeing the same issue with Open MPI 4.0.3.
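
To make the ordering constraint concrete, here is a minimal standalone sketch (my own reconstruction against the libibverbs API, not code from Open MPI or UCX; it assumes the libibverbs headers, an IB-capable node, and builds with "cc demo.c -libverbs"):

    #include <stdio.h>
    #include <infiniband/verbs.h>

    /* Sketch of the initialization order described above. The comments map
     * the two steps onto the components from this issue; the code itself is
     * not taken from them. */
    int main(void)
    {
        int num = 0;

        /* Step 1 -- what BTL/ofi effectively triggers during MPI init:
         * the first verbs calls, opening a device. */
        struct ibv_device **devs = ibv_get_device_list(&num);
        struct ibv_context *ctx = NULL;
        if (devs != NULL && num > 0)
            ctx = ibv_open_device(devs[0]);

        /* Step 2 -- what UCX does afterwards: request fork safety.
         * Because verbs resources already exist, this comes too late;
         * libibverbs may fail the call or leave fork() unsafe, which is
         * exactly what the UCX warning quoted above complains about. */
        if (ibv_fork_init() != 0)
            fprintf(stderr, "ibv_fork_init() failed: fork() is not safe\n");

        if (ctx != NULL)
            ibv_close_device(ctx);
        if (devs != NULL)
            ibv_free_device_list(devs);
        return 0;
    }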

Since we rely on UCX and don't do one-sided communication, I don't think we are actually using BTL/ofi. Rather, it seems to be initialized during the Open MPI init process and then kept active. Disabling this component using -mca btl ^ofi avoids the UCX warnings as well as all downstream problems.
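
For completeness, this is how the workaround looks on the command line (process count and application name are placeholders):

    mpirun -mca btl ^ofi -np 8 ./my_app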

There are a number of other MCA components which are calling ibv_open_device() (e.g., BTL/usnic, MTL/ofi), but these do not lead to the same issues, possibly because they are not kept active during the application run.

While we can just disable BTL/ofi and should be all set for Open MPI 4.1.2, I am wondering about two things:

  1. How can we make sure that newly added MCA components do not introduce the same issue going forward?
  2. Could this be improved in future versions of Open MPI? Since UCX's ibv_fork_init() call only works if no verbs resources have been created yet, maybe there is a way to make sure UCX initializes first (if present), because it has the strictest requirements? Alternatively, how about a "common" MCA parameter to enable fork-safe IBV initialization for all MCA components, which would work regardless of the initialization order? (See also the note after this list.)
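
One more thought, based on my reading of the ibv_fork_init(3) man page (so an assumption on my part, not something the Open MPI docs promise for this case): libibverbs honors the IBV_FORK_SAFE / RDMAV_FORK_SAFE environment variables, which have the same effect as calling ibv_fork_init() before any other verbs call, independent of which component initializes first:

    export IBV_FORK_SAFE=1
    mpirun -np 8 ./my_app

If that holds, it would be a component-agnostic workaround regardless of initialization order.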

Thanks, Moritz

bartoldeman commented 2 years ago

I can just add a bit of perspective, not a real solution. We (Compute Canada) use a single Open MPI installation on multiple clusters: some have Omni-Path, many InfiniBand, some only Ethernet. For us it is simply a fact of life that you need to set various MCA environment variables to make it work properly (or you can set them in a configuration file; see the example below). Without them we sometimes hit issues like the one above, and sometimes just confusing warnings (e.g., with the old MXM MTL: even if only one MTL is used at runtime, every MTL still goes through an init phase, and MXM would spit out a warning on non-IB hardware before moving on).
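
For example, the same settings can go into $HOME/.openmpi/mca-params.conf (or the system-wide openmpi-mca-params.conf in the installation's etc directory), one name = value pair per line:

    btl = ^ofi
    mtl = ^ofi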

An added benefit is that this also improves startup times: if the relevant component is built as a DSO, that DSO isn't even loaded at runtime, and neither are its underlying libraries (libfabric.so in this instance, if both the ofi mtl and btl are DSOs).

So presently on IB clusters we explicitly disable OFI:

OMPI_MCA_btl='^ofi'
OMPI_MCA_mtl='^ofi'

But on Omni-Path/Ethernet clusters we explicitly disable UCX:

OMPI_MCA_osc='^ucx'
OMPI_MCA_pml='^ucx'

If those were separate Open MPI installations, however, I'd disable the unneeded components at configure time instead, along the lines of the sketch below.
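
For instance (a sketch, not our actual configure line; the component list would depend on the cluster):

    ./configure --enable-mca-no-build=btl-ofi,mtl-ofi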

mkre commented 2 years ago

@bartoldeman, thanks a lot for your input. Your usage scenario is pretty similar to ours. We are already applying quite a few of these manual MCA settings to prevent various issues (another one that just crossed my mind is https://github.com/openucx/ucx/issues/4866). While I understand that this kind of cross-cluster Open MPI deployment is not what most people are doing, I am wondering whether this particular problem could be prevented at the Open MPI level, because it doesn't just produce false warnings but actual intermittent failures, even though we are not using BTL/ofi. But maybe we just have to live with it, keep monitoring the changelogs, and carefully disable components on a case-by-case basis.