Open mkre opened 2 years ago
I can just add a bit of perspective, not a real solution. We (Compute Canada) use a single Open MPI installation on multiple clusters: some have omnipath, many infiniband, some only ethernet. For us it is just a fact of life that you need to set various MCA environment variables to make it work properly (or you could set them in a configuration file). Otherwise we sometimes have issues as above, and sometimes just confusing warnings (e.g. the old mxm mtl: even if only one mtl is used at runtime, it still went through an init phase and spat out a warning before moving on when running on non-IB hardware).
An added benefit is that this improves startup times, and if the relevant component is built as a DSO, that DSO isn't even loaded at runtime, and neither are the underlying libraries (libfabric.so in this instance, if both the ofi mtl and btl are DSOs).
So presently on IB clusters we explicitly disable OFI:
OMPI_MCA_btl='^ofi'
OMPI_MCA_mtl='^ofi'
But on omnipath/ethernet we explicitly disable UCX:
OMPI_MCA_osc='^ucx'
OMPI_MCA_pml='^ucx'
If those were separate Open MPI installations, however, I'd disable them at compile (configure) time.
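A minimal sketch of how the per-fabric settings above might be applied from a shared login or module script. The function name set_fabric_mca and the fabric labels it accepts are assumptions for illustration only; how a site detects its interconnect is not covered here.

```shell
#!/bin/sh
# Sketch only: pick MCA exclusions for one Open MPI install shared by
# several clusters. set_fabric_mca and its argument values are
# hypothetical names, not part of Open MPI.
set_fabric_mca() {
    case "$1" in
        infiniband)
            # IB clusters: exclude the OFI BTL and MTL.
            export OMPI_MCA_btl='^ofi'
            export OMPI_MCA_mtl='^ofi'
            ;;
        omnipath|ethernet)
            # omnipath/ethernet clusters: exclude the UCX OSC and PML.
            export OMPI_MCA_osc='^ucx'
            export OMPI_MCA_pml='^ucx'
            ;;
        *)
            echo "set_fabric_mca: unknown fabric '$1'" >&2
            return 1
            ;;
    esac
}
```

Calling set_fabric_mca infiniband before mpirun would then have the same effect as exporting the two OFI exclusions by hand.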
@bartoldeman, thanks a lot for your input. Your usage scenario is pretty similar to ours. We are already applying quite a few of these manual MCA settings to prevent different issues (another one which just crossed my mind is https://github.com/openucx/ucx/issues/4866). While I understand that cross compiling Open MPI is not what most people are doing, I am wondering if this one is something which could be prevented at the Open MPI level, because it doesn't only lead to false warnings but to actual intermittent problems, even though we are not using BTL/ofi. But maybe we just have to live with it, keep monitoring the changelogs, and carefully disable components on a case-by-case basis.
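Manual MCA settings like the ones discussed here can also be kept in an MCA parameter file instead of the environment. The sketch below writes such a file; the helper name write_mca_params is hypothetical, and the file contents simply mirror the IB-cluster exclusions quoted earlier in the thread. Open MPI reads per-user parameters from $HOME/.openmpi/mca-params.conf, so that path is the assumed target in the usage note.

```shell
#!/bin/sh
# Sketch only: write an Open MPI MCA parameter file.
# write_mca_params is a hypothetical helper; pass the target path as $1.
write_mca_params() {
    conf_file=$1
    # Create the parent directory if it does not exist yet.
    mkdir -p "$(dirname "$conf_file")" || return 1
    # One "name = value" pair per line; lines starting with # are comments.
    cat > "$conf_file" <<'EOF'
# Exclude the OFI components (IB cluster case from this thread)
btl = ^ofi
mtl = ^ofi
EOF
}
```

For a per-user setup this could be invoked as write_mca_params "$HOME/.openmpi/mca-params.conf". Note that this overwrites any existing file at that path.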
Background information
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
v4.1.2
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Built from a source tarball.
Please describe the system on which you are running
Details of the problem
We are seeing problems with Open MPI 4.1.2 and UCX trying to call ibv_fork_init(). Originally, we were seeing occasional UCX warnings which under some circumstances led to aborts/hangs. As expected, setting UCX_IB_FORK_INIT=yes instead of the default UCX_IB_FORK_INIT=try leads to reproducible aborts. Interestingly, the issue does not appear when using Open MPI 4.0.3 instead of 4.1.2. A bit of digging around brought me to https://github.com/openucx/ucx/issues/686, where it says that UCX is always using ibv_fork_init() but it could happen that
So, it seems like there might be issues when verbs functions are called before UCX tries to initialize with ibv_fork_init(). Using a debugger with a breakpoint at ibv_open_device() showed me that this function is indeed called before UCX calls ibv_fork_init(). In fact, it is called by BTL/ofi, which is new in Open MPI 4.1.x, which in turn explains why we are not seeing the same issue with Open MPI 4.0.3.
Since we are relying on UCX and don't do one-sided communication, I don't think we are actually using BTL/ofi. It rather seems like it is only being initialized during the Open MPI init process and then kept active. Disabling this component using -mca btl ^ofi avoids the UCX warnings as well as all downstream problems.
There are a number of other MCA components which are calling ibv_open_device() (e.g., BTL/usnic, MTL/ofi), but these do not lead to the same issues, possibly because they are not kept active during the application run.
While we can just disable BTL/ofi and should be all set for Open MPI 4.1.2, I am wondering about two things:
ibv_fork_init() which requires no non-fork-safe IBV component to be active, maybe there is a way to make sure UCX goes first (if present) because it has the highest requirements? Alternatively, how about a "common" MCA parameter to enable fork-safe IBV initialization for all MCA components, which should always work regardless of the initialization order?
Thanks, Moritz
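For reference, the workaround described in this report (excluding BTL/ofi at launch time) could be wrapped roughly as follows. This is a sketch under assumptions: run_without_btl_ofi and the MPIRUN override variable are hypothetical names introduced here, and the application path in the usage note is a placeholder.

```shell
#!/bin/sh
# Sketch only: launch an MPI job with BTL/ofi excluded.
# run_without_btl_ofi and the MPIRUN override are hypothetical helpers;
# MPIRUN defaults to the mpirun found in PATH.
run_without_btl_ofi() {
    # '^ofi' excludes only the ofi BTL; the caret is quoted so that no
    # shell interprets it. All remaining arguments are passed through.
    "${MPIRUN:-mpirun}" --mca btl '^ofi' "$@"
}
```

A call such as run_without_btl_ofi -n 4 ./my_app would then start the job without BTL/ofi, which is equivalent to setting OMPI_MCA_btl='^ofi' in the environment.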