BKitor opened this issue 3 years ago
How are you launching the application (i.e., using "mpirun" or some other launcher)?
mpirun
There's not quite enough information here to help diagnose what is happening. Can you supply all the diagnostic information requested from https://www.open-mpi.org/community/help/?
In particular, you mentioned that you're working on a home-grown MCA component. Can you describe the type / function of that component?
Here's a GitHub Gist with the build info
I'm working on a component for the collective framework. It's dependent on information from Hwloc and the base collective component, but shouldn't mess with MPI_Init().
I guess the initial question would be: can you "mpirun" anything with the v4.1.0 version? Like "hostname" and a simple MPI "hello" program?
I have been able to run OSU Microbenchmarks just fine (mpirun -n 4 osu_allreduce). This issue is specific to Horovod. I can launch Horovod with mpirun (mpirun -n 4 python horovod_script.py) and print to stdout, but as soon as I call hvd.init() it breaks.
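Roughly, the failing script does nothing more than this (a minimal sketch; the file name and the framework binding are illustrative, not my exact script):

```python
# Hypothetical minimal reproducer: the print before hvd.init() reaches
# stdout on every rank, then hvd.init() (which is where MPI_Init() ends
# up being called) fails with the orte_ess_init error.
import horovod.torch as hvd  # horovod.tensorflow shows the same behaviour here

print("before hvd.init()")
hvd.init()
print("rank", hvd.rank(), "of", hvd.size())
```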
From what I understand of the two applications, OMB is a single-threaded C program that enters at main(), runs a function a couple thousand times, and exits. Horovod, on the other hand, starts as a Python script and opens a .so file with ctypes; the library spawns a thread with std::thread, and that thread then calls MPI_Init().
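As a rough sketch of that call pattern (not Horovod's actual code; the library name and the direct MPI_Init() call are assumptions for illustration):

```python
# Illustration of the described pattern: a ctypes-loaded MPI library is
# initialized from a background thread rather than the main thread.
import ctypes
import threading

def init_in_background():
    # "libmpi.so" is an assumption for the sketch; Horovod loads its own
    # .so, which calls MPI_Init() internally.
    libmpi = ctypes.CDLL("libmpi.so", mode=ctypes.RTLD_GLOBAL)
    libmpi.MPI_Init(None, None)  # the MPI standard allows NULL argc/argv

t = threading.Thread(target=init_in_background)
t.start()
t.join()
```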
Hmmm...well, if you build OMPI v4.1.0 with --enable-debug and then add --mca ess_base_verbose 10 to the mpirun cmd line, you should get some output that might help understand why the proc is failing to initialize.
Here's a gist, launched with mpirun -n 2 --mca btl ^openib --mca ess_base_verbose 10 --bind-to core --report-bindings python $SCRATCH/test_hv.py
Sorry this fell off of our radar. It looks like there are some missing libraries on at least one of your backend nodes. The first couple of procs are able to properly initialize, but the others fail due to a lack of PMIx support. Check your backend LD_LIBRARY_PATH and the OMPI install location on those nodes and ensure there isn't some confusion - e.g., the backend using an earlier version of OMPI, or not having a complete install of the OMPI lib directory tree.
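One way to compare the backend environments without involving MPI at all is to launch a trivial Python script on every node and compare the output (a sketch; the file name and the mpirun options are just an example):

```python
# check_env.py (hypothetical name); run with something like
#   mpirun -n 2 --map-by node python check_env.py
# and compare the lines across nodes. A stale LD_LIBRARY_PATH or a
# partial OMPI install usually shows up as an obvious mismatch.
import os
import shutil
import socket

print(socket.gethostname(),
      "LD_LIBRARY_PATH=", os.environ.get("LD_LIBRARY_PATH", "<unset>"),
      "orted=", shutil.which("orted"),
      "mpirun=", shutil.which("mpirun"))
```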
I'm not sure it's the same issue, but Google points here if you search for orte_ess_init failed, so in case it helps anyone: I got this orte_ess_init failed error after trying to run an executable compiled with OpenMPI 4.1.2 using OpenMPI 5.0.3.
So if you're seeing this, double-check that you have the right module loaded, and/or that you recompiled your program.
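A quick way to catch that kind of mismatch (a sketch only; the executable name is a placeholder):

```python
# Compare the Open MPI that "mpirun" resolves to at run time with the
# libmpi the executable was linked against; "./my_app" is a placeholder.
import subprocess

print(subprocess.run(["mpirun", "--version"],
                     capture_output=True, text=True).stdout)
print(subprocess.run(["bash", "-c", "ldd ./my_app | grep -i mpi"],
                     capture_output=True, text=True).stdout)
```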
Background information
I'm trying to run Open MPI with Horovod and it's breaking during MPI_Init(). I think it's something to do with PMI.
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
openmpi-4.1.0rc2
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Tarball distribution from open-mpi.org. I did rerun perl autogen.pl in order to pick up an MCA component I'm working on.
Please describe the system on which you are running
GCC 8.3, CUDA/10.1,
Details of the problem
I'm trying to get Horovod (a deep learning tool built on Python + MPI) to run with OpenMPI, and something is breaking during MPI_Init(). I want to say it's something to do with the PMI layer.
I've been able to run single-threaded programs from OSU_microbenchmarks without any issues. Each Horovod process spawns a background thread, and it's those threads that are responsible for calling MPI_Init(); I think my issue has something to do with that.
There is a pre-installed copy of OpenMPI 3.2.1 on the cluster which runs Horovod without any issues. I've tried linking against the libpmi.so it uses, but it still doesn't work.
I've pasted the error message I get below.