open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

orte_ess_init failed while trying to run Horovod application #8193

BKitor opened this issue 3 years ago

BKitor commented 3 years ago

Background information

I'm trying to run Open MPI with Horovod and it's breaking during MPI_Init(). I think it has something to do with PMI.

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

openmpi-4.1.0rc2

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Tarball distribution from open-mpi.org. I did rerun perl autogen.pl in order to pick up an MCA component I'm working on.

Please describe the system on which you are running

GCC 8.3, CUDA 10.1


Details of the problem

I'm trying to get Horovod (a deep learning tool built on Python + MPI) to run with Open MPI, and something is breaking during MPI_Init(). I suspect it has something to do with the PMI layer.

I've been able to run programs from the OSU microbenchmarks, which are single-threaded, without any issues. Each Horovod process spawns a background thread, and it's that thread which calls MPI_Init(); I think my issue has something to do with that.

There is a pre-installed copy of Open MPI 3.2.1 on the cluster which runs Horovod without any issues. I've tried linking against the libpmi.so it uses, but that still doesn't work.

I've pasted the error message I get below.

--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_init failed
  --> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: ompi_rte_init failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)

rhc54 commented 3 years ago

How are you launching the application (i.e., using "mpirun" or some other launcher)?

BKitor commented 3 years ago

How are you launching the application (i.e., using "mpirun" or some other launcher)?

mpirun

jsquyres commented 3 years ago

There's not quite enough information here to help diagnose what is happening. Can you supply all the diagnostic information requested from https://www.open-mpi.org/community/help/?

In particular, you mentioned that you're working on a home-grown MCA component. Can you describe the type / function of that component?

BKitor commented 3 years ago

Here's a GitHub Gist with the build info.

I'm working on a component for the collective framework. It depends on information from hwloc and the base collective component, but it shouldn't interfere with MPI_Init().

rhc54 commented 3 years ago

I guess the initial question would be: can you "mpirun" anything with the v4.1.0 version? Like "hostname" and a simple MPI "hello" program?
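
For reference, a minimal MPI "hello" program of the kind suggested here might look like the sketch below; it only uses the standard MPI C API and is not tied to this particular build.

```cpp
// hello.cpp -- minimal MPI sanity check.
// Build: mpicxx hello.cpp -o hello
// Run:   mpirun -n 4 ./hello
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    std::printf("hello from rank %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}
```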

BKitor commented 3 years ago

I have been able to run the OSU Microbenchmarks just fine (mpirun -n 4 osu_allreduce). This issue is specific to Horovod: I can launch Horovod with mpirun (mpirun -n 4 python horovod_script.py) and print to stdout, but as soon as I call hvd.init() it breaks.

From what I understand of the two applications, OMB is a single-threaded C program that enters at main(), runs a function a couple thousand times, and exits. Horovod, on the other hand, starts as a Python script which opens a .so file with ctypes; the library spawns a thread with std::thread, and that thread calls MPI_Init().
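
A minimal reproducer for that call pattern might look something like the sketch below (assuming a request for MPI_THREAD_MULTIPLE, which is what Horovod typically asks for); it is only an illustration of the pattern described above, not Horovod's actual code.

```cpp
// threaded_init.cpp -- mimics the pattern described above: MPI is initialized
// from a background thread rather than from main().
// Build: mpicxx -std=c++11 threaded_init.cpp -o threaded_init
// Run:   mpirun -n 2 ./threaded_init
#include <mpi.h>
#include <cstdio>
#include <thread>

int main() {
    std::thread t([] {
        int provided = 0;
        // Assumption: like Horovod, request MPI_THREAD_MULTIPLE. Passing NULL
        // for argc/argv is permitted by the MPI standard.
        MPI_Init_thread(nullptr, nullptr, MPI_THREAD_MULTIPLE, &provided);

        int rank = 0, size = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        std::printf("rank %d of %d initialized, provided level = %d\n",
                    rank, size, provided);

        MPI_Finalize();
    });
    t.join();
    return 0;
}
```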

rhc54 commented 3 years ago

Hmmm...well, if you build OMPI v4.1.0 with --enable-debug and then add --mca ess_base_verbose 10 to the mpirun cmd line you should get some output that might help understand why the proc is failing to initialize.

BKitor commented 3 years ago

Here's a gist, launched with mpirun -n 2 --mca btl ^openib --mca ess_base_verbose 10 --bind-to core --report-bindings python $SCRATCH/test_hv.py

rhc54 commented 3 years ago

Sorry this fell off of our radar. It looks like there are some missing libraries on at least one of your backend nodes. The first couple of procs are able to properly initialize, but the others fail due to a lack of PMIx support. Check your backend LD_LIBRARY_PATH and the OMPI install location on those nodes and ensure there isn't some confusion - e.g., the backend using an earlier version of OMPI, or not having a complete install of the OMPI lib directory tree.

LourensVeen commented 2 months ago

I'm not sure it's the same issue, but Google points here if you search for orte_ess_init failed, so in case it helps anyone: I got this orte_ess_init failed error after trying to run an executable compiled with OpenMPI 4.1.2 using OpenMPI 5.0.3.

So if you're seeing this, double-check that you have the right module loaded, and/or that you recompiled your program.
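
If it helps anyone hitting this kind of mismatch: one way to confirm which MPI library an executable actually picks up at run time is to print the library version string with MPI_Get_library_version (standard MPI-3 API). A minimal sketch, not specific to the builds mentioned above:

```cpp
// which_mpi.cpp -- print which MPI library this binary resolves at run time.
// MPI_Get_library_version may be called before MPI_Init, so this works even
// when initialization itself is what fails.
// Build: mpicxx which_mpi.cpp -o which_mpi
// Run:   ./which_mpi
#include <mpi.h>
#include <cstdio>

int main() {
    char version[MPI_MAX_LIBRARY_VERSION_STRING];
    int len = 0;
    MPI_Get_library_version(version, &len);  // e.g. "Open MPI v4.1.0, ..."
    std::printf("%s\n", version);
    return 0;
}
```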