hpc / charliecloud

Lightweight user-defined software stacks for high-performance computing.
https://hpc.github.io/charliecloud
Apache License 2.0

MPI Slurm interaction #1010

Closed DavidBrayford closed 3 years ago

DavidBrayford commented 3 years ago

Hi,

I've been trying to execute an MPI program with OpenMPI across several nodes of an HPC system running Slurm (allocated via salloc), from within a container, using the following command:

ch-run -w ./test_mpi_image -- mpiexec -n 2 /ALPACA

(This command executes successfully when not using Slurm)

And I get the following error:

The SLURM process starter for OpenMPI was unable to locate a usable "srun" command in its path. Please check your path and try again.


An internal error has occurred in ORTE:

[[56714,0],0] FORCE-TERMINATE AT (null):1 - error plm_slurm_module.c(471)

This is something that should be reported to the developers.

Do you have any documentation on how to configure Slurm to avoid this error?

Would binding the system Slurm executables and libraries into the container using the -b option resolve this issue?

As I am running on a production system, I am limited in what experimentation I can do.

heasterday commented 3 years ago

Hello David,

There are two main ways to launch MPI applications with Charliecloud, which we refer to as either a "host" or a "guest" launch. A host launch is where the parallel launcher is used to launch multiple containers (that usually join together into a shared namespace). A guest launch is where the parallel launcher is used within the container to launch the application. They take roughly the following forms:
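Roughly, using the image and binary from your example, the two forms look like this (just a sketch; node and rank counts will vary with your site):

```
# Host launch: the host's parallel launcher (here Slurm's srun) starts one
# ch-run container per rank, and the app inside each container is an MPI rank.
srun -n 2 ch-run -w ./test_mpi_image -- /ALPACA

# Guest launch: a single container is started, and the MPI launcher inside
# the image starts the ranks.
ch-run -w ./test_mpi_image -- mpiexec -n 2 /ALPACA
```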

A limitation of the guest launch approach is that it is usually single-node only, because the parallel launcher within the container doesn't know how to launch containers on other nodes.

As for your error, I believe what you are running into is that the MPI install within the container is seeing Slurm variables in your environment, which make it think it can use Slurm mechanisms to launch processes. To work around this we suggest folks add --unset-env=SLURM* for guest launches. So something like this: ch-run -w --unset-env=SLURM* ./test_mpi_image -- mpiexec -n 2 /ALPACA
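In a real shell you will usually want to quote the glob so it reaches ch-run unexpanded, i.e. something like:

```
# Guest launch with Slurm's environment variables removed, so the Open MPI
# mpiexec inside the container doesn't try to use the Slurm launcher.
ch-run -w --unset-env='SLURM*' ./test_mpi_image -- mpiexec -n 2 /ALPACA
```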

That being said, you mentioned you want to run across several nodes, so I would recommend a host launch. Using the example you provided, this would look something like: srun -N 2 -n 2 ch-run -w ./test_mpi_image -- /ALPACA. Please note that for this to work, the MPI install in the container needs to be PMI-aware; our example Dockerfile may be useful towards this end. Please also note that once you are running more than one container per node, you will likely want to use the --join flag to prevent MPI failures.
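For example, with more than one rank per node that might look roughly like this (the counts here are illustrative):

```
# 2 nodes, 4 ranks total, i.e. 2 containers per node; --join makes the
# containers on the same node share namespaces so on-node MPI communication
# (e.g. shared memory) keeps working.
srun -N 2 -n 4 ch-run --join -w ./test_mpi_image -- /ALPACA
```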

Let me know if this does/doesn't help 😃

DavidBrayford commented 3 years ago

Hi Heasterday,

I am unable to execute srun correctly, due to the host system's Slurm/Munge configuration. The application sort of completes, but a lot of Slurm, Munge, and MPI error messages are generated, so I am not confident the results are valid.

I normally execute mpiexec -n 2 ch-run -w ./test_mpi_image -- /executable, but the problem is that if I ask for e.g. 2 ranks, my application is run twice with a single rank each, i.e. I get twice the output folders/data (and not one twice-as-fast application).

I need MPI to spread my application on different nodes. Not just for compute parallelism, but also due to the RAM requirements.

Any suggestions on how to distribute my containerized application across multiple nodes?

David

heasterday commented 3 years ago

Do you believe the Slurm/Munge/MPI errors are an incompatibility between the container MPI and the host MPI, or are they present for non-containerized applications as well? If it's the former I may be able to give you some things to try.

Using mpirun/mpiexec to launch a container on every node is more complex because you won't have the PMI compatibility layer and so the host MPI install will need to be as close to the container install as possible.

What outcome do you get if you launch with the following command line: mpiexec -n 2 --map-by node ch-run --unset-env=SLURM* -w ./test_mpi_image -- /executable?

Could you point me to a Dockerfile for how you built MPI for your image?

DavidBrayford commented 3 years ago

Here is the original Dockerfile that my colleague created, which I modified to include libpmi:

```
FROM ubuntu:latest AS buildstage
ENV DEBIAN_FRONTEND=noninteractive
COPY Alpaca /AlpacaCode
WORKDIR /AlpacaCode
RUN apt-get update && apt-get install -y build-essential make apt-utils cmake g++ \
    libpmi0-dev libpmi-pmix-dev libpmi2-0 libpmi2-0-dev libopenmpi-dev openmpi-bin \
    libhdf5-openmpi-dev mlocate
RUN mkdir build && cd build && cmake -DDIM=1 .. && make

FROM ubuntu:latest AS runstage
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update \
    && apt-get -y install build-essential make apt-utils cmake g++ libpmi0-dev \
       libpmix-dev libpmi2-0 libpmi2-0-dev libopenmpi-dev openmpi-bin mlocate \
       libhdf5-openmpi-103 mlocate

USER runner

COPY --from=buildstage /AlpacaCode/build/ALPACA .
COPY --from=buildstage /AlpacaCode/inputfile.xml .
RUN mkdir -p /lrz/sys
RUN mkdir -p /dss/dsshome1
```

I am in the process of following the openMPI example Dockerfile from the git repo, but have encountered a few cmake errors, which I am looking into.

I was able to execute ch-run .... mpiexec .... successfully without Slurm, but when I tried it on a system with Slurm using salloc I got lots of Slurm errors.

David

heasterday commented 3 years ago

Thank you for the example Dockerfile; I will build a simple MPI app with it and see what it takes to run on our systems.

Something to note about our example Dockerfile: it assumes that our CentOS 8 Dockerfile is being used as its base. The big things we do in that image are installing general dependencies and adding to the search path for the linker.
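In case it is useful, the intended build order is roughly the following (a sketch; the Dockerfile names and tags here are assumptions about the example layout, and building with Docker instead would be analogous):

```
# Build the CentOS 8 base image first, then the OpenMPI example on top of it.
ch-image build -t centos8 -f Dockerfile.centos8 .
ch-image build -t openmpi -f Dockerfile.openmpi .
```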

I will let you know what I find from testing with your Dockerfile.

heasterday commented 3 years ago

Using the provided Dockerfile I was able to build Intel's IMB benchmark and run it across two nodes on our Slurm cluster.

Some things to note:

David, please look at the errors I provided and their workarounds and let me know if they are relevant to your environment. NOTE: I didn't evaluate performance at all because this is likely very site dependent.

DavidBrayford commented 3 years ago

I couldn't resolve the Slurm issues with OpenMPI, so I tried MPICH (Intel MPI) instead, and I no longer get the Slurm errors. However, I am now getting errors related to the bootstrap proxies.

I am using the system version of MPI via binding, and I get the same problem even if I execute mpiexec -n 2 ch-run -w image_mpich -- /executable:

[mpiexec@cm2devel] check_exit_codes (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:117): unable to run bstrap_proxy on cm2devel (pid 6559, exit code 256)
[mpiexec@cm2devel] poll_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:159): check exit codes error
[mpiexec@cm2devel] HYD_dmx_poll_wait_for_proxy_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:212): poll for event error
[mpiexec@cm2devel] HYD_bstrap_setup (../../../../../src/pm/i_hydra/libhydra/bstrap/src/intel/i_hydra_bstrap.c:772): error waiting for event
[mpiexec@cm2devel] main (../../../../../src/pm/i_hydra/mpiexec/mpiexec.c:1938): error setting up the boostrap proxies

The system has InfiniBand and I've tried explicitly setting the UCX_TLS parameters, but I still get the same error.
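For reference, the bind-style launch looks roughly like this (a sketch; the paths are borrowed from the mkdir lines in the earlier Dockerfile, and the real bind targets depend on where the host Intel MPI actually lives):

```
# Bind host software and home trees into matching mount points in the image
# so the host Intel MPI libraries are visible inside the container.
mpiexec -n 2 ch-run -w \
    -b /lrz/sys:/lrz/sys \
    -b /dss/dsshome1:/dss/dsshome1 \
    image_mpich -- /executable
```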

Do I need to install the InfiniBand drivers inside the container? What configuration options do you recommend setting?

David

heasterday commented 3 years ago

Could I get a copy of the Slurm errors you were getting so I can look into them? Also, were these errors generated using our example OpenMPI base or the Dockerfile you provided? I would be very interested in the errors from both for comparison.

Typically we recommend, where possible, building the MPI install in the container with the desired communication library (UCX, Libfabric, etc.) and all of its dependencies. My guess is that something required by the libraries you are binding in is missing.
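In practice that means something roughly like the following inside the image build (a sketch only; the prefixes, the UCX location, and which PMI flavour to use are assumptions that depend on the site):

```
# Configure Open MPI inside the image against UCX and a PMI library, so that
# both a host launch (srun + PMI) and a guest launch (mpiexec) have what they need.
./configure --prefix=/usr/local \
            --with-ucx=/usr/local \
            --with-pmi=/usr
make -j"$(nproc)" && make install
```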

A shot in the dark, would it be possible for me to get a guest account on some platform with a similar configuration? The thought is that I could then test if something needs to be done differently at build/runtime for an image in your environment vs ours.

DavidBrayford commented 3 years ago

The OpenMPI version gave errors about being unable to find libpmi.so.1, which is located on the host in /usr/lib64; normally I would bind that directory into the container. I could create a new directory on the host and populate it with symlinks, but I don't want to go down that path if possible.

Unfortunately, I can't provide access to the system.

The system's preferred MPI is Intel MPI and OpenMPI isn't well supported, so I want to focus on MPICH (Intel MPI versions 2019.7.217 and 2019.8.254). Also, UCX isn't installed on the system; it uses the I_MPI* environment settings.

Default settings from the module system:

```
I_MPI_HYDRA_BOOTSTRAP                  slurm
I_MPI_PLATFORM                         hsw
I_MPI_PIN_DOMAIN                       auto
I_MPI_FABRICS                          shm:ofi
I_MPI_HYDRA_BOOTSTRAP_EXEC_EXTRA_ARGS  --ntasks-per-node=1
I_MPI_HYDRA_BRANCH_COUNT               128
I_MPI_OFI_PROVIDER                     verbs
```

I've tried setting FI_PROVIDER=tcp and I_MPI_FABRICS=tcp but still getting the error:

[mpiexec@i22r07c05s06] check_exit_codes (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:117): unable to run bstrap_proxy on i22r07c05s06 (pid 20924, exit code 65280)
[mpiexec@i22r07c05s06] poll_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:159): check exit codes error
[mpiexec@i22r07c05s06] HYD_dmx_poll_wait_for_proxy_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:212): poll for event error
[mpiexec@i22r07c05s06] HYD_bstrap_setup (../../../../../src/pm/i_hydra/libhydra/bstrap/src/intel/i_hydra_bstrap.c:772): error waiting for event
[mpiexec@i22r07c05s06] main (../../../../../src/pm/i_hydra/mpiexec/mpiexec.c:1938): error setting up the boostrap proxies
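For reference, the kind of override being tried looks roughly like this (a sketch; it assumes the variables are exported on the host, which ch-run passes through to the container by default):

```
# Prefer a TCP transport over the verbs/OFI default to rule out fabric problems,
# then launch as before.
export FI_PROVIDER=tcp
export I_MPI_FABRICS=tcp
mpiexec -n 2 ch-run -w image_mpich -- /executable
```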

I will try it on another system and get back to you.

David

DavidBrayford commented 3 years ago

For clarification: if I execute mpiexec -n 2 ch-run container -- mpi_app, the container is replicated on both nodes and executes the same application twice, rather than distributing the single parallel application across the 2 nodes.

I am able to successfully execute mpiexec -n 2 ch-run container -- mpi_app, but it executes the job twice (the job replicates itself on nodes 1 and 2) rather than distributing it.

Is this correct?

heasterday commented 3 years ago

You could bind in libpmi from the host, but I would recommend that the container image have it already. On that note, the recommended way to inject a host install of libraries is ch-fromhost, which we currently use to inject cray-mpich. I can look into extending this functionality to Intel MPI (likely a very similar process) if that interests you?
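In rough terms that looks like the following (a sketch; check ch-fromhost --help on your version for the exact options):

```
# Inject the host's Cray MPI replacement libraries into the image
# (the case we support today):
ch-fromhost --cray-mpi ./test_mpi_image

# Or inject individual host files, e.g. the libpmi the OpenMPI build wanted:
ch-fromhost --path /usr/lib64/libpmi.so.1 ./test_mpi_image
```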

Re: mpiexec behavior:
That command uses the parallel launcher's standard mechanisms to start a process on each node (in this case a container); each container then performs its setup and execvp()s the mpi_app, and the MPI processes then follow their standard mechanisms to wire up. If something about the MPI wire-up fails, you can see two independent applications rather than one; this is not a behavior indicating success.

reidpr commented 3 years ago

We're trying out the new “Discussions” feature, so I am going to move this thread to that section. Please LMK if anything goes wrong.