open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

Getting stuck on MPI_Finalize() when using ULFM #11404

Open rcoacci opened 1 year ago

rcoacci commented 1 year ago

As discussed on the ULFM mailing list: https://groups.google.com/g/ulfm/c/2VRCwoEyj0M/m/0Dsf8OvZAAAJ

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

Main branch at 68395556ce

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Installed from git clone of main.

Please describe the system on which you are running


Details of the problem

I'm currently trying ULFM with Open MPI's main branch (specifically commit 68395556ce). When running on a single node everything works fine, but as soon as I add another node, the surviving processes get stuck in MPI_Finalize().

The test program I'm using is a variant (basically with more printf's) of https://github.com/ICLDisco/ulfm-testing/blob/master/tutorial/02.err_handler.c.
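
Roughly, the test looks like the sketch below. This is a minimal, illustrative variant of that tutorial's error-handler example, not the exact file I'm running; the rank that gets killed and the printf placement are arbitrary.

```c
#include <stdio.h>
#include <signal.h>
#include <mpi.h>
#include <mpi-ext.h>   /* ULFM extensions (MPIX_*) */

/* Error handler installed on MPI_COMM_WORLD: report the error and how many
 * ranks are known to have failed so far. */
static void verbose_errhandler(MPI_Comm *comm, int *err, ...) {
    int rank, size, nf, len;
    char errstr[MPI_MAX_ERROR_STRING];
    MPI_Group failed_group;

    MPI_Comm_rank(*comm, &rank);
    MPI_Comm_size(*comm, &size);
    MPI_Error_string(*err, errstr, &len);
    printf("Rank %d / %d: notified of error: %s\n", rank, size, errstr);

    /* Acknowledge the failures and count the acknowledged failed ranks. */
    MPIX_Comm_failure_ack(*comm);
    MPIX_Comm_failure_get_acked(*comm, &failed_group);
    MPI_Group_size(failed_group, &nf);
    printf("Rank %d / %d: %d rank(s) have failed so far\n", rank, size, nf);
    MPI_Group_free(&failed_group);
}

int main(int argc, char *argv[]) {
    int rank, size;
    MPI_Errhandler errh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    MPI_Comm_create_errhandler(verbose_errhandler, &errh);
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, errh);

    /* One rank simulates a crash; the others should survive and report it. */
    if (rank == size - 1) {
        printf("Rank %d: killing myself\n", rank);
        raise(SIGKILL);
    }

    /* Trigger communication so the survivors notice the failure. */
    MPI_Barrier(MPI_COMM_WORLD);

    printf("Rank %d: entering MPI_Finalize()\n", rank);
    MPI_Finalize();   /* this is where the hang shows up across nodes */
    printf("Rank %d: exited MPI_Finalize()\n", rank);
    return 0;
}
```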

The cluster in question is a production/development cluster with InfiniBand, GPUs, and Ethernet, but I didn't enable UCX in the Open MPI install (leaving CUDA enabled, as seen in ompi_info.txt), and it seems to be using tcp without problems. I tried forcing it to use the Ethernet interface (via btl_tcp_if_include) but got the same results. I'm running it through the cluster's Slurm installation (unfortunately it's 20.11.9; as you probably know, that's hard to change on a production cluster) using sbatch and the following mpirun command line:

mpirun --with-ft ulfm --display-comm --display-comm-finalize  err_handler
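
The job itself is submitted with a small sbatch wrapper along these lines (the node and task counts here are placeholders, not the exact values from the production job):

```bash
#!/bin/bash
#SBATCH --nodes=2                # placeholder: at least two nodes to hit the cross-node hang
#SBATCH --ntasks-per-node=4      # placeholder task count

# mpirun picks up the allocation from Slurm; the flags are the ones quoted above
mpirun --with-ft ulfm --display-comm --display-comm-finalize ./err_handler
```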

The --display-comm output confirms that it's using tcp for communication between nodes.

After some more testing, I found that disabling shared memory (with --mca btl ^sm) lets the surviving processes exit (no one gets stuck in MPI_Finalize()), but the job never finishes: prted/prterun/srun processes keep running, depending on the node.
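
For reference, that run was the same command line with the btl exclusion added, roughly:

```sh
mpirun --with-ft ulfm --mca btl ^sm --display-comm --display-comm-finalize ./err_handler
```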

So there may be two issues here: one related to the sm component, and another related to slurm/prted/prterun.

rcoacci commented 1 year ago

Attaching ompi_info. ompi_info.txt

daa4453 commented 8 months ago

FYI, the SC22 programs all work with these options, but only on ONE NODE:

OPTIONS="--with-ft ulfm --map-by :oversubscribe --mca btl tcp,self"
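
For example, I invoke them roughly like this (the process count and program name are illustrative):

```sh
OPTIONS="--with-ft ulfm --map-by :oversubscribe --mca btl tcp,self"
mpirun $OPTIONS -np 4 ./err_handler   # fine when the allocation is a single node
```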

Could it be that ULFM only works on one node?

That wasn't my understanding, but I can find no examples anywhere of anyone using ULFM on more than one node...

Perhaps someone from that project could take a look at this...

daa4453 commented 8 months ago

I should add that I tried --mca pml ob1 as well, with no change in behavior.

And I tried both 5.0.2 and the nightly source drop from 3-14, which I think includes the PMIx/PRRTE update.