rcoacci opened 1 year ago
Attaching ompi_info output: ompi_info.txt
FYI, the SC22 programs all work with these options, but only on ONE NODE:
OPTIONS="--with-ft ulfm --map-by :oversubscribe --mca btl tcp,self"
Could it be that ULFM only works on one node?
That wasn't my understanding, but I can find no examples anywhere of anyone using ULFM on more than one node...
Perhaps someone from that project would take a look at this...
I should add that I tried --mca pml ob1 as well, with no change in behavior.
And I tried with 5.0.2 and a nightly source drop from 3-14, which I think includes the PMIx/PRRTE update.
As discussed on the ULFM mailing list: https://groups.google.com/g/ulfm/c/2VRCwoEyj0M/m/0Dsf8OvZAAAJ
Background information
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
Main branch at 68395556ce
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Installed from git clone of main.
Please describe the system on which you are running
Details of the problem
I'm currently trying ULFM from Open MPI's main branch (specifically commit 68395556ce). Running on a single node, everything works fine, but as soon as I add another node, the surviving processes get stuck in MPI_Finalize().
The test program I'm using is a variant (basically with more printfs) of https://github.com/ICLDisco/ulfm-testing/blob/master/tutorial/02.err_handler.c.
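For context, the pattern in that tutorial example boils down to roughly the sketch below (a paraphrase, not the exact file; it assumes the ULFM extensions MPIX_Comm_failure_ack / MPIX_Comm_failure_get_acked from <mpi-ext.h>): an error handler on MPI_COMM_WORLD reports failures instead of aborting, one rank kills itself, and the survivors hit the failure in a collective before reaching MPI_Finalize().

/* Rough sketch of the tutorial's error-handler pattern (paraphrased). */
#include <stdio.h>
#include <signal.h>
#include <mpi.h>
#include <mpi-ext.h>   /* ULFM extensions (MPIX_Comm_failure_ack, ...) */

static void verbose_errhandler(MPI_Comm *comm, int *err, ...)
{
    int rank, size, nfailed, len;
    char errstr[MPI_MAX_ERROR_STRING];
    MPI_Group failed_group;

    MPI_Comm_rank(*comm, &rank);
    MPI_Comm_size(*comm, &size);
    MPI_Error_string(*err, errstr, &len);

    /* Acknowledge the failure and ask which processes are gone. */
    MPIX_Comm_failure_ack(*comm);
    MPIX_Comm_failure_get_acked(*comm, &failed_group);
    MPI_Group_size(failed_group, &nfailed);

    printf("Rank %d/%d got error '%s': %d process(es) failed\n",
           rank, size, errstr, nfailed);
    MPI_Group_free(&failed_group);
}

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Errhandler errh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Report failures through the handler instead of aborting the job. */
    MPI_Comm_create_errhandler(verbose_errhandler, &errh);
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, errh);

    /* One victim dies; the survivors notice during the barrier. */
    if (rank == size - 1)
        raise(SIGKILL);
    MPI_Barrier(MPI_COMM_WORLD);

    printf("Rank %d reached MPI_Finalize\n", rank);
    MPI_Finalize();
    return 0;
}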
The cluster in question is a production/development cluster with InfiniBand, GPUs, and Ethernet, but I didn't enable UCX in the Open MPI install (leaving CUDA enabled, as seen in ompi_info.txt), and it seems to be using TCP without problems. I tried forcing it to use the Ethernet interface (via btl_tcp_if_include) but got the same results. I'm running it through the cluster's Slurm installation (unfortunately it's 20.11.9; as you probably know, that's hard to change on a production cluster) using sbatch and the following mpirun command line:
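(The literal command from the job script didn't get pasted here; a plausible reconstruction from the flags mentioned in this thread, with the binary name and node/task counts as placeholders, looks like this:)

#!/bin/bash
#SBATCH --nodes=2              # illustrative; the problem appears as soon as a second node is used
#SBATCH --ntasks-per-node=4    # illustrative

# Flags as discussed in this thread; ./err_handler stands in for the actual test binary.
mpirun --with-ft ulfm \
       --map-by :oversubscribe \
       --mca btl tcp,self \
       --display-comm \
       ./err_handler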
The --display-comm output confirms that it's using TCP for communication between the nodes.
After some more testing I found that disabling shared memory (with --mca btl ^sm) makes the surviving processes exit (none get stuck in MPI_Finalize()), but the job never finishes, and prted/prterun/srun processes keep running, depending on the node.
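Concretely, that variant is the same launch but with the shared-memory BTL excluded instead of the explicit tcp,self selection (same placeholders as above):

mpirun --with-ft ulfm --map-by :oversubscribe --mca btl ^sm --display-comm ./err_handler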
So it seems there may be two issues here: one related to the sm component, and the other related to Slurm/prted/prterun.