Open goduck777 opened 3 years ago
@dmitrygx is this still relevant, or can we close it?
yes, we can close it
Ah, we still have a problem: https://github.com/openucx/ucx/issues/7316#issuecomment-912972449. We need to investigate it.
Background information
Calling MPI_Comm_spawn from a job with multiple MPI processes can lead to an error.
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
I have tried v4.1.x, v5.0.x, and master. For UCX, I have tried 1.10 and 1.11.
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
From a source tarball of a daily snapshot.
Please describe the system on which you are running
Details of the problem
With the latest versions of OpenMPI and UCX, MPI_Comm_spawn can lead to a UCX error when OpenMPI is run with np larger than 1.
One can run a simple test to reproduce this. This is from #6833.
Parent code:
Worker code:
Makefile:
If I run
mpirun --oversubscribe -n 1 ./main
I got the correct result. If I run
mpirun --oversubscribe -n 2 ./main
I got the following error.
This works in the old OpenMPI (4.0.x) with UCX 1.8.