Closed wzamazon closed 3 years ago
I did some investigation and found that the cause of this error is that, when osc/rdma sets up shared memory access, the shm filename is not unique across communicators.
Background: test_dan1 splits the whole world into multiple communicators, each with 2 ranks.
osc/rdma sets up shared memory between ranks of the same communicator on the same node: one rank per communicator per node (called the local leader) creates a shm file, and the other ranks in the same communicator on the same node attach to that shm file.
The shm file name is:
<directory>/osc_rdma.<nodename>.<job_id>.<ompi_comm_get_cid>
(which happened here)
Apparently, ompi_comm_get_cid is not unique: the communicators generated by the same MPI_Comm_split call all have the same ompi_comm_get_cid. Therefore, ranks in different communicators end up using the same shm file name, which causes the problem.
As a test, I replaced ompi_comm_get_cid with the pid of the local leader, which fixed the issue.
I wonder whether it is expected that ompi_comm_get_cid returns the same cid for different communicators. If it is expected, I think we should use the local leader's pid in the shm file name; otherwise, I can dig into it more.
The cid is unique from a process's point of view. This means that two communicators with different participants can have the same cid. Your fix (adding the pid of the local leader) makes them unique at the node level, because the same leader cannot be part of two distinct communicators with the same cid.
I see. Thanks!
Because the shm filename must be unique at the node level, I think using the pid is more appropriate than using the cid.
I will open a PR to fix it then.
Merged and backported.
Thank you for taking the time to submit an issue!
Background information
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
master branch
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
compiled from source
If you are building/installing from a git clone, please copy-n-paste the output from
git submodule status.
409d89fd2c6ad47e6cab4d4f18e72dd3e0af2e70 3rd-party/openpmix (v1.1.3-3138-g409d89fd)
8622c287399f3a22102b998aee098d97a621bdbe 3rd-party/prrte (psrvr-v2.0.0rc1-4018-g8622c28739)
Please describe the system on which you are running
Details of the problem
Please describe, in detail, the problem that you are having, including the behavior you expect to see, the actual behavior that you are seeing, steps to reproduce the problem, etc. It is most helpful if you can attach a small program that a developer can use to reproduce your problem.
Note: If you include verbatim output (or a code block), please use a GitHub Markdown code block like below:
Running
ompi-tests/onesided/test_dan1
will fail on a single node when comm_size >= 4.
Command line:
Error message:
The same test passes when the comm size is 2.