open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

master branch ompi-tests/onesided/test_dan1 fails on single node when comm size >= 4 #9492

Closed wzamazon closed 3 years ago

wzamazon commented 3 years ago

Thank you for taking the time to submit an issue!

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

master branch

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

compiled from source

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

409d89fd2c6ad47e6cab4d4f18e72dd3e0af2e70 3rd-party/openpmix (v1.1.3-3138-g409d89fd) 8622c287399f3a22102b998aee098d97a621bdbe 3rd-party/prrte (psrvr-v2.0.0rc1-4018-g8622c28739)

Please describe the system on which you are running


Details of the problem

Please describe, in detail, the problem that you are having, including the behavior you expect to see, the actual behavior that you are seeing, steps to reproduce the problem, etc. It is most helpful if you can attach a small program that a developer can use to reproduce your problem.


Running ompi-tests/onesided/test_dan1 fails on a single node when comm_size >= 4.

Command line:

mpirun -np 4 ~/ompi-tests/onesided/test_dan1 

Error message:

================ test_dan1 ========== Thu Oct  7 18:36:58 2021

================ accesses ========== Thu Oct  7 18:36:58 2021

--------------------------------------------------------------------------
A system call failed during shared memory initialization that should
not have.  It is likely that your MPI job will now either abort or
experience performance degradation.

  Local host:  ip-172-31-26-148
  System call: unlink(2) /dev/shm/osc_rdma.ip-172-31-26-148.d1c20001.4
  Error:       No such file or directory (errno 2)
--------------------------------------------------------------------------
   starting puts for 0.010000 seconds ...t_base is: 1633631818704518000
t_base is: 1633631818704518000
t_base is: 1633631818704510000
t_base is: 1633631818704510000
 done.
max busyratios for 2 pairs: origin = 0.923, target = 0.800
================ get_bandwidth ========== Thu Oct  7 18:36:58 2021

t_fence is: 1633631818713915000
t_fence is: 1633631818713921000
t_fence is: 1633631818713912000
t_fence is: 1633631818713918000
msgsize, bytes   iterations        bandwidth, Mbytes/s
                               max        min        ave
         1          1000        0.5        0.5        0.5
         2          1000        1.1        1.1        1.1
         4          1000        2.2        2.2        2.2
         8          1000        4.6        4.6        4.6
        16          1000        8.9        8.9        8.9
        32          1000       17.9       17.9       17.9
        64          1000       35.2       35.2       35.2
       128          1000       72.2       72.2       72.2
       256          1000      138.8      138.8      138.8
       512          1000      273.9      273.2      273.6
      1024          1000      494.0      494.0      494.0
      2048          1000      923.4      922.1      922.7
      4096          1000     1526.1     1525.0     1525.5
      8192          1000     2384.2     2384.2     2384.2
     16384           640     3488.3     3487.2     3487.7
     32768           320     4957.9     4957.9     4957.9
     65536           160     6201.0     6193.7     6197.4
    131072            80     7292.0     7286.9     7289.5
    262144            40     7837.1     7831.2     7834.1
    524288            20     8047.6     8047.6     8047.6
   1048576            10     8211.4     8198.5     8204.9

MPI_Get aggregate bandwidth, 1048576 bytes, 2 pairs = 16409.89 Mbytes/s
================ p_to_p bandwidth ========== Thu Oct  7 18:36:58 2021

# Test: p_to_p
# Thu Oct  7 18:36:58 2021
#     SIZE  BW(MB/SEC)
================ p_to_p latency ========== Thu Oct  7 18:36:58 2021

   1048576   12865.986
# Test: latency
# Times are in microseconds
PAIRS      SIZE       MIN       AVG       MAX
    2         0       2.5       3.5       6.5
================ post_wait ========== Thu Oct  7 18:36:58 2021

--------------------------------------------------------------------------
A system call failed during shared memory initialization that should
not have.  It is likely that your MPI job will now either abort or
experience performance degradation.

  Local host:  ip-172-31-26-148
  System call: unlink(2) /dev/shm/osc_rdma.ip-172-31-26-148.d1c20001.4
  Error:       No such file or directory (errno 2)
--------------------------------------------------------------------------

The same test passes when the comm size is 2.

wzamazon commented 3 years ago

I did some investigation and found that the cause of this error is that, when osc/rdma sets up shared memory access, the shm filename is not unique across communicators.

The background: test_dan1 splits the whole world into multiple communicators, each with 2 ranks.

osc/rdma sets up shared memory between ranks of the same communicator on the same node: one rank per communicator per node (called the local leader) creates a shm file, and the other ranks of that communicator on the same node attach to it.
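For context, the pattern the test exercises looks roughly like the minimal sketch below (illustrative only, not the actual test_dan1 source; the window size is arbitrary):

/* Split MPI_COMM_WORLD into 2-rank pair communicators and create an
 * RMA window on each -- the pattern that triggers the shm filename
 * collision described here when run with -np 4 on a single node. */
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Comm pair_comm;
    MPI_Win win;
    void *base;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* All pair communicators come out of this single split call,
     * so they end up with the same cid. */
    MPI_Comm_split(MPI_COMM_WORLD, rank / 2, rank, &pair_comm);

    /* osc/rdma sets up node-local shared memory for this window;
     * the shm filename is derived from the communicator's cid. */
    MPI_Win_allocate((MPI_Aint) (1 << 20), 1, MPI_INFO_NULL, pair_comm,
                     &base, &win);

    MPI_Win_free(&win);
    MPI_Comm_free(&pair_comm);
    MPI_Finalize();
    return 0;
}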

The shm file name is:

    <directory>/osc_rdma.<nodename>.<job_id>.<ompi_comm_get_cid>

(this is the filename that appears in the error message above)

Apparently, ompi_comm_get_cid is not unique: communicators generated from the same MPI_Comm_split call have the same cid. Therefore, ranks in different communicators end up using the same shm filename, which causes the problem. For example, with -np 4 on one node, the split produces two pair communicators whose local leaders both use /dev/shm/osc_rdma.ip-172-31-26-148.d1c20001.4, as shown in the error message above.

I tested using the pid of the local leader instead of ompi_comm_get_cid, which fixed the issue.
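Roughly, the naming change amounts to something like the following (an illustrative sketch only, not the actual osc/rdma code; the helper name and parameters are made up):

#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper showing the shm path construction.
 * Before: <dir>/osc_rdma.<nodename>.<job_id>.<cid>        -- cid can repeat on a node
 * After:  <dir>/osc_rdma.<nodename>.<job_id>.<leader pid> -- unique per node */
static int build_shm_filename(char *buf, size_t len, const char *dir,
                              const char *nodename, unsigned int job_id,
                              pid_t leader_pid)
{
    return snprintf(buf, len, "%s/osc_rdma.%s.%x.%d",
                    dir, nodename, job_id, (int) leader_pid);
}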

I wonder whether it is expected that ompi_comm_get_cid returns the same cid for different communicators. If it is expected, I think we should use the local leader's pid in the shm filename. Otherwise, I can dig into it more.

bosilca commented 3 years ago

The cid is unique from a process's point of view. This means that two communicators with different participants can have the same cid. Your fix (adding the pid of the local leader) will make them unique at the node level, because the same leader cannot be part of two distinct communicators with the same cid.

wzamazon commented 3 years ago

I see. Thanks!

Because the shm filename must be unique at the node level, I think using the pid is more appropriate than using the cid.

I will open a PR to fix it then.

wzamazon commented 3 years ago

Opened https://github.com/open-mpi/ompi/pull/9495

wzamazon commented 3 years ago

Merged and backported.