open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

vader transport appears to leave SHM files laying around after successful termination #7220

Closed mwheinz closed 2 years ago

mwheinz commented 4 years ago

Thank you for taking the time to submit an issue!

Background information

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

3.1.4

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Packaged with Intel OPA 10.10.0.0.445

Please describe the system on which you are running

Back-to-back Xeon systems running RHEL 7.6 on one and RHEL 8.0 on the other.


Details of the problem

I was using OMPI to do some stress testing of some minor changes to the OPA PSM library when I discovered that the vader transport appears to be leaking memory-mapped files.

I wrote a bash script to run the OSU micro-benchmarks in a continuous loop, alternating between the PSM2 MTL and the OFI MTL. After a 24-hour run, I ran into "resource exhausted" issues when trying to start new shells, execute shell scripts, etc.

Investigating, I found over 100k shared memory files in /dev/shm, all of the form vader_segment.<hostname>.<hex number>.<decimal number>

It's not clear at this point that the shared memory files are the cause of the problems I had, but they certainly shouldn't be there!
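
For scale, the leftover segments can be counted directly with the first command below; the age-based cleanup is only a blunt, hypothetical workaround (the 60-minute threshold is an arbitrary choice, not anything the stack does itself) and should only be run when no MPI jobs are active on the node:

# Count leftover vader segments on a node.
ls /dev/shm/vader_segment.* 2>/dev/null | wc -l

# Blunt workaround sketch: remove segments older than 60 minutes.
# Only safe when no MPI jobs are running on this node.
find /dev/shm -maxdepth 1 -name 'vader_segment.*' -mmin +60 -delete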

Sample run lines:

mpirun --allow-run-as-root --oversubscribe -np 48 --mca osc pt2pt --mca pml cm --mca mtl ofi --mca mtl_ofi_provider_include psm2,ofi_rxd -H hdsmpriv01,hdsmpriv02 ./mpi/pt2pt/osu_mbw_mr
mpirun --allow-run-as-root --oversubscribe -np 48 --mca osc pt2pt --mca pml cm --mca mtl psm2 -H hdsmpriv01,hdsmpriv02 ./mpi/pt2pt/osu_mbw_mr

Script that was used to run the benchmarks:


#!/bin/bash

# mpirun --mca mtl_base_verbose 10 --mca osc pt2pt --allow-run-as-root --mca pml cm --mca mtl ofi --mca mtl_ofi_provider_include psm2,ofi_rxd -np 2 -H hdsmpriv01,hdsmpriv02 $PWD/IMB-EXT accumulate 2>&1 | tee a

OPTS1="--mca osc pt2pt --mca pml cm --mca mtl ofi --mca mtl_ofi_provider_include psm2,ofi_rxd"
OPTS2="--mca osc pt2pt --mca pml cm --mca mtl psm2"
HOSTS="-H hdsmpriv01,hdsmpriv02"
N=48

TEST_PAIR=(./mpi/pt2pt/osu_bw
    ./mpi/pt2pt/osu_bibw
    ./mpi/pt2pt/osu_latency_mt
    ./mpi/pt2pt/osu_latency
    ./mpi/one-sided/osu_get_latency
    ./mpi/one-sided/osu_put_latency
    ./mpi/one-sided/osu_cas_latency
    ./mpi/one-sided/osu_get_acc_latency
    ./mpi/one-sided/osu_acc_latency
    ./mpi/one-sided/osu_fop_latency
    ./mpi/one-sided/osu_get_bw
    ./mpi/one-sided/osu_put_bibw
    ./mpi/one-sided/osu_put_bw
)
TEST_FULL=(
    ./mpi/pt2pt/osu_mbw_mr
    ./mpi/pt2pt/osu_multi_lat
    ./mpi/startup/osu_init
    ./mpi/startup/osu_hello
    ./mpi/collective/osu_allreduce
    ./mpi/collective/osu_scatter
    ./mpi/collective/osu_iallgatherv
    ./mpi/collective/osu_alltoallv
    ./mpi/collective/osu_ireduce
    ./mpi/collective/osu_alltoall
    ./mpi/collective/osu_igather
    ./mpi/collective/osu_allgatherv
    ./mpi/collective/osu_iallgather
    ./mpi/collective/osu_reduce
    ./mpi/collective/osu_ialltoallv
    ./mpi/collective/osu_ibarrier
    ./mpi/collective/osu_ibcast
    ./mpi/collective/osu_gather
    ./mpi/collective/osu_barrier
    ./mpi/collective/osu_iscatter
    ./mpi/collective/osu_scatterv
    ./mpi/collective/osu_igatherv
    ./mpi/collective/osu_allgather
    ./mpi/collective/osu_ialltoall
    ./mpi/collective/osu_ialltoallw
    ./mpi/collective/osu_reduce_scatter
    ./mpi/collective/osu_iscatterv
    ./mpi/collective/osu_gatherv
    ./mpi/collective/osu_bcast
    ./mpi/collective/osu_iallreduce)

while true; do
    echo "------------------------"
    date
    echo "------------------------"
    for t in "${TEST_PAIR[@]}"
    do
        CMD="mpirun --allow-run-as-root -np 2 ${OPTS1} ${HOSTS} ${t}"

        echo "${CMD}"

        eval "${CMD}"

        CMD="mpirun --allow-run-as-root -np 2 ${OPTS2} ${HOSTS} ${t}"

        echo "${CMD}"

        eval "${CMD}"
    done
    for t in "${TEST_FULL[@]}"
    do
        CMD="mpirun --allow-run-as-root --oversubscribe -np ${N} ${OPTS1} ${HOSTS} ${t}"

        echo "${CMD}"

        eval "${CMD}"

        CMD="mpirun --allow-run-as-root --oversubscribe -np ${N} ${OPTS2} ${HOSTS} ${t}"

        echo "${CMD}"

        eval "${CMD}"
    done
    sleep 60
done
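
For anyone reproducing this, a small monitoring hook (not part of the original script; it assumes passwordless SSH between the two hosts) can be added inside the outer loop, e.g. just before the sleep 60, to show the segment count growing per iteration:

# Hypothetical monitoring hook: log how many vader segments each host holds.
for h in hdsmpriv01 hdsmpriv02; do
    count=$(ssh "${h}" 'ls /dev/shm/vader_segment.* 2>/dev/null | wc -l')
    echo "$(date +%FT%T) ${h}: ${count} vader segments in /dev/shm"
done
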
mwheinz commented 4 years ago

Looks like this problem does not exist in 4.0.2. I haven't figured out which commit corrects the issue, however.

mwheinz commented 4 years ago

Looks like this is the known issue #6565. Fix is in master and 4.0.2 but not in the 3.1.x branch.

(Update: this does not seem to be true.)

mwheinz commented 4 years ago

Okay - I tried backporting the patch from #6565 because it fit much of the description, but it does not actually fix the problem for 3.1.4. I tried testing 3.1.5 but failed to build it due to the GLIBC_PRIVATE issue.

maxhgerlach commented 4 years ago

Running MPI processes under Open MPI 4.0.2, I noticed that those /dev/shm/vader_segment.* files stick around after these processes are terminated via SIGTERM.

That sounds like a memory leak waiting to cause more severe issues.
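
A minimal way to check this by hand is sketched below. The benchmark path is reused from the script above, the sleep durations are arbitrary, and the signal here goes to mpirun rather than to individual ranks, which is only one variation of the scenario described:

# Launch a long-running MPI job, SIGTERM the launcher, then look for leftovers.
mpirun -np 2 ./mpi/pt2pt/osu_latency &
MPIRUN_PID=$!
sleep 5
kill -TERM "${MPIRUN_PID}"
sleep 5
ls -l /dev/shm/vader_segment.* 2>/dev/null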

hjelmn commented 4 years ago

These files are supposed to be cleaned up by PMIx. Not sure why that isn't happening in this case.

jsquyres commented 4 years ago

FWIW: we discussed this on the weekly OMPI call today:

  1. Open MPI >= v4.0.x uses PMIx 3.x, which has a "register to do something at job shutdown" hook. Hence, in Open MPI master and >= v4.0, we shouldn't be seeing these leftover files. If we are, it's a bug.
  2. Open MPI < v4.0.x uses PMIx 2.x, which does not have the "register to do something at job shutdown" hook. @hjelmn today said he'd look at the logic / workarounds we were supposed to have in place in the v3.0.x / v3.1.x trees and make sure those were working as best as they can work. (A quick way to check which PMIx an installation carries is shown below.)
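
Since the behavior hinges on which PMIx a given installation was built with, it may help to verify that first. A rough check follows; the exact ompi_info output format differs between releases, so the grep is an assumption about what the listing contains, not a guarantee:

# Print the Open MPI version, then any PMIx-related build/MCA information.
ompi_info --version
ompi_info --all | grep -i pmix
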
rhc54 commented 4 years ago

I examined OMPI v4.0.2 and it appears to be doing everything correctly (ditto for master). I cannot see any reason why it would be leaving those files behind. Even the terminate-by-signal path flows through the cleanup.

No real ideas here - can anyone replicate this behavior? I can't on my VMs - it all works correctly.

mkre commented 4 years ago

@rhc54, I can confirm it's working with 4.0.2. However, I can reliably reproduce the behavior using 3.1.x.

I think it's the same underlying issue I'm running into in #7308: if another user left behind a segment file and there is a segment file name conflict with my current job, the run will abort with "permission denied" because the existing segment file can't be opened.

As @jsquyres pointed out, it seems to be an issue with PMIx 2.x. While @hjelmn is looking into possible workarounds, I'm wondering whether we can use PMIx 3.x with Open MPI 3.1.5.
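
For reference, building Open MPI against an external PMIx generally goes through configure options like the sketch below. The paths are placeholders, and whether the 3.1.x series actually accepts a PMIx 3.x installation is exactly the open question here, so treat this as an outline rather than a recipe:

# Placeholder paths; an external PMIx normally also requires pointing
# Open MPI at the same libevent that PMIx was built against.
./configure --prefix=/opt/openmpi-3.1.5 \
            --with-pmix=/opt/pmix \
            --with-libevent=/opt/libevent
make -j"$(nproc)" && make install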

maxhgerlach commented 4 years ago

Sorry for the confusion: It was a bug in our setup. I can now confirm that /dev/shm/vader* files are cleaned up after SIGTERM in Open MPI 4.0.2.

awlauria commented 2 years ago

@mwheinz, can you check whether https://github.com/open-mpi/ompi/pull/10040 fixes this issue for you? I noticed the same thing on master recently.

awlauria commented 2 years ago

I misread this issue. It appears it only happens in the Open MPI v3 series, which is frozen. Since it is fixed in v4 and beyond, this should probably be closed.

I confirmed that the regression addressed by #10040 exists only on master/v5; v4.0/v4.1 work correctly.

v5.0.x pr: https://github.com/open-mpi/ompi/pull/10046