openpmix / openpmix

OpenPMIx Project Repository
https://openpmix.org

dstore issue during large job launch #1747

Closed bosilca closed 3 years ago

bosilca commented 4 years ago

Background information

We were running some scale tests with Open MPI 4.0.3, using PMIx3. All small-scale jobs, up to 512 nodes (single process per node), were successful, but most of the larger ones failed due to some resource exhaustion.

What version of the PMIx Reference Library are you using? (e.g., v1.0, v2.1, git master @ hash, etc.)

v3.0 as shipped with Open MPI 4.0.3.

Describe how PMIx was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Release installation from OMPI, no additional flags. Compiled with Intel 19.

Please describe the system on which you are running


Details of the problem

When using the default options, 1k-node jobs randomly (but mostly) failed, and 4k-node jobs never got past PMIx_Init. The error report looked similar to the following.

[XXX:41004] PMIX ERROR: OUT-OF-RESOURCE in file /home/mpi/openmpi/4.0.3-intel-19.1.0.166/openmpi-4.0.3/opal/mca/pmix/pmix3x/pmix/src/client/pmix_client.c at line 231
[XXX:41004] OPAL ERROR: Error in file /home/mpi/openmpi/4.0.3-intel-19.1.0.166/openmpi-4.0.3/opal/mca/pmix/pmix3x/pmix3x_client.c at line 112
[XXX:40994] PMIX ERROR: OUT-OF-RESOURCE in file /home/mpi/openmpi/4.0.3-intel-19.1.0.166/openmpi-4.0.3/opal/mca/pmix/pmix3x/pmix/src/mca/common/dstore/dstore_segment.c at line 207
[XXX:40994] PMIX ERROR: OUT-OF-RESOURCE in file /home/mpi/openmpi/4.0.3-intel-19.1.0.166/openmpi-4.0.3/opal/mca/pmix/pmix3x/pmix/src/mca/common/dstore/dstore_base.c at line 658
[XXX:40994] PMIX ERROR: OUT-OF-RESOURCE in file /home/mpi/openmpi/4.0.3-intel-19.1.0.166/openmpi-4.0.3/opal/mca/pmix/pmix3x/pmix/src/mca/common/dstore/dstore_base.c at line 1850
[XXX:40994] PMIX ERROR: OUT-OF-RESOURCE in file /home/mpi/openmpi/4.0.3-intel-19.1.0.166/openmpi-4.0.3/opal/mca/pmix/pmix3x/pmix/src/mca/common/dstore/dstore_base.c at line 2808
[XXX:40994] PMIX ERROR: OUT-OF-RESOURCE in file /home/mpi/openmpi/4.0.3-intel-19.1.0.166/openmpi-4.0.3/opal/mca/pmix/pmix3x/pmix/src/mca/common/dstore/dstore_base.c at line 2857
[XXX:40994] PMIX ERROR: OUT-OF-RESOURCE in file /home/mpi/openmpi/4.0.3-intel-19.1.0.166/openmpi-4.0.3/opal/mca/pmix/pmix3x/pmix/src/server/pmix_server.c at line 3408

Digging into the PMIx 3 code, the only way to get an OUT-OF-RESOURCE in common/dstore/dstore_segment.c:207 is to fail to create the shm segment (pshmem_mmap.c:79) because posix_fallocate fails. So either there was no space on the device hosting the temporary files, or the requested size was unreasonable.
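
As a quick sanity check of that hypothesis (a hedged sketch, not something we ran verbatim), one can verify on a compute node that the filesystems backing the dstore segments have free space and that a preallocation of roughly the segment size succeeds; the 64M size below is an arbitrary placeholder, not the actual size PMIx requested:

    # check free space on the devices hosting temporary files
    df -h /dev/shm /tmp
    # try to preallocate a test file the way posix_fallocate would
    fallocate -l 64M /dev/shm/pmix_alloc_test && echo "preallocation OK"
    rm -f /dev/shm/pmix_alloc_test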

We confirmed that all nodes had enough space in /dev/shm and /tmp, and that the open file descriptor limit was set high enough (> 4k). Moreover, while chatting with @rhc54 he suggested forcing PMIX_MCA_gds to hash. Doing so allowed all our 1k submissions to execute. As a result, it seems the issue cannot be attributed to any system configuration, but is coming from PMIx, more precisely from the ds12 GDS component.
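
For reference, this is roughly how we forced the hash component (a minimal sketch; the application and launch options are placeholders, not the exact command we used):

    export PMIX_MCA_gds=hash
    mpirun -x PMIX_MCA_gds -np $nproc --hostfile $hostfile ./a.out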

rhc54 commented 4 years ago

@artpol84 @karasevb We also tried setting PMIX_MCA_gds=^ds12, but that also failed, though with a different signature:

[[57770,0],64] FORCE-TERMINATE AT Not found:-13 - error /zhome/academic/HLRS/hlrs/hpcoft14/work/mpi/openmpi/4.0.3-intel-19.1.0.166/openmpi-4.0.3/orte/mca/grpcomm/base/grpcomm_base_stubs.c(355)
This is something that should be reported to the developers.
[r5c3t8n3:185389] [[57770,0],64] ORTE_ERROR_LOG: Not found in file /zhome/academic/HLRS/hlrs/hpcoft14/work/mpi/openmpi/4.0.3-intel-19.1.0.166/openmpi-4.0.3/orte/mca/grpcomm/base/grpcomm_base_stubs.c at line 278
[r5c3t8n3:185389] [[57770,0],64] ORTE_ERROR_LOG: Not found in file /zhome/academic/HLRS/hlrs/hpcoft14/work/mpi/openmpi/4.0.3-intel-19.1.0.166/openmpi-4.0.3/orte/mca/grpcomm/direct/grpcomm_direct.c at line 187
[r1c1t1n1:60691] PMIX ERROR: ERROR in file /zhome/academic/HLRS/hlrs/hpcoft14/work/mpi/openmpi/4.0.3-intel-19.1.0.166/openmpi-4.0.3/opal/mca/pmix/pmix3x/pmix/src/mca/gds/ds21/gds_ds21_lock_pthread.c at line 99
[r1c1t1n1:60691] PMIX ERROR: ERROR in file /zhome/academic/HLRS/hlrs/hpcoft14/work/mpi/openmpi/4.0.3-intel-19.1.0.166/openmpi-4.0.3/opal/mca/pmix/pmix3x/pmix/src/mca/gds/ds21/gds_ds21_lock_pthread.c at line 99

@bosilca is going to try to get some additional time on the machine to test OMPI master so we can see if this problem persists on PMIx master. Pending that resolution, we should consider this a blocker for the PMIx v4 release, and an indicator that we definitely need a PMIx v3.2 release.

jjhursey commented 4 years ago

@bosilca Did you all see this on other machines (e.g., Summit) with a similar build? From the path in the output, it looks like you compiled with the Intel compiler. Is that accurate?

FYI I'd like to talk about this ticket on the developer teleconf later today.

bosilca commented 4 years ago

Just to be clear, during these tests I did not have the leisure to compile my own version; I used what was made available on the system. On Summit we were using spectrum-mpi/10.3.1.2-20200121 and we did not encounter any startup-related issues (I emailed you about the issues we got there).

And yes, in this particular instance OMPI was compiled with the Intel compiler (19.1.0.166).

jjhursey commented 4 years ago

Thanks for that note. Spectrum MPI on Summit is running PMIx 3.1.4. So it might be something between 3.1.4 and what is in OMPI 4.0.3 (which I think is 3.1.5).

naughtont3 commented 4 years ago

Quick follow-up from last week: I did a 513-node test on Summit with an ompi-4.0.3 build using IMB Barrier and it passed without a problem. Only a one-off data point, but I wanted to mention it.

I used GCC toolchain and ucx-1.7.0 (self built).

naughtont3 commented 4 years ago

I did a 1025-node test on Summit with an ompi-4.0.3 build using IMB Barrier (ppr:42:node for a max of 43,050 ranks total) and it ran without problems.

bwbarrett commented 4 years ago

I ran 50 or so tests on a cluster of 1024 1-core virtual machines. I did not see any hard failures. However, I did see the following on exactly 1 run, which seems very weird:

--------------------------------------------------------------------------
The pmix_mca_base_component_path MCA variable was used to add paths to
search for PMIX components.  At least one directory failed to add
properly:

    /home/ec2-user/.pmix/compone

Check to make sure that this directory exists, is readable, etc.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
The pmix_mca_base_component_path MCA variable was used to add paths to
search for PMIX components.  At least one directory failed to add
properly:

    ts

Check to make sure that this directory exists, is readable, etc.
--------------------------------------------------------------------------

I doubt it has anything to do with this ticket, unless there's some weird race with memory somewhere?

naughtont3 commented 4 years ago

I ran osu_init in a loop 50 times over 1025 nodes (1 rank per node, ppr:1:node) with ompi-4.0.3 on Summit and had no problems.

bosilca commented 4 years ago

Apparently this issue is specific to the environment I was running on. I can't get access to the machine right now (and I will need a large allocation anyway), so I will downgrade this to minor but keep it alive until I am able to run more tests.

However, it would be good to have a precise plan of what to run and how to run it, so as to gather as much info as possible. So, which MCA parameters should I try, and what verbosity should I set, to get enough info to allow us to understand and hopefully fix this?
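
My current plan, unless someone suggests better knobs, would be something like the following (the parameter names are my assumption based on the usual MCA <framework>_base_verbose convention and need to be verified against the PMIx shipped with OMPI 4.0.3):

    # hypothetical debug settings, following the generic MCA *_base_verbose convention
    export PMIX_MCA_gds_base_verbose=10
    mpirun -x PMIX_MCA_gds_base_verbose --mca pmix_base_verbose 10 -np $nproc ./a.out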

naughtont3 commented 4 years ago

For the ticket trail, here are the run bits for my osu_init test:

[naughton@login3.summit osu-micro-benchmarks-5.6.2]$ env | grep UCX
UCX_DIR=/sw/summit/ums/ompix/gcc/6.4.0/install/ucx-1.7.0
UCX_INSTALL_DIR=/sw/summit/ums/ompix/gcc/6.4.0/install/ucx-1.7.0
UCX_MAX_RNDV_RAILS=2
UCX_NET_DEVICES=mlx5_0:1,mlx5_3:1
[naughton@login3.summit osu-micro-benchmarks-5.6.2]$ env | grep MCA
OMPI_MCA_routed=direct
OMPI_MCA_btl_openib_warn_default_gid_prefix=0
OMPI_MCA_rmaps_base_no_schedule_local=1
OMPI_MCA_plm_rsh_no_tree_spawn=1
OMPI_MCA_io=romio321
    mpirun \
        --mca btl ^openib \
        -np $nproc \
        --nolocal \
        --hostfile $LSB_DJOB_HOSTFILE \
        --map-by ppr:$ppr:node \
        --bind-to core \
        -x PATH \
        -x LD_LIBRARY_PATH \
        $OSU_PATH/osu_init

karasevb commented 4 years ago

@bosilca could you please clarify what your ulimit for file size is?

I verified that on AMD EPYC, a 128-ppn job with the ds21 gds component requires a file size limit of no less than 20480.
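
A quick way to check that in the launch environment (a minimal sketch; 20480 is the threshold mentioned above, in ulimit's block units):

    # show the current file size limit on a compute node
    ulimit -f
    # raise the soft limit for this shell before launching, if it is lower
    ulimit -f 20480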

bosilca commented 4 years ago

$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 1027595
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 16384
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 300000
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1027595
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

James-A-Clark commented 3 years ago

Hi @rhc54, did this turn out to be a configuration issue or require a code change to fix in the end?

I'm also seeing the same out of resource issue:

PMIX ERROR: ERROR in file dstore_segment.c at line 207
PMIX ERROR: OUT-OF-RESOURCE in file dstore_base.c at line 661
PMIX ERROR: OUT-OF-RESOURCE in file dstore_base.c at line 1857
PMIX ERROR: OUT-OF-RESOURCE in file dstore_base.c at line 2846

This is with OpenMPI 4.0.6-rc1 and PMIx 3.2.2.

Ulimit looks like this:

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 1028702
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 10485760
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 4096
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

The issue is happening intermittently with large jobs of 500-1000 nodes, but it never happens with smaller job sizes. It also only happens when launching with a debugger; without a debugger, launching always succeeds.

This could suggest that the debugger is using extra memory, and that the memory allocation is where PMIx is failing. But the ulimit output suggests this shouldn't be an issue.

rhc54 commented 3 years ago

We were never able to replicate it, even on large runs. It's hard to believe you'd be hitting a true memory limit, so I suspect it is something else that is causing the problem. Can you provide any details on what you are doing - e.g., how you are launching with a debugger?

James-A-Clark commented 3 years ago

After digging a bit more, I think the true memory limit might be what we're seeing as well because it's 128 processes on a single node. I will have to eliminate that possibility first and then will come back here if I still think it's PMIx related. Thanks for the reply.

(Edit: The reason the memory limit is hit when debugging is that there is an instance of all the debug symbols held in memory for each process. With 128 processes that can add up quite quickly.)

rhc54 commented 3 years ago

Might be worth seeing what happens if you use an app like /bin/true, which wouldn't have all the debug symbols. Let us know what you find either way, as we'd really like to understand what is going on in this case.
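
Something along these lines, reusing the launch options from earlier in this thread (a sketch only; the hostfile and ppr values are placeholders):

    mpirun --hostfile $LSB_DJOB_HOSTFILE --map-by ppr:128:node --bind-to core /bin/true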

James-A-Clark commented 3 years ago

I don't know if this is a red herring or not, but we fixed our startup issue by deleting the contents of /tmp. The first startup always worked, but subsequent ones would fail with OUT-OF-RESOURCE until we cleared the OpenMPI files in /tmp.

I've attached the zip of what was left behind by the first run: tmp.tar.gz

We were nowhere near the memory limit in either the working or the failing startup cases, using about 80 GB out of 256 GB.
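
For anyone hitting the same symptom, the cleanup was roughly the following (a sketch; the exact names of the leftover files depend on the OMPI/PMIx versions, so inspect before deleting):

    # look for leftovers from previous runs
    ls -l /tmp | grep -i -e ompi -e pmix
    ls -l /dev/shm | grep -i -e pmix -e dstore
    # remove stale OpenMPI/PMIx session files owned by the current user
    # (the patterns below are assumptions; verify before deleting)
    rm -rf /tmp/ompi.* /dev/shm/*pmix*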

rhc54 commented 3 years ago

@artpol84 @karasevb @jjhursey Looks like we have a cleanup issue in the v3.2 branch that is a significant problem for OMPI. Can someone take a look at it? Please see recent comments from @James-A-Clark above.

Note this critical caveat (not sure how it causes the problem):

It also only happens when launching with a debugger; without a debugger, launching always succeeds.