Closed: bosilca closed this issue 3 years ago
@artpol84 @karasevb We also tried setting PMIX_MCA_gds=^ds12, but that also failed, though with a different signature:
[[57770,0],64] FORCE-TERMINATE AT Not found:-13 - error /zhome/academic/HLRS/hlrs/hpcoft14/work/mpi/openmpi/4.0.3-intel-19.1.0.166/openmpi-4.0.3/orte/mca/grpcomm/base/grpcomm_base_stubs.c(355)
This is something that should be reported to the developers.
[r5c3t8n3:185389] [[57770,0],64] ORTE_ERROR_LOG: Not found in file /zhome/academic/HLRS/hlrs/hpcoft14/work/mpi/openmpi/4.0.3-intel-19.1.0.166/openmpi-4.0.3/orte/mca/grpcomm/base/grpcomm_base_stubs.c at line 278
[r5c3t8n3:185389] [[57770,0],64] ORTE_ERROR_LOG: Not found in file /zhome/academic/HLRS/hlrs/hpcoft14/work/mpi/openmpi/4.0.3-intel-19.1.0.166/openmpi-4.0.3/orte/mca/grpcomm/direct/grpcomm_direct.c at line 187
[r1c1t1n1:60691] PMIX ERROR: ERROR in file /zhome/academic/HLRS/hlrs/hpcoft14/work/mpi/openmpi/4.0.3-intel-19.1.0.166/openmpi-4.0.3/opal/mca/pmix/pmix3x/pmix/src/mca/gds/ds21/gds_ds21_lock_pthread.c at line 99
[r1c1t1n1:60691] PMIX ERROR: ERROR in file /zhome/academic/HLRS/hlrs/hpcoft14/work/mpi/openmpi/4.0.3-intel-19.1.0.166/openmpi-4.0.3/opal/mca/pmix/pmix3x/pmix/src/mca/gds/ds21/gds_ds21_lock_pthread.c at line 99
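For reference, a minimal sketch of the component-selection workaround being discussed. This is an illustrative shell snippet, not taken from the actual job scripts: the `^` prefix tells the MCA framework to use every component except the listed ones, while `hash` (suggested elsewhere in the thread) forces the non-shared-memory fallback.

```shell
# Sketch only: select PMIx gds components via environment variables.
# "^ds12" means "any gds component except ds12".
export PMIX_MCA_gds=^ds12
# export PMIX_MCA_gds=hash   # alternative: bypass the shared-memory dstore entirely
echo "gds selection: $PMIX_MCA_gds"
```

Because the variable is exported, the embedded PMIx library in each launched process picks it up at startup.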
@bosilca is going to try to get some additional time on the machine to test OMPI master so we can see if this problem persists on PMIx master. Pending that resolution, we should consider this as a blocker for PMIx v4 release, and an indicator that we definitely need a PMIx v3.2 release.
@bosilca Did you all see this on other machines (e.g., Summit) with a similar build? From the path in the output, it looks like you compiled with the intel compiler. Is that accurate?
FYI I'd like to talk about this ticket on the developer teleconf later today.
Just to be clear, during these tests I did not have the leisure to compile my own version; I used what was made available by the system. On Summit we were using spectrum-mpi/10.3.1.2-20200121 and we did not encounter any issues related to startup (I emailed you about the issues we got there).
And yes, in this particular instance OMPI was compiled with the Intel compiler (19.1.0.166).
Thanks for that note. Spectrum MPI on Summit is running PMIx 3.1.4. So it might be something between 3.1.4 and what is in OMPI 4.0.3 (which I think is 3.1.5).
Quick follow-up from last week: I did a 513-node test on Summit with an ompi-4.0.3 build using IMB Barrier, and it passed without a problem. Only a one-off data point, but I wanted to mention it.
I used the GCC toolchain and ucx-1.7.0 (self-built).
I did a 1025-node test on Summit with an ompi-4.0.3 build using IMB Barrier (ppr:42:node for a maximum of 43,050 ranks total), and it ran without problems.
I ran 50 or so tests on a cluster of 1024 1-core virtual machines. I did not see any hard failures. However, I did see the following on exactly 1 run, which seems very weird:
--------------------------------------------------------------------------
The pmix_mca_base_component_path MCA variable was used to add paths to
search for PMIX components. At least one directory failed to add
properly:
/home/ec2-user/.pmix/compone
Check to make sure that this directory exists, is readable, etc.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
The pmix_mca_base_component_path MCA variable was used to add paths to
search for PMIX components. At least one directory failed to add
properly:
ts
Check to make sure that this directory exists, is readable, etc.
--------------------------------------------------------------------------
I doubt it has anything to do with this ticket, unless there's some weird race with memory somewhere?
I ran osu_init in a loop 50 times over 1025 nodes (1 rank per node, ppr:1:node) with ompi-4.0.3 on Summit and had no problems.
Apparently this issue is specific to the environment I was running on. I can't get access to the machine right now (and I will need a large allocation anyway), so I will downgrade this to minor but keep it alive until I am able to run more tests.
However, it would be good to have a precise plan of what to run and how to run it, so I can gather as much information as possible. Which MCA parameters should I try, and what verbosity settings should I use, to get enough information to let us understand and hopefully fix this?
For the ticket trail, here are the run settings for my osu_init test:
[naughton@login3.summit osu-micro-benchmarks-5.6.2]$ env | grep UCX
UCX_DIR=/sw/summit/ums/ompix/gcc/6.4.0/install/ucx-1.7.0
UCX_INSTALL_DIR=/sw/summit/ums/ompix/gcc/6.4.0/install/ucx-1.7.0
UCX_MAX_RNDV_RAILS=2
UCX_NET_DEVICES=mlx5_0:1,mlx5_3:1
[naughton@login3.summit osu-micro-benchmarks-5.6.2]$ env | grep MCA
OMPI_MCA_routed=direct
OMPI_MCA_btl_openib_warn_default_gid_prefix=0
OMPI_MCA_rmaps_base_no_schedule_local=1
OMPI_MCA_plm_rsh_no_tree_spawn=1
OMPI_MCA_io=romio321
mpirun \
--mca btl ^openib \
-np $nproc \
--nolocal \
--hostfile $LSB_DJOB_HOSTFILE \
--map-by ppr:$ppr:node \
--bind-to core \
-x PATH \
-x LD_LIBRARY_PATH \
$OSU_PATH/osu_init
@bosilca could you please clarify what your ulimit for file size is?
I verified on AMD EPYC that a 128-ppn job with the ds21 gds component requires a file size limit of no less than 20480 blocks.
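A quick sanity check one could run on the compute nodes before launching, assuming the 20480-block threshold quoted above; the helper function name is made up for this sketch.

```shell
# Hypothetical helper: verify the file-size ulimit meets the ~20480-block
# requirement reported above for 128-ppn jobs using the ds21 gds component.
check_filesize_limit() {
  required=20480
  limit=$(ulimit -f)
  if [ "$limit" = "unlimited" ] || [ "$limit" -ge "$required" ]; then
    echo "file-size limit OK ($limit)"
  else
    echo "file-size limit too low ($limit < $required)"
  fi
}
check_filesize_limit
```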
$ ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 1027595
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 16384
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 300000
cpu time (seconds, -t) unlimited
max user processes (-u) 1027595
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
Hi @rhc54, did this turn out to be a configuration issue or require a code change to fix in the end?
I'm also seeing the same out of resource issue:
PMIX ERROR: ERROR in file dstore_segment.c at line 207
PMIX ERROR: OUT-OF-RESOURCE in file dstore_base.c at line 661
PMIX ERROR: OUT-OF-RESOURCE in file dstore_base.c at line 1857
PMIX ERROR: OUT-OF-RESOURCE in file dstore_base.c at line 2846
This is with OpenMPI 4.0.6-rc1 and PMIx 3.2.2.
Ulimit looks like this:
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 1028702
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 10485760
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 4096
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
The issue happens intermittently with large jobs of 500-1000 nodes, but it never happens with smaller job sizes. It also only happens when launching under a debugger; without one, launching always succeeds.
This could suggest that the debugger is using extra memory, and that allocating memory is where PMIx is failing. But the ulimit output suggests this shouldn't be an issue.
We were never able to replicate it, even on large runs. It's hard to believe you'd be hitting a true memory limit, so I suspect it is something else that is causing the problem. Can you provide any details on what you are doing - e.g., how you are launching with a debugger?
After digging a bit more, I think a true memory limit might be what we're seeing as well, because it's 128 processes on a single node. I will have to eliminate that possibility first, and then come back here if I still think it's PMIx related. Thanks for the reply.
(Edit: The reason the memory limit is hit when debugging is that an instance of all the debug symbols is held in memory for each process. With 128 processes that can add up quickly.)
You might see what happens if you use an app like /bin/true, which wouldn't have all the debug symbols. Let us know what you find either way, as we'd really like to understand what is going on in this case.
I don't know if this is a red herring or not, but we fixed our startup issue by deleting the contents of /tmp. The first startup always worked, but subsequent ones would fail with OUT-OF-RESOURCE until we cleared the OpenMPI files in /tmp.
I've attached the zip of what was left behind by the first run: tmp.tar.gz
We were nowhere near the memory limit in either the working or non-working startup cases, using about 80GB out of 256GB.
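If stale session files are the trigger, something like the following could help confirm it before the next run. The `ompi.*`/`pmix*` name patterns are assumptions about typical Open MPI / PMIx session-file naming, not verified against the attached tarball.

```shell
# List (don't yet delete) files this user's earlier runs left in /tmp.
# Name patterns are guesses at typical Open MPI / PMIx session files.
TMPBASE=${TMPBASE:-/tmp}
find "$TMPBASE" -maxdepth 1 -user "$(id -un)" \
     \( -name 'ompi.*' -o -name 'pmix*' \) -print
# Once the listing matches what the failing nodes show, append:
#   -exec rm -rf {} +
```

Listing first and deleting only after inspection avoids removing another job's live session directory on a shared node.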
@artpol84 @karasevb @jjhursey Looks like we have a cleanup issue in the v3.2 branch that is a significant problem for OMPI. Can someone take a look at it? Please see recent comments from @James-A-Clark above.
Note this critical caveat (not sure how it causes the problem):
It also only happens when launching under a debugger; without one, launching always succeeds.
Background information
We were running some scale tests with Open MPI 4.0.3, using PMIx3. All small-scale jobs, up to 512 nodes (single process per node), were successful, but most of the larger ones failed due to some resource exhaustion.
What version of the PMIx Reference Library are you using? (e.g., v1.0, v2.1, git master @ hash, etc.)
v3.0 as shipped with Open MPI 4.0.3.
Describe how PMIx was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Release installation from OMPI, no additional flags. Compiled with Intel 19.
Please describe the system on which you are running
Details of the problem
When using the default options, 1k-node jobs randomly (but mostly) failed, and 4k-node jobs never passed PMIx_Init. The error report looked similar to the following.
Digging into the PMIx 3 code, the only way to get an OUT-OF-RESOURCE in common/dstore/dstore_segment.c:207 is to fail to create the shm segment (pshmem_mmap.c:79) because posix_fallocate fails. So either there was no space on the device hosting the temporary files, or the requested size was unreasonable.
We confirmed that all nodes had enough space in /dev/shm and /tmp, and that the number of open file descriptors was set high enough (> 4k). Moreover, while chatting with @rhc54, he suggested forcing PMIX_MCA_gds to hash. Doing so allowed all our 1k submissions to execute. As a result, it seems the issue cannot be attributed to any system configuration but is coming from PMIx, more precisely from the ds12 gds component.
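As a rough way to probe that failure mode on a node: the fallocate(1) utility exercises the same preallocation path as the posix_fallocate call that fails at pshmem_mmap.c:79 when the device is full. This sketch is illustrative only; the helper name and the 20M size are arbitrary, not values from the PMIx code.

```shell
# Hypothetical probe: can we preallocate a segment-sized file where the
# dstore would put it? A failure here would mimic the OUT-OF-RESOURCE path.
probe_fallocate() {
  dir=${1:-/dev/shm}
  f=$(mktemp "$dir/pmix_probe.XXXXXX") || { echo "cannot create file in $dir"; return 1; }
  if fallocate -l 20M "$f" 2>/dev/null; then
    echo "fallocate OK in $dir"
  else
    echo "fallocate FAILED in $dir (out of space, or fs lacks fallocate support)"
  fi
  rm -f "$f"
}
probe_fallocate /dev/shm
```

Running this on every node of a failing allocation would show whether any node's /dev/shm (or whatever directory the dstore uses) is actually exhausted.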