afborchert opened this issue 2 years ago
@hjelmn @bwbarrett @hppritcha Is there any way that this is related to #9868 / #9880? I know that those are about XPMEM and unrelated to Solaris, but is there a larger issue in v4.1.x:vader/master:sm cleanup?
I do not think so as no signals (SIGBUS or whatever) were observed in any of the processes.
@afborchert could you add the -tag-output option to the mpirun command line? That may help us better diagnose what's going on.
Sure:
theon$ mpirun -tag-output -np 4 mpi-test
[1,0]<stdout>:3
[1,0]<stdout>:1
[1,0]<stdout>:2
--------------------------------------------------------------------------
A system call failed during shared memory initialization that should
not have. It is likely that your MPI job will now either abort or
experience performance degradation.
Local host: theon
System call: unlink(2) /tmp/ompi.theon.120/pid.21254/1/vader_segment.theon.120.1b4d0001.2
Error: No such file or directory (errno 2)
--------------------------------------------------------------------------
theon$
I've added a run with “-d” as well:
theon$ mpirun -d -np 4 mpi-test
[theon:21381] procdir: /tmp/ompi.theon.120/pid.21381/0/0
[theon:21381] jobdir: /tmp/ompi.theon.120/pid.21381/0
[theon:21381] top: /tmp/ompi.theon.120/pid.21381
[theon:21381] top: /tmp/ompi.theon.120
[theon:21381] tmp: /tmp
[theon:21381] sess_dir_cleanup: job session dir does not exist
[theon:21381] sess_dir_cleanup: top session dir not empty - leaving
[theon:21381] procdir: /tmp/ompi.theon.120/pid.21381/0/0
[theon:21381] jobdir: /tmp/ompi.theon.120/pid.21381/0
[theon:21381] top: /tmp/ompi.theon.120/pid.21381
[theon:21381] top: /tmp/ompi.theon.120
[theon:21381] tmp: /tmp
[theon:21381] [[7118,0],0] Releasing job data for [INVALID]
[theon:21385] procdir: /tmp/ompi.theon.120/pid.21381/1/3
[theon:21385] jobdir: /tmp/ompi.theon.120/pid.21381/1
[theon:21385] top: /tmp/ompi.theon.120/pid.21381
[theon:21385] top: /tmp/ompi.theon.120
[theon:21385] tmp: /tmp
[theon:21382] procdir: /tmp/ompi.theon.120/pid.21381/1/0
[theon:21382] jobdir: /tmp/ompi.theon.120/pid.21381/1
[theon:21382] top: /tmp/ompi.theon.120/pid.21381
[theon:21382] top: /tmp/ompi.theon.120
[theon:21382] tmp: /tmp
[theon:21384] procdir: /tmp/ompi.theon.120/pid.21381/1/2
[theon:21384] jobdir: /tmp/ompi.theon.120/pid.21381/1
[theon:21384] top: /tmp/ompi.theon.120/pid.21381
[theon:21384] top: /tmp/ompi.theon.120
[theon:21384] tmp: /tmp
[theon:21383] procdir: /tmp/ompi.theon.120/pid.21381/1/1
[theon:21383] jobdir: /tmp/ompi.theon.120/pid.21381/1
[theon:21383] top: /tmp/ompi.theon.120/pid.21381
[theon:21383] top: /tmp/ompi.theon.120
[theon:21383] tmp: /tmp
2
1
3
--------------------------------------------------------------------------
A system call failed during shared memory initialization that should
not have. It is likely that your MPI job will now either abort or
experience performance degradation.
Local host: theon
System call: unlink(2) /tmp/ompi.theon.120/pid.21381/1/vader_segment.theon.120.1bce0001.3
Error: No such file or directory (errno 2)
--------------------------------------------------------------------------
MPIR_being_debugged = 0
MPIR_debug_state = 1
MPIR_partial_attach_ok = 1
MPIR_i_am_starter = 0
MPIR_forward_output = 0
MPIR_proctable_size = 4
MPIR_proctable:
(i, host, exe, pid) = (0, theon, /home/borchert/tmp/ompi/mpi-test, 21382)
(i, host, exe, pid) = (1, theon, /home/borchert/tmp/ompi/mpi-test, 21383)
(i, host, exe, pid) = (2, theon, /home/borchert/tmp/ompi/mpi-test, 21384)
(i, host, exe, pid) = (3, theon, /home/borchert/tmp/ompi/mpi-test, 21385)
MPIR_executable_path: NULL
MPIR_server_arguments: NULL
[theon:21381] sess_dir_finalize: proc session dir does not exist
[theon:21381] sess_dir_finalize: job session dir does not exist
[theon:21381] sess_dir_finalize: jobfam session dir not empty - leaving
[theon:21381] sess_dir_finalize: jobfam session dir not empty - leaving
[theon:21381] sess_dir_finalize: top session dir not empty - leaving
[theon:21381] Job [7118,1] has launched
[theon:21381] sess_dir_finalize: proc session dir does not exist
[theon:21381] sess_dir_finalize: job session dir does not exist
[theon:21381] sess_dir_finalize: jobfam session dir not empty - leaving
[theon:21381] sess_dir_finalize: jobfam session dir not empty - leaving
[theon:21381] sess_dir_finalize: top session dir not empty - leaving
[theon:21381] sess_dir_finalize: proc session dir does not exist
[theon:21381] sess_dir_finalize: job session dir does not exist
[theon:21381] sess_dir_finalize: jobfam session dir not empty - leaving
[theon:21381] sess_dir_finalize: jobfam session dir not empty - leaving
[theon:21381] sess_dir_finalize: top session dir not empty - leaving
[theon:21381] sess_dir_finalize: proc session dir does not exist
[theon:21381] sess_dir_finalize: job session dir does not exist
[theon:21381] sess_dir_finalize: jobfam session dir not empty - leaving
[theon:21381] sess_dir_finalize: jobfam session dir not empty - leaving
[theon:21381] sess_dir_finalize: top session dir not empty - leaving
[theon:21381] sess_dir_finalize: proc session dir does not exist
[theon:21381] sess_dir_finalize: job session dir does not exist
[theon:21381] sess_dir_finalize: jobfam session dir does not exist
[theon:21381] sess_dir_finalize: jobfam session dir does not exist
[theon:21381] sess_dir_finalize: top session dir not empty - leaving
[theon:21381] sess_dir_cleanup: job session dir does not exist
[theon:21381] sess_dir_cleanup: top session dir not empty - leaving
[theon:21381] [[7118,0],0] Releasing job data for [7118,0]
[theon:21381] [[7118,0],0] Releasing job data for [7118,1]
[theon:21381] sess_dir_cleanup: job session dir does not exist
[theon:21381] sess_dir_cleanup: top session dir not empty - leaving
exiting with status 0
theon$
Having difficulties reproducing. I do observe the "session dir does not exist ..." messages from mpirun when using the -d option; these appear to be harmless. Does this problem appear to be specific to your Solaris system?
I'm getting nearly identical results on macOS 12.2, with Open MPI 4.1.2 installed via Homebrew. I'm using software called Bertini.
mpiexec -np 2 --oversubscribe -tag-output -d bertini
[MATH2674510x:18559] procdir: /var/folders/53/j42yr5ln4753p482rthz0bfr0000gq/T//ompi.MATH2674510x.503/pid.18559/0/0
[MATH2674510x:18559] jobdir: /var/folders/53/j42yr5ln4753p482rthz0bfr0000gq/T//ompi.MATH2674510x.503/pid.18559/0
[MATH2674510x:18559] top: /var/folders/53/j42yr5ln4753p482rthz0bfr0000gq/T//ompi.MATH2674510x.503/pid.18559
[MATH2674510x:18559] top: /var/folders/53/j42yr5ln4753p482rthz0bfr0000gq/T//ompi.MATH2674510x.503
[MATH2674510x:18559] tmp: /var/folders/53/j42yr5ln4753p482rthz0bfr0000gq/T/
[MATH2674510x:18559] sess_dir_cleanup: job session dir does not exist
[MATH2674510x:18559] sess_dir_cleanup: top session dir does not exist
[MATH2674510x:18559] procdir: /var/folders/53/j42yr5ln4753p482rthz0bfr0000gq/T//ompi.MATH2674510x.503/pid.18559/0/0
[MATH2674510x:18559] jobdir: /var/folders/53/j42yr5ln4753p482rthz0bfr0000gq/T//ompi.MATH2674510x.503/pid.18559/0
[MATH2674510x:18559] top: /var/folders/53/j42yr5ln4753p482rthz0bfr0000gq/T//ompi.MATH2674510x.503/pid.18559
[MATH2674510x:18559] top: /var/folders/53/j42yr5ln4753p482rthz0bfr0000gq/T//ompi.MATH2674510x.503
[MATH2674510x:18559] tmp: /var/folders/53/j42yr5ln4753p482rthz0bfr0000gq/T/
[MATH2674510x:18559] [[20033,0],0] Releasing job data for [INVALID]
[MATH2674510x:18559] Job [20033,1] has launched
MPIR_being_debugged = 0
MPIR_debug_state = 1
MPIR_partial_attach_ok = 1
MPIR_i_am_starter = 0
MPIR_forward_output = 0
MPIR_proctable_size = 2
MPIR_proctable:
(i, host, exe, pid) = (0, MATH2674510x, /usr/local/bin/bertini, 18560)
(i, host, exe, pid) = (1, MATH2674510x, /usr/local/bin/bertini, 18561)
MPIR_executable_path: NULL
MPIR_server_arguments: NULL
[1,0]<stderr>:[MATH2674510x:18560] procdir: /var/folders/53/j42yr5ln4753p482rthz0bfr0000gq/T//ompi.MATH2674510x.503/pid.18559/1/0
[1,0]<stderr>:[MATH2674510x:18560] jobdir: /var/folders/53/j42yr5ln4753p482rthz0bfr0000gq/T//ompi.MATH2674510x.503/pid.18559/1
[1,0]<stderr>:[MATH2674510x:18560] top: /var/folders/53/j42yr5ln4753p482rthz0bfr0000gq/T//ompi.MATH2674510x.503/pid.18559
[1,0]<stderr>:[MATH2674510x:18560] top: /var/folders/53/j42yr5ln4753p482rthz0bfr0000gq/T//ompi.MATH2674510x.503
[1,0]<stderr>:[MATH2674510x:18560] tmp: /var/folders/53/j42yr5ln4753p482rthz0bfr0000gq/T/
[1,1]<stderr>:[MATH2674510x:18561] procdir: /var/folders/53/j42yr5ln4753p482rthz0bfr0000gq/T//ompi.MATH2674510x.503/pid.18559/1/1
[1,1]<stderr>:[MATH2674510x:18561] jobdir: /var/folders/53/j42yr5ln4753p482rthz0bfr0000gq/T//ompi.MATH2674510x.503/pid.18559/1
[1,1]<stderr>:[MATH2674510x:18561] top: /var/folders/53/j42yr5ln4753p482rthz0bfr0000gq/T//ompi.MATH2674510x.503/pid.18559
[1,1]<stderr>:[MATH2674510x:18561] top: /var/folders/53/j42yr5ln4753p482rthz0bfr0000gq/T//ompi.MATH2674510x.503
[1,1]<stderr>:[MATH2674510x:18561] tmp: /var/folders/53/j42yr5ln4753p482rthz0bfr0000gq/T/
[1,0]<stdout>:
[1,0]<stdout>: Bertini(TM) v1.6
[1,0]<stdout>: (May 22, 2018)
... [snipped Bertini output]
[1,0]<stdout>:Summary for the last regeneration level:
[1,0]<stdout>:Paths Tracked: 512
[MATH2674510x:18559] sess_dir_finalize: proc session dir does not exist
[MATH2674510x:18559] sess_dir_finalize: job session dir does not exist
[MATH2674510x:18559] sess_dir_finalize: jobfam session dir not empty - leaving
[MATH2674510x:18559] sess_dir_finalize: jobfam session dir not empty - leaving
[MATH2674510x:18559] sess_dir_finalize: top session dir not empty - leaving
[MATH2674510x:18559] sess_dir_finalize: proc session dir does not exist
[MATH2674510x:18559] sess_dir_finalize: job session dir does not exist
[MATH2674510x:18559] sess_dir_finalize: jobfam session dir not empty - leaving
[MATH2674510x:18559] sess_dir_finalize: jobfam session dir not empty - leaving
[MATH2674510x:18559] sess_dir_finalize: top session dir not empty - leaving
[MATH2674510x:18559] sess_dir_finalize: proc session dir does not exist
[MATH2674510x:18559] sess_dir_finalize: job session dir does not exist
[MATH2674510x:18559] sess_dir_finalize: jobfam session dir does not exist
[MATH2674510x:18559] sess_dir_finalize: jobfam session dir does not exist
[MATH2674510x:18559] sess_dir_finalize: top session dir does not exist
[MATH2674510x:18559] sess_dir_cleanup: job session dir does not exist
[MATH2674510x:18559] sess_dir_cleanup: top session dir does not exist
[MATH2674510x:18559] [[20033,0],0] Releasing job data for [20033,0]
[MATH2674510x:18559] [[20033,0],0] Releasing job data for [20033,1]
[MATH2674510x:18559] sess_dir_cleanup: job session dir does not exist
[MATH2674510x:18559] sess_dir_cleanup: top session dir does not exist
The program still produces its results, but I'm getting these errors from MPI. Two additional pieces of information:
My temporary directory /var/folders/53/j42yr5ln4753p482rthz0bfr0000gq/T exists with rwx permissions, and it does not contain any folder starting with ompi*.
The issue with OS X is likely a different (and well known) one: $TMPDIR is pretty long by default, and it leads to the path being truncated. Try to export TMPDIR=/tmp before invoking mpirun on OS X and see how it goes.
export TMPDIR=/tmp
Indeed, on macOS this worked. Please pardon if my comments read as noise on this issue. Thank you very much!
Was this solved for @afborchert?
Oops -- I didn't notice that the "That fixed it!" notice wasn't from the original poster. My bad!
@afborchert Can you chime in as to whether that fixed it for you?
@ofloveandhate No, the close by @jsquyres was a surprise to me. However, I am not sure right now how I can help @hppritcha and possibly others to analyze this problem. Right now, we have Open MPI 4.1.2 just on Solaris. The messages appear indeed to be harmless but are, of course, annoying. The effect is intermittent, and attempts to reproduce it while running under truss (a system call tracer comparable to strace) have so far been unsuccessful. Please also note that the problem under Solaris is unrelated to the macOS problem.
This problem is still present in v4.1.4.
Might this be related to #11123? See the second item in that issue, "Issue 2, error unlinking mmap backing file". There, I observed the backing file getting deleted prematurely somewhere else in the code, causing the intended unlink call to fail.
In my scenario I LD_PRELOADed remove/unlink to find where this was happening. If you can reproduce it, you could use something like this to track it down. A snippet of what I had:
#define _GNU_SOURCE
#include <stdio.h>
#include <dlfcn.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
// gcc -shared -fPIC -o ul.so ul.c
int unlink(const char *path) {
    int (*original)(const char *) = dlsym(RTLD_NEXT, "unlink");
    static int cnt = 0;
    printf("==Unlink @ %s (%d)\n", path, ++cnt);
    return original(path);
}
int remove(const char *path) {
    int (*original)(const char *) = dlsym(RTLD_NEXT, "remove");
    static int cnt = 0;
    printf("==Remove @ %s (%d)\n", path, ++cnt);
    return original(path);
}
(And when I caught the phenomenon using these functions, a deliberately triggered segfault to generate a backtrace.)
This is a case of OMPI registering an "epilog" call with PMIx - in this case, to remove the shmem backing file - and then duplicating the action in OMPI. It doesn't matter to PMIx who does the operation - we're just doing it because OMPI asked us to do so. Easy solution is to either have OMPI not register the "epilog", or to remove the "unlink" operation from OMPI.
I struggled a bit to reproduce the problem with a preloaded unlink wrapper as suggested by @gkatev, but was successful once I adapted it to postpone the output to an atexit handler:
theon$ make
LD_PRELOAD=/home/borchert/tmp/mpi-unlink/unlink.so mpirun -np 4 mpi-test
2
1
3
--------------------------------------------------------------------------
A system call failed during shared memory initialization that should
not have. It is likely that your MPI job will now either abort or
experience performance degradation.
Local host: theon
System call: unlink(2) /tmp/ompi.theon.120/pid.984/1/vader_segment.theon.120.4b930001.0
Error: No such file or directory (errno 2)
--------------------------------------------------------------------------
unlinked files:
[ 986, 0] /tmp/ompi.theon.120/pid.984/1/vader_segment.theon.120.4b930001.1
unlinked files:
[ 988, 0] /tmp/ompi.theon.120/pid.984/1/vader_segment.theon.120.4b930001.3
unlinked files:
[ 987, 0] /tmp/ompi.theon.120/pid.984/1/vader_segment.theon.120.4b930001.2
unlinked files:
[ 985, 0] /tmp/ompi.theon.120/pid.984/1/vader_segment.theon.120.4b930001.0
unlinked files:
[ 984, 0] /tmp/ompi.theon.120/pid.984/1/vader_segment.theon.120.4b930001.0
[ 984, 1] /tmp/ompi.theon.120/pid.984/0/debugger_attach_fifo
[ 984, 2] /tmp/ompi.theon.120/pid.984/pmix_dstor_ds21_984/smseg-1267924993-0
[ 984, 3] /tmp/ompi.theon.120/pid.984/pmix_dstor_ds21_984/smdataseg-1267924993-0
[ 984, 4] /tmp/ompi.theon.120/pid.984/pmix_dstor_ds21_984/initial-pmix_shared-segment-0
[ 984, 5] /tmp/ompi.theon.120/pid.984/pmix_dstor_ds21_984/smlockseg-1267924993
[ 984, 6] /tmp/ompi.theon.120/pid.984/pmix_dstor_ds21_984/smlockseg-1267924993
[ 984, 7] /tmp/ompi.theon.120/pid.984/pmix_dstor_ds12_984/initial-pmix_shared-segment-0
[ 984, 8] /tmp/ompi.theon.120/pid.984/pmix_dstor_ds12_984/dstore_sm.lock
[ 984, 9] /tmp/ompi.theon.120/pid.984/0/debugger_attach_fifo
[ 984, 10] /tmp/ompi.theon.120/pid.984/contact.txt
theon$
The first number in each line is the pid, the second the counter. As it appears, the file /tmp/ompi.theon.120/pid.984/1/vader_segment.theon.120.4b930001.0 is unlinked by two processes, 984 and 985. In cases when it runs through without errors, each process appears to unlink just one vader_segment file, like here:
theon$ make
LD_PRELOAD=/home/borchert/tmp/mpi-unlink/unlink.so mpirun -np 4 mpi-test
1
2
3
unlinked files:
[ 980, 0] /tmp/ompi.theon.120/pid.977/1/vader_segment.theon.120.4b9a0001.2
unlinked files:
[ 981, 0] /tmp/ompi.theon.120/pid.977/1/vader_segment.theon.120.4b9a0001.3
unlinked files:
[ 978, 0] /tmp/ompi.theon.120/pid.977/1/vader_segment.theon.120.4b9a0001.0
unlinked files:
[ 979, 0] /tmp/ompi.theon.120/pid.977/1/vader_segment.theon.120.4b9a0001.1
unlinked files:
[ 977, 0] /tmp/ompi.theon.120/pid.977/0/debugger_attach_fifo
[ 977, 1] /tmp/ompi.theon.120/pid.977/pmix_dstor_ds21_977/smseg-1268383745-0
[ 977, 2] /tmp/ompi.theon.120/pid.977/pmix_dstor_ds21_977/smdataseg-1268383745-0
[ 977, 3] /tmp/ompi.theon.120/pid.977/pmix_dstor_ds21_977/initial-pmix_shared-segment-0
[ 977, 4] /tmp/ompi.theon.120/pid.977/pmix_dstor_ds21_977/smlockseg-1268383745
[ 977, 5] /tmp/ompi.theon.120/pid.977/pmix_dstor_ds21_977/smlockseg-1268383745
[ 977, 6] /tmp/ompi.theon.120/pid.977/pmix_dstor_ds12_977/initial-pmix_shared-segment-0
[ 977, 7] /tmp/ompi.theon.120/pid.977/pmix_dstor_ds12_977/dstore_sm.lock
[ 977, 8] /tmp/ompi.theon.120/pid.977/0/debugger_attach_fifo
[ 977, 9] /tmp/ompi.theon.120/pid.977/contact.txt
theon$
Here is the updated unlink wrapper:
#include <dlfcn.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
// gcc -shared -fPIC -o ul.so ul.c
static int initialized = 0;
enum { max_entries = 32, max_length = 128 };
static char paths[max_entries][max_length];
static int path_index = 0;
static void finish(void) {
    if (path_index > 0) {
        int pid = getpid();
        fprintf(stderr, "unlinked files:\n");
        for (int index = 0; index < path_index; ++index) {
            fprintf(stderr, "[%5d, %2d] %s\n", pid, index, paths[index]);
        }
    }
}
static void record(const char* path) {
    if (!initialized) {
        atexit(finish);
        initialized = true;
    }
    if (path_index < max_entries) {
        strncpy(paths[path_index++], path, max_length);
    }
}
int unlink(const char *path) {
    record(path);
    int (*original)(const char *) = dlsym(RTLD_NEXT, "unlink");
    return original(path);
}
int remove(const char *path) {
    record(path);
    int (*original)(const char *) = dlsym(RTLD_NEXT, "remove");
    return original(path);
}
Now I've tried to get an abort and core dump as soon as one of the vader files gets unlinked with an error. Here is the output of such a case:
theon$ make
LD_PRELOAD=/home/borchert/tmp/mpi-unlink/unlink.so mpirun -np 4 mpi-test
3
1
2
[1524] unlink("/tmp/ompi.theon.120/pid.1522/1/vader_segment.theon.120.4db90001.0") -> -1 (errno = 2)
what happened before:
[theon:01524] *** Process received signal ***
[theon:01524] Signal: Abort (6)
[theon:01524] Signal code: (-1)
[theon:01524] [ 0] /lib/amd64/libc.so.1'__sighndlr+0x6 [0x7fffbdc47eb6]
[theon:01524] [ 1] /lib/amd64/libc.so.1'call_user_handler+0x2f1 [0x7fffbdc39851]
[theon:01524] [ 2] /lib/amd64/libc.so.1'__lwp_sigqueue+0xa [0x7fffbdc4eaea]
[theon:01524] [ 3] /lib/amd64/libc.so.1'raise+0x19 [0x7fffbdb85dc9]
[theon:01524] [ 4] /lib/amd64/libc.so.1'abort+0x60 [0x7fffbdb4f7c0]
[theon:01524] [ 5] /home/borchert/tmp/mpi-unlink/unlink.so'unlink+0xd3 [0x7fff4dc00af8]
[theon:01524] [ 6] /opt/ulm/dublin/lib/amd64/openmpi/mca_shmem_mmap.so'0x1353 [0x7fff63c01353]
[theon:01524] [ 7] /opt/ulm/dublin/lib/amd64/openmpi/mca_btl_vader.so'0x2456 [0x7fff56002456]
[theon:01524] [ 8] /opt/ulm/dublin/lib/amd64/libopen-pal.so.40.30.2'0x748a6 [0x7fff65c748a6]
[theon:01524] [ 9] /opt/ulm/dublin/lib/amd64/libopen-pal.so.40.30.2'mca_base_framework_close+0x7c [0x7fff65c5e2ec]
[theon:01524] [10] /opt/ulm/dublin/lib/amd64/libopen-pal.so.40.30.2'mca_base_framework_close+0x7c [0x7fff65c5e2ec]
[theon:01524] [11] /opt/ulm/dublin/lib/amd64/libmpi.so.40.30.4'ompi_mpi_finalize+0x87d [0x7fff58c5389d]
[theon:01524] [12] /home/borchert/tmp/mpi-unlink/mpi-test'0xc7a [0x400c7a]
[theon:01524] [13] /home/borchert/tmp/mpi-unlink/mpi-test'0xa23 [0x400a23]
[theon:01524] *** End of error message ***
unlinked files:
[ 1527, 0] /tmp/ompi.theon.120/pid.1522/1/vader_segment.theon.120.4db90001.3
unlinked files:
[ 1526, 0] /tmp/ompi.theon.120/pid.1522/1/vader_segment.theon.120.4db90001.2
unlinked files:
[ 1525, 0] /tmp/ompi.theon.120/pid.1522/1/vader_segment.theon.120.4db90001.1
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node theon exited on signal 6 (Abort).
--------------------------------------------------------------------------
unlinked files:
[ 1522, 0] /tmp/ompi.theon.120/pid.1522/1/vader_segment.theon.120.4db90001.0
[ 1522, 1] /tmp/ompi.theon.120/pid.1522/0/debugger_attach_fifo
[ 1522, 2] /tmp/ompi.theon.120/pid.1522/0/debugger_attach_fifo
[ 1522, 3] /tmp/ompi.theon.120/pid.1522/pmix_dstor_ds21_1522/initial-pmix_shared-segment-0
[ 1522, 4] /tmp/ompi.theon.120/pid.1522/pmix_dstor_ds21_1522/smlockseg-1303969793
[ 1522, 5] /tmp/ompi.theon.120/pid.1522/pmix_dstor_ds21_1522/smlockseg-1303969793
[ 1522, 6] /tmp/ompi.theon.120/pid.1522/pmix_dstor_ds21_1522/smseg-1303969793-0
[ 1522, 7] /tmp/ompi.theon.120/pid.1522/pmix_dstor_ds21_1522/smdataseg-1303969793-0
[ 1522, 8] /tmp/ompi.theon.120/pid.1522/pmix_dstor_ds21_1522/smseg-1303969793-0
[ 1522, 9] /tmp/ompi.theon.120/pid.1522/pmix_dstor_ds21_1522/smdataseg-1303969793-0
[ 1522, 10] /tmp/ompi.theon.120/pid.1522/pmix_dstor_ds12_1522/initial-pmix_shared-segment-0
[ 1522, 11] /tmp/ompi.theon.120/pid.1522/pmix_dstor_ds12_1522/dstore_sm.lock
[ 1522, 12] /tmp/ompi.theon.120/pid.1522/contact.txt
make: *** [Makefile:3: test] Error 134
theon$
This is the updated version of the preloaded unlink wrapper:
#include <dlfcn.h>
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
// gcc -shared -fPIC -o ul.so ul.c
static int initialized = 0;
enum { max_entries = 32, max_length = 128 };
static char paths[max_entries][max_length];
static int path_index = 0;
static void finish(void) {
    if (path_index > 0) {
        int pid = getpid();
        fprintf(stderr, "unlinked files:\n");
        for (int index = 0; index < path_index; ++index) {
            fprintf(stderr, "[%5d, %2d] %s\n", pid, index, paths[index]);
        }
    }
}
static void record(const char* path) {
    if (!initialized) {
        atexit(finish);
        initialized = true;
    }
    if (path_index < max_entries) {
        strncpy(paths[path_index++], path, max_length);
    }
}
int unlink(const char *path) {
    int (*original)(const char *) = dlsym(RTLD_NEXT, "unlink");
    int rval = original(path);
    if (rval < 0) {
        if (strstr(path, "vader")) {
            fprintf(stderr, "[%d] unlink(\"%s\") -> %d (errno = %d)\n",
                    (int) getpid(), path, rval, errno);
            fprintf(stderr, "what happened before:\n");
            finish();
            abort();
        }
    }
    record(path);
    return rval;
}
int remove(const char *path) {
    record(path);
    int (*original)(const char *) = dlsym(RTLD_NEXT, "remove");
    return original(path);
}
So (assuming I'm interpreting the output correctly) the file /tmp/ompi.theon.120/pid.1522/1/vader_segment.theon.120.4db90001.0 gets unlinked twice, by PIDs 1524 and 1522:
1. [1524] unlink("/tmp/ompi.theon.120/pid.1522/1/vader_segment.theon.120.4db90001.0")
2. [ 1522, 0] /tmp/ompi.theon.120/pid.1522/1/vader_segment.theon.120.4db90001.0
The trace of 1524 looks about right? (right = where we expect unlink to be triggered). I'm assuming the path it takes is vader_finalize -> opal_shmem_unlink -> (shmem/mmap) segment_unlink, though we'd optimally need a trace with files/lines to be sure.
So I would be curious where the other unlink, by 1522, took place. Could you trigger an abort when the specific vader segment gets unlinked even if the call was successful? If you have a debug build you might also use --mca opal_signal "" to not use the internal signal handler and get a core dump to use with gdb to see the files/lines (AFAIK ompi's backtrace does not show that).
Ok, I adapted the preloaded wrapper to abort whenever a vader segment file is unlinked by the parent process. Here are the results:
theon$ make
LD_PRELOAD=/home/borchert/tmp/mpi-unlink/unlink.so mpirun -np 4 mpi-test
1
2
3
[2069, 2071] unlink("/tmp/ompi.theon.120/pid.2071/1/vader_segment.theon.120.405c0001.3") -> 0 (errno = 2)
make: *** [Makefile:3: test] Abort (core dumped)
theon$
And here is the backtrace, without line numbers, as Open MPI has been compiled without the "-g" flag:
theon$ adb mpi-test core
core file = core -- program ``mpi-test'' on platform i86pc
SIGABRT: Abort
$C
00007fffbf3fe8c0 libc.so.1`__lwp_sigqueue+0xa()
00007fffbf3fe8e0 libc.so.1`raise+0x19()
00007fffbf3fe930 libc.so.1`abort+0xd1()
00007fffbf3fe970 unlink.so`unlink+0xf8()
00007fffbf3fea50 mca_pmix_pmix3x.so`dirpath_destroy+0x166()
00007fffbf3feb30 mca_pmix_pmix3x.so`pmix_execute_epilog+0x41e()
00007fffbf3fecb0 mca_pmix_pmix3x.so`server_switchyard+0xc25()
00007fffbf3fed00 mca_pmix_pmix3x.so`pmix_server_message_handler+0x45()
00007fffbf3fee70 mca_pmix_pmix3x.so`pmix_ptl_base_process_msg+0x1eb()
00007fffbf3feef0 libevent_core-2.1.so.6.0.2`event_process_active_single_queue+0x485()
00007fffbf3fef80 libevent_core-2.1.so.6.0.2`event_base_loop+0x50f()
00007fffbf3fefa0 mca_pmix_pmix3x.so`progress_engine+0x26()
00007fffbf3fefc0 libc.so.1`_thrp_setup+0xa4()
00007fffbf3fefd0 libc.so.1`_lwp_start()
theon$
And for documentation purposes the updated preloaded wrapper:
#include <dlfcn.h>
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
// gcc -shared -fPIC -o ul.so ul.c
static int initialized = 0;
enum { max_entries = 32, max_length = 128 };
static char paths[max_entries][max_length];
static int path_index = 0;
static int forked = 0;
static void finish(void) {
    if (path_index > 0) {
        int ppid = getppid();
        int pid = getpid();
        fprintf(stderr, "unlinked files:\n");
        for (int index = 0; index < path_index; ++index) {
            fprintf(stderr, "[%5d, %5d, %2d] %s\n", ppid, pid, index, paths[index]);
        }
    }
}
static void record(const char* path) {
    if (!initialized) {
        atexit(finish);
        initialized = true;
    }
    if (path_index < max_entries) {
        strncpy(paths[path_index++], path, max_length);
    }
}
pid_t fork(void) {
    int (*original)(void) = dlsym(RTLD_NEXT, "fork");
    ++forked;
    return original();
}
int unlink(const char *path) {
    int (*original)(const char *) = dlsym(RTLD_NEXT, "unlink");
    int rval = original(path);
    if (forked && strstr(path, "vader")) {
        fprintf(stderr, "[%d, %d] unlink(\"%s\") -> %d (errno = %d)\n",
                (int) getppid(), (int) getpid(), path, rval, errno);
        if (path_index > 0) {
            fprintf(stderr, "what happened before:\n");
            finish();
        }
        abort();
    }
    record(path);
    return rval;
}
int remove(const char *path) {
    record(path);
    int (*original)(const char *) = dlsym(RTLD_NEXT, "remove");
    return original(path);
}
Perfect, so now this does sound like what I observed in #11123. According to @rhc54's comment above, pmix_execute_epilog unlinks the backing file because OMPI requested that it do so(?). I myself don't know where or how this epilog registration takes place (or what the desired fix approach is).
So here is the logic behind the epilog registration. The "vader" component usually puts its backing file under the scratch session directory as specified by its local mpirun daemon, or by the resource manager. The session directory gets cleaned up by mpirun during finalize, and by resource managers if the application was launched using something like srun.
I forget the root cause, but sometimes the "vader" component puts its backing file under the /dev directory instead of under the scratch session directory. When this happens, there can be no cleanup by the RM or mpirun as they don't know this has happened. If the proc exits without doing a clean finalize (e.g., segfaults), then this backing file in the /dev directory gets left behind. System administrators notice things like that and get very upset about it.
So we added a "hook" that allows "vader" to register its backing file for cleanup with the local mpirun daemon. This still isn't perfect (e.g., the daemon could segfault as well as the app), but it provides an extra level of protection to ensure that any backing file placed in /dev gets cleaned up. As a result, our complaints from system administrators have pretty much evaporated.
The problem is that OMPI still attempts to unlink that file, which means we have a race condition between the two unlink commands. IIRC, PMIx does the "check for file existence and unlink if found", but that is also a time-of-check/time-of-use race condition.
I believe OMPI retained the cleanup as an insurance policy - if the daemon fails, then the app will finalize and terminate itself. Without OMPI cleanup, the backing file won't be removed.
I don't know the correct answer - I suspect there is a decent argument that some annoyance over verbose unlink warnings is a reasonable price to pay for ensuring OMPI doesn't pollute system directories. However, I leave that for others to debate. I don't know if there is a way to silence the unlink warnings, though I wonder if we can't execute them as a shell command and redirect their stderr to /dev/null? Inefficient, but this is finalize - so who cares?
Thanks, this all makes sense. Perhaps this could be made to be a soft error and not show an error message to the user when unlink fails. Or not show an error message when it happens during finalization. Or, not show an error message when it happens during finalization and errno is 2 (No such file or directory).
Or, if the PMIx safety-net unlink happened strictly after the explicit fine-grained unlinks, to catch whatever got left behind, all this wouldn't be a problem; but I imagine there are bigger reasons that dictate the finalization order (or lack thereof).
I believe the unlink messages come from here: https://github.com/open-mpi/ompi/blob/ffb0adcdd677a02ab396145c40352f8189132b86/opal/mca/shmem/mmap/shmem_mmap_module.c#L590-L597
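For illustration, here is a rough sketch of what the "ignore ENOENT during finalization" variant could look like at that site. It reuses the ds_buf->seg_name / opal_show_help() pattern from the snippet quoted further below; this is untested and meant only to show the idea, not an actual patch:
if (-1 == unlink(ds_buf->seg_name)) {
    int err = errno;   /* capture errno right away; ENOENT comes from <errno.h> */
    if (ENOENT == err) {
        /* Someone else (e.g. the PMIx epilog cleanup) already removed the
         * backing file; treat this as success instead of warning the user. */
        return OPAL_SUCCESS;
    }
    char hn[OPAL_MAXHOSTNAMELEN];
    gethostname(hn, sizeof(hn));
    opal_show_help("help-opal-shmem-mmap.txt", "sys call fail", 1, hn,
                   "unlink(2)", ds_buf->seg_name, strerror(err), err);
    return OPAL_ERROR;
}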
Precautions regarding cleaning up files in /dev can at least be skipped on Solaris, as non-root users cannot create anything below /dev and /dev/shm does not exist there. As far as I know, /dev/shm is a Linux-only feature.
I suspect that is getting too targeted - imagine doing something like that for every edge case.
Thanks, this all makes sense. Perhaps this could be made to be a soft error and not show an error message to the user when unlink fails. Or not show an error message when it happens during finalization. Or, not show an error message when it happens during finalization and errno is 2 (No such file or directory).
I suspect it could be made into an optional warning message - perhaps only when verbosity has been turned up. All you would need to do is add a little wrapper around it:
if (-1 == unlink(ds_buf->seg_name)) {
    if (1 < opal_output_get_verbosity(opal_shmem_base_framework.framework_output)) {
        int err = errno;
        char hn[OPAL_MAXHOSTNAMELEN];
        gethostname(hn, sizeof(hn));
        opal_show_help("help-opal-shmem-mmap.txt", "sys call fail", 1, hn,
                       "unlink(2)", ds_buf->seg_name, strerror(err), err);
        return OPAL_ERROR;
    }
    return OPAL_SUCCESS;
}
I now have a backtrace with line numbers included:
#0 0x00007fffbdc4eaea in __lwp_sigqueue () from /lib/64/libc.so.1
#1 0x00007fffbdc452a2 in thr_kill () from /lib/64/libc.so.1
#2 0x00007fffbdb85dc9 in raise () from /lib/64/libc.so.1
#3 0x00007fffbdb4f831 in abort () from /lib/64/libc.so.1
#4 0x00007fff31400bd4 in unlink (path=0x7cd4c0 "/tmp/ompi.theon.120/pid.4910/1/vader_segment.theon.120.5b650001.2") at unlink.c:55
#5 0x00007fff45c5e50a in dirpath_destroy (path=0x7bd5e0 "/tmp/ompi.theon.120/pid.4910/1", cd=0x7ca500, epi=0x7d0248)
at include/pmix_globals.c:609
#6 0x00007fff45c5e0b7 in pmix_execute_epilog (epi=0x7d0248) at include/pmix_globals.c:502
#7 0x00007fff45c9da09 in pmix_server_purge_events (peer=0x7d0050, proc=0x0) at server/pmix_server.c:835
#8 0x00007fff45cb5d6b in server_switchyard (peer=0x7d0050, tag=114, buf=0x7fffbf3fed50) at server/pmix_server.c:3483
#9 0x00007fff45cb9082 in pmix_server_message_handler (pr=0x7d0050, hdr=0x7cc684, buf=0x7fffbf3fed50, cbdata=0x0)
at server/pmix_server.c:3701
#10 0x00007fff45d574b3 in pmix_ptl_base_process_msg (fd=-1, flags=4, cbdata=0x7cc5b0) at base/ptl_base_sendrecv.c:789
#11 0x00007fff8762d8a5 in event_process_active_single_queue () from /usr/lib/64/libevent_core-2.1.so.6
#12 0x00007fff8762e15f in event_base_loop () from /usr/lib/64/libevent_core-2.1.so.6
#13 0x00007fff45ce21bb in progress_engine (obj=0x752ce0) at runtime/pmix_progress_threads.c:232
#14 0x00007fffbdc47ab4 in _thrp_setup () from /lib/64/libc.so.1
#15 0x00007fffbdc47d90 in ?? () from /lib/64/libc.so.1
#16 0x0000000000000000 in ?? ()
Try wrapping the warning message as shown above. If that helps resolve the issue, then please file a PR with that change. It needs to go into the OMPI "main" branch first, and then get cherry-picked back to the v5 and v4.1 release branches.
We've been running into this problem with certain systems during our nightly testing. The common thread seems to be systems running Ubuntu 22.04 LTS. I applied the patch suggested by @rhc54 to our Open MPI 4 builds, and this seems to have alleviated most of the issue. However, we are wondering if there could be a secondary effect in MPI applications that also use shared memory apart from Open MPI's use of shared memory.
While testing the NWCHEM app with our Open MPI 4.1.5 build on an Ubuntu 22.04 LTS system, we see the following error messages when NWCHEM tries to clean up its shm segments:
[0] Received an Error in Communication: (-1) comex_free_local: shm_unlink
comex_free_local: shm_unlink: No such file or directory
[25] Received an Error in Communication: (-1) comex_free_local: shm_unlink
comex_free_local: shm_unlink: No such file or directory
[4] Received an Error in Communication: (-1) comex_free_local: shm_unlink
comex_free_local: shm_unlink: No such file or directory
[8] Received an Error in Communication: (-1) comex_free_local: shm_unlink
comex_free_local: shm_unlink: No such file or directory
[11] Received an Error in Communication: (-1) comex_free_local: shm_unlink
comex_free_local: shm_unlink: No such file or directory
[27] Received an Error in Communication: (-1) comex_free_local: shm_unlink
comex_free_local: shm_unlink: No such file or directory
[9] Received an Error in Communication: (-1) comex_free_local: shm_unlink
comex_free_local: shm_unlink: No such file or directory
[20] Received an Error in Communication: (-1) comex_free_local: shm_unlink
comex_free_local: shm_unlink: No such file or directory
[14] Received an Error in Communication: (-1) comex_free_local: shm_unlink
comex_free_local: shm_unlink: No such file or directory
[12] Received an Error in Communication: (-1) comex_free_local: shm_unlink
comex_free_local: shm_unlink: No such file or directory
[23] Received an Error in Communication: (-1) comex_free_local: shm_unlink
comex_free_local: shm_unlink: No such file or directory
[6] Received an Error in Communication: (-1) comex_free_local: shm_unlink
comex_free_local: shm_unlink: No such file or directory
[15] Received an Error in Communication: (-1) comex_free_local: shm_unlink
comex_free_local: shm_unlink: No such file or directory
[29] Received an Error in Communication: (-1) comex_free_local: shm_unlink
comex_free_local: shm_unlink: No such file or directory
[24] Received an Error in Communication: (-1) comex_free_local: shm_unlink
comex_free_local: shm_unlink: No such file or directory
comex_free_local: shm_unlink: No such file or directory
[13] Received an Error in Communication: (-1) comex_free_local: shm_unlink
comex_free_local: shm_unlink: No such file or directory
[10] Received an Error in Communication: (-1) comex_free_local: shm_unlink
comex_free_local: shm_unlink: No such file or directory
[17] Received an Error in Communication: (-1) comex_free_local: shm_unlink
comex_free_local: shm_unlink: No such file or directory
[19] Received an Error in Communication: (-1) comex_free_local: shm_unlink
comex_free_local: shm_unlink: No such file or directory
[2] Received an Error in Communication: (-1) comex_free_local: shm_unlink
comex_free_local: shm_unlink: No such file or directory
[22] Received an Error in Communication: (-1) comex_free_local: shm_unlink
comex_free_local: shm_unlink: No such file or directory
[1] Received an Error in Communication: (-1) comex_free_local: shm_unlink
comex_free_local: shm_unlink: No such file or directory
[16] Received an Error in Communication: (-1) comex_free_local: shm_unlink
[26] Received an Error in Communication: (-1) comex_free_local: shm_unlink
comex_free_local: shm_unlink: No such file or directory
[21] Received an Error in Communication: (-1) comex_free_local: shm_unlink
comex_free_local: shm_unlink: No such file or directory
[28] Received an Error in Communication: (-1) comex_free_local: shm_unlink
comex_free_local: shm_unlink: No such file or directory
[7] Received an Error in Communication: (-1) comex_free_local: shm_unlink
comex_free_local: shm_unlink: No such file or directory
[5] Received an Error in Communication: (-1) comex_free_local: shm_unlink
comex_free_local: shm_unlink: No such file or directory
[3] Received an Error in Communication: (-1) comex_free_local: shm_unlink
comex_free_local: shm_unlink: No such file or directory
[18] Received an Error in Communication: (-1) comex_free_local: shm_unlink
comex_free_local: shm_unlink: No such file or directory
[30] Received an Error in Communication: (-1) comex_free_local: shm_unlink
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 4 in communicator MPI COMMUNICATOR 3 DUP FROM 0
with errorcode -1.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
I believe this uses the vader shm mechanism on a single node with multiple GPUs.
I'm wondering if this could be tied to the shm cleanup behavior in Open MPI, where Open MPI is cleaning up the shm segments without the app being aware of it.
Thanks in advance.
comex_free_local: shm_unlink: No such file or directory
This obviously isn't coming from OMPI, so I'm not sure I understand the title of this issue since it wouldn't be coming from MPI_Finalize. Are you perhaps saying you call MPI_Finalize and then are trying to cleanup some files you created in the OMPI session directory tree?
If so, then yes, you would get such error messages as we cleanup everything in that tree during "finalize". If you want to continue using a file in there, then you probably need to delay calling MPI_Finalize until done with it.
Sorry for the confusion - you are correct that the error messages are coming from NWCHEM and not OMPI. My question pertained to whether there could be an interaction between OMPI and NWCHEM involving shared memory, and it sounds like there is.
This question came to me from one of our applications engineers, so I will pass this along to them and see what they say. Thanks!
Yeah, it sounds like they probably put their shmem files in our session directory tree - which is totally fine. Somewhat the point of the session directory. However, the constraint on doing so is that you cannot "finalize" OMPI until you are done with those files as we will clean them up.
Point me to the text of the MPI standard that says NWChem is not allowed to create its own shm files in any directory it wants.
OMPI should not delete shm files it didn't create. This is a bug and it needs to be fixed.
Sure they can - nobody is restricting it. But we created the session directory tree and are therefore obligated to clean it up. We get harpooned any time we fail to fully remove all the directories.
In the current version of mpirun on OMPI's main branch, we actually do delay cleanup until after the app terminates - but that was an implementation choice I made. The downside is that now we don't clean up those directories if the daemon fails, so I'm sure people will be back to complain about that scenario too.
You cannot please everyone - all you can do is try and explain the constraints 🤷♂️
Based on the source code of ComEx, it is using either /tmp or /dev/shm. OMPI creates neither of those directories. You cannot possibly be arguing that OMPI is allowed to scrub files from /tmp that it did not create.
No, and we don't do that - so those unlink problems are something in their code. I know nothing about what they are doing. If the files are not in our session directory tree, then we won't touch them.
Background information
What version of Open MPI are you using?
4.1.2
Describe how Open MPI was installed
Downloaded from https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.2.tar.bz2, unpacked, and built with the following script:
Please describe the system on which you are running
Details of the problem
Intermittently, even the simplest MPI applications that are run locally with shared memory fail at MPI_Finalize() with errors like the following:
This happens even for the most trivial test programs, like the following:
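(An illustrative sketch of such a trivial test program, not the exact program from the report: the send/receive pattern is an assumption based on the output shown earlier in this thread, where only rank 0 prints.)
#include <mpi.h>
#include <stdio.h>
/* Ranks 1..n-1 send their rank number to rank 0, which prints whatever it
 * receives; with all ranks on one node this exercises the shared-memory
 * (vader) BTL, and the intermittent unlink warning shows up at MPI_Finalize().
 * Build and run: mpicc -o mpi-test mpi-test.c && mpirun -np 4 mpi-test */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (rank == 0) {
        for (int i = 1; i < size; ++i) {
            int value;
            MPI_Recv(&value, 1, MPI_INT, MPI_ANY_SOURCE, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("%d\n", value);
        }
    } else {
        MPI_Send(&rank, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}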
Just run mpirun multiple times and it will eventually fail:
We had the very same problem with Open MPI 4.1.1. We had no such problems with Open MPI 2.1.6.