Running dcp with --uid and --gid frequently results in the message shown below at the end of the run. This is with Open MPI v4.1.0, running mpirun as root. I understand that running as root is frowned upon, but our current design forces our hand: we need to be able to become any user to perform data movement inside of containers.
A full command typically looks something like this:
And here is the message that appears frequently after the dcp output:
# --------------------------------------------------------------------------
# A system call failed during shared memory initialization that should
# not have. It is likely that your MPI job will now either abort or
# experience performance degradation.
#
# Local host: nnf-dm-worker-xj45m
# System call: unlink(2) /dev/shm/vader_segment.nnf-dm-worker-xj45m.8d600001.7
# Error: Operation not permitted (errno 1)
# --------------------------------------------------------------------------
Most of the time, this message appears to be harmless: dcp completes successfully, as does mpirun. However, there are some cases where it can segfault.
I tried to get to the root of the message and, as part of that journey, I ended up upgrading to Open MPI v4.1.6 to see if anything changes. It does, and these messages now appear with every single invocation of dcp:
[[16880,0],0] ORTE_ERROR_LOG: Data unpack failed in file util/show_help.c at line
501\n[nnf-dm-controller-manager-67cbcc74c5-s8bg6:00394] [[16880,0],0] ORTE_ERROR_LOG:
Data unpack failed in file util/show_help.c at line 501\n[nnf-dm-controller-manager-67cbcc74c5-s8bg6:00394]
[[16880,0],0] ORTE_ERROR_LOG: Data unpack failed in file util/show_help.c at line
501\n[nnf-dm-controller-manager-67cbcc74c5-s8bg6:00394] [[16880,0],0] ORTE_ERROR_LOG:
Data unpack failed in file util/show_help.c at line 501\n[nnf-dm-controller-manager-67cbcc74c5-s8bg6:00394]
[[16880,0],0] ORTE_ERROR_LOG: Data unpack failed in file util/show_help.c at line
501\n[nnf-dm-controller-manager-67cbcc74c5-s8bg6:00394] [[16880,0],0] ORTE_ERROR_LOG:
Data unpack failed in file util/show_help.c at line 501\n[nnf-dm-controller-manager-67cbcc74c5-s8bg6:00394]
[[16880,0],0] ORTE_ERROR_LOG: Data unpack failed in file util/show_help.c at line
501\n[nnf-dm-controller-manager-67cbcc74c5-s8bg6:00394] [[16880,0],0] ORTE_ERROR_LOG:
Data unpack failed in file util/show_help.c at line 501\n"
I noticed that if you drop --uid and --gid, these messages go away. That leads me to believe that the problem lies in the combination of running mpirun as root and having dcp do its work as a non-root user.
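One possible explanation for the unlink(2) failure, which is an assumption and not something confirmed in this report, is the sticky bit on /dev/shm: if the shared memory segment is created while the process is still root and removed after dcp has switched to the requested user, the Linux sticky-directory rule would refuse the unlink with EPERM. The sketch below models only that rule. The function name and uid 1000 are hypothetical, and ordinary directory write permission is not checked.

```python
import stat


def may_unlink_in_sticky_dir(euid, file_uid, dir_uid, dir_mode):
    """Return True if the sticky-bit rule lets euid unlink the file."""
    if not dir_mode & stat.S_ISVTX:
        return True  # no sticky bit: the rule does not apply
    if euid == 0:
        return True  # simplification: treat root as holding CAP_FOWNER
    # Otherwise only the file's owner or the directory's owner may unlink.
    return euid in (file_uid, dir_uid)


# Typical mode of /dev/shm: world-writable with the sticky bit set.
SHM_MODE = stat.S_IFDIR | 0o1777

# Segment owned by root, unlink attempted while still root.
print(may_unlink_in_sticky_dir(0, 0, 0, SHM_MODE))     # True
# Same segment, unlink attempted after switching to uid 1000.
print(may_unlink_in_sticky_dir(1000, 0, 0, SHM_MODE))  # False
```

If this is what happens, it would be consistent with the error appearing only when --uid and --gid are used.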
In lieu of using dcp --uid/--gid, I tried to use setpriv before the dcp command:
But this results in an error:
[nnf-dm-worker-jzl9t:00111] PMIX ERROR: UNREACHABLE in file ptl_tcp_component.c at line 1849
[nnf-dm-worker-jzl9t:00121] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 168
[nnf-dm-worker-jzl9t:00121] OPAL ERROR: Unreachable in file pmix3x_client.c at line 111
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[nnf-dm-worker-jzl9t:00121] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[22121,1],3]
Exit code: 1
--------------------------------------------------------------------------
I think --uid and --gid should apply only to the file operations and not to MPI functionality. Is that possible?
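To illustrate what is being asked for, here is a minimal sketch of switching only the effective IDs around the file operations and restoring them afterwards, so that MPI setup and teardown would still run with the original identity. This is not how dcp implements --uid/--gid. The helper name effective_ids is hypothetical, supplementary groups are not handled, and the MPI steps appear only as comments.

```python
import os
from contextlib import contextmanager


@contextmanager
def effective_ids(uid, gid):
    """Temporarily switch the effective uid/gid, then restore them."""
    saved_uid, saved_gid = os.geteuid(), os.getegid()
    os.setegid(gid)  # change the group first, while still privileged
    try:
        os.seteuid(uid)
        try:
            yield
        finally:
            os.seteuid(saved_uid)  # regain the original user first
    finally:
        os.setegid(saved_gid)  # then restore the original group


# MPI initialization would happen here, under the original identity.
# Demonstrated with the current IDs so the example runs without root.
with effective_ids(os.geteuid(), os.getegid()):
    pass  # file operations would run here as the target user
# MPI finalization would happen here, again under the original identity.
```

Because only the effective IDs change, a process started as root keeps root as its real and saved IDs and can switch back before shared memory cleanup.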