akesandgren opened this issue 1 year ago
@akesandgren, can you please attach the mentioned log files?
Exactly which ones do you mean?
In the original description you mentioned the following:
Configure result - config.log
Log file - configure UCX with "--enable-logging" - and run with "UCX_LOG_LEVEL=data"
do you have those logs available?
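In case it helps, a logging-enabled build can be produced from the UCX sources roughly like this (the prefix is just a placeholder, adjust as needed), and the app then run with UCX_LOG_LEVEL=data as mentioned above:
./autogen.sh
./configure --prefix=$HOME/ucx-logging --enable-logging
make -j && make install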
The config output yes, but not the other one; I missed removing those lines originally. And we don't have an enable-logging build to run with.
The interesting part here is how two different tasks on the same node can end up with different outcomes in the choice of whether to use UCX or not. How UCX is actually built shouldn't really matter; it's a standard build with:
contrib/configure-release --prefix=/apps/Arch/software/UCX/1.10.0-GCCcore-10.3.0 --enable-optimizations --enable-cma --enable-mt --with-verbs --without-java --disable-doxygen-doc
Ok, I see. Can you please upload the outputs of both a bad and a good run? Also, please set UCX_LOG_LEVEL=debug when running the apps (even the release UCX build contains some debug traces).
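For example, a minimal sketch assuming an Open MPI mpirun launch (./app and the rank count are placeholders; -x exports the variable to every rank, and the redirection keeps the traces in a file):
mpirun -np 2 -x UCX_LOG_LEVEL=debug ./app > ucx-debug.log 2>&1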
Here's a tar file with the output of a correct run and a failed run. Both cases ran on the same node. UCX-fail.tar.gz
Thanks for the logs. The issue seems to be similar to #8511. The following errors appear when the job fails:
[1668610663.569809] [alvis-cpu1:2191091:0] mm_posix.c:194 UCX ERROR open(file_name=/proc/2191086/fd/40 flags=0x0) failed: No such file or directory
[1668610663.569831] [alvis-cpu1:2191091:0] mm_ep.c:154 UCX ERROR mm ep failed to connect to remote FIFO id 0xc000000a00216eee: Shared memory error
[1668610663.573565] [alvis-cpu1:2191117:1] mm_posix.c:194 UCX ERROR open(file_name=/proc/2191116/fd/40 flags=0x0) failed: Permission denied
[1668610663.573574] [alvis-cpu1:2191117:1] mm_ep.c:154 UCX ERROR mm ep failed to connect to remote FIFO id 0xc000000a00216f0c: Shared memory error
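For context, my reading of these errors (an interpretation on my side, not something taken from your logs): with UCX_POSIX_USE_PROC_LINK enabled, the posix shared-memory transport attaches to a peer's segment by opening /proc/<peer-pid>/fd/<fd>, so the connect fails with "No such file or directory" if that peer is already gone and with "Permission denied" if /proc/<peer-pid>/fd is not accessible from the connecting task. A quick manual check on the failing node, reusing the pid/fd from the log lines above, would be something like:
ls -l /proc/2191086/fd/40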
You are not using containers, right?
Does setting UCX_POSIX_USE_PROC_LINK=n make the problem go away?
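For example (a sketch assuming an mpirun launch; ./app is a placeholder):
mpirun -np 2 -x UCX_POSIX_USE_PROC_LINK=n ./app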
What is the output of ipcs -l?
No, not using containers.
From the node I've been running on:
ipcs -l
------ Messages Limits --------
max queues system wide = 32000
max size of message (bytes) = 8192
default max size of queue (bytes) = 16384
------ Shared Memory Limits --------
max number of segments = 4096
max seg size (kbytes) = 18014398509465599
max total shared memory (kbytes) = 18014398509481980
min seg size (bytes) = 1
------ Semaphore Limits --------
max number of arrays = 32000
max semaphores per array = 32000
max semaphores system wide = 1024000000
max ops per semop call = 500
semaphore max value = 32767
UCX_POSIX_USE_PROC_LINK=n does not fix the problem.
UCX 1.11.2 does not seem to have this problem, at least I've been unable to trigger it (still using OpenMPI 4.1.1)
A similar issue was fixed by https://github.com/open-mpi/ompi/pull/9505, which is part of OpenMPI 4.1.2 and above.
Describe the bug
Sometimes, but not always, when running OpenMPI 4.1.1 with UCX 1.10.1, the pml ucx component fails to find mlx5_core on some of the tasks in a single-node run.
Steps to Reproduce
Setup and versions
Additional information (depending on the issue)
ucx_info -d
to show transports and devices recognized by UCX
What happens is that, for a single-node job, the diff between two tasks for the pml ucx component is this:
Has this been fixed in later versions and if so which commit(s) are involved?