openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org

UCX 1.10.1 sometimes fails to find mlx5_core for some MPI tasks on single node runs #8704

Open · akesandgren opened this issue 1 year ago

akesandgren commented 1 year ago

Describe the bug

Sometimes, but not always, when running OpenMPI 4.1.1 with UCX 1.10.1, the pml ucx component fails to find mlx5_core for some of the tasks in a single-node run.

Steps to Reproduce

Setup and versions

Additional information (depending on the issue)

What happens is that, for a single-node job, the diff of the pml ucx component output between a good task and a bad task is this:

--- /tmp/good.ucx.1     2022-11-16 07:12:47.124598108 +0100
+++ /tmp/bad.ucx.1      2022-11-16 07:12:53.156640711 +0100
@@ -13,15 +13,27 @@
 common_ucx.c:303 tcp/ens1f1: did not match transport list
 common_ucx.c:303 tcp/ib0: did not match transport list
 common_ucx.c:303 tcp/ens1f1.1044: did not match transport list
-common_ucx.c:198 driver '../../../../bus/pci/drivers/mlx5_core' matched by 'mlx*'
-common_ucx.c:294 rc_verbs/mlx5_0:1: matched both transport and device list
-common_ucx.c:311 support level is transports and devices
+common_ucx.c:298 rc_verbs/mlx5_0:1: matched transport list but not device list
+common_ucx.c:298 rc_mlx5/mlx5_0:1: matched transport list but not device list
+common_ucx.c:298 dc_mlx5/mlx5_0:1: matched transport list but not device list
+common_ucx.c:298 ud_verbs/mlx5_0:1: matched transport list but not device list
+common_ucx.c:303 ud_mlx5/mlx5_0:1: did not match transport list
+common_ucx.c:298 rc_verbs/mlx5_1:1: matched transport list but not device list
+common_ucx.c:298 rc_mlx5/mlx5_1:1: matched transport list but not device list
+common_ucx.c:298 dc_mlx5/mlx5_1:1: matched transport list but not device list
+common_ucx.c:298 ud_verbs/mlx5_1:1: matched transport list but not device list
+common_ucx.c:303 ud_mlx5/mlx5_1:1: did not match transport list
+common_ucx.c:303 cma/memory: did not match transport list
+common_ucx.c:311 support level is transports only
 pml_ucx.c:289 mca_pml_ucx_init
 pml_ucx.c:114 Pack remote worker address, size 482
 pml_ucx.c:114 Pack local worker address, size 618
-pml_ucx.c:351 created ucp context 0x23184d0, worker 0x28283a0
-pml_ucx_component.c:129 returning priority 51
-selected ucx best priority 51
-select: component ucx selected
-pml_ucx.c:182 Got proc 2 address, size 618
-pml_ucx.c:411 connecting to proc. 2
+pml_ucx.c:351 created ucp context 0x902650, worker 0xe053d0
+pml_ucx_component.c:129 returning priority 19
+pml_ucx.c:367 mca_pml_ucx_cleanup
+select: component ucx not selected / finalized
+pml_ucx.c:268 mca_pml_ucx_close
+mca: base: close: component ucx closed
+mca: base: close: unloading component ucx
+check:select: checking my pml cm against process [[33740,728],0] pml ucx
+[[33740,728],10] selected pml cm, but peer [[33740,728],0] on alvis-cpu1 selected pml ucx
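For reference, a hedged sketch of how such per-task selection logs could be gathered; the MCA parameter names pml_ucx_verbose / opal_common_ucx_verbose, the rank count, and the application name are my assumptions, not taken from this issue:

# Tag every output line with its rank so a good and a bad task can be diffed later;
# ./my_app is a placeholder for the real application.
mpirun -np 2 --tag-output \
       --mca pml_ucx_verbose 100 --mca opal_common_ucx_verbose 100 \
       ./my_app 2> ucx-debug.log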

Has this been fixed in later versions, and if so, which commit(s) are involved?

brminich commented 1 year ago

@akesandgren, can you please attach the mentioned log files?

akesandgren commented 1 year ago

Exactly which ones do you mean?

brminich commented 1 year ago

In the original description you mentioned the following:

Configure result - config.log
Log file - configure UCX with "--enable-logging" - and run with "UCX_LOG_LEVEL=data"

Do you have those logs available?

akesandgren commented 1 year ago

The config output, yes, but not the other one; I missed removing those lines originally. And we don't have an --enable-logging build to run with.

The interesting part here is how two different tasks on the same node can end up with different outcomes in the choice of whether to use ucx. How it is actually built shouldn't really matter; it's a standard build with:

contrib/configure-release --prefix=/apps/Arch/software/UCX/1.10.0-GCCcore-10.3.0 --enable-optimizations --enable-cma --enable-mt --with-verbs --without-java --disable-doxygen-doc

UCX.config.txt
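As a side note (not from the issue): the ucx_info tool that ships with UCX can confirm, independently of MPI, what such a build detects on a node, both the configure flags and the transports/devices such as mlx5_0:

ucx_info -v                                # version and configure options of the installed UCX
ucx_info -d | grep -E 'Transport|Device'   # transports and devices detected on this node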

brminich commented 1 year ago

Ok, I see. Can you please upload the outputs of both the bad and the good runs? Also, please set UCX_LOG_LEVEL=debug when running the apps (even the release UCX build contains some debug traces).
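A hedged example of capturing that output per process; the application name is a placeholder, and the UCX_LOG_FILE pattern with %h/%p substitution is assumed to be supported by this UCX version:

export UCX_LOG_LEVEL=debug
export UCX_LOG_FILE=/tmp/ucx.%h.%p.log   # one log file per host and PID
mpirun -np 2 ./my_app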

akesandgren commented 1 year ago

Here's a tar file with a correct and a failed output. Both cases were run on the same node. UCX-fail.tar.gz

brminich commented 1 year ago

Thanks for the logs. The issue seems to be similar to #8511. The following errors appear when the job fails:

[1668610663.569809] [alvis-cpu1:2191091:0]       mm_posix.c:194  UCX  ERROR open(file_name=/proc/2191086/fd/40 flags=0x0) failed: No such file or directory
[1668610663.569831] [alvis-cpu1:2191091:0]          mm_ep.c:154  UCX  ERROR mm ep failed to connect to remote FIFO id 0xc000000a00216eee: Shared memory error

[1668610663.573565] [alvis-cpu1:2191117:1]       mm_posix.c:194  UCX  ERROR open(file_name=/proc/2191116/fd/40 flags=0x0) failed: Permission denied
[1668610663.573574] [alvis-cpu1:2191117:1]          mm_ep.c:154  UCX  ERROR mm ep failed to connect to remote FIFO id 0xc000000a00216f0c: Shared memory error
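Hedged interpretation: with the posix proc-link mechanism, the peer's shared-memory segment is opened through the creator's /proc/<pid>/fd/<fd> entry, so ENOENT usually means that process or fd is already gone, and EACCES means the kernel refused access to another process's /proc entry. The paths below are the ones from the log and could be inspected on the node while the job is still up:

ls -l /proc/2191086/fd/40   # the fd behind the "No such file or directory" failure
ls -l /proc/2191116/fd/40   # the fd behind the "Permission denied" failure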

You are not using containers, right? Does setting UCX_POSIX_USE_PROC_LINK=n make the problem go away? What is the output of ipcs -l?
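A hedged way to test that suggestion, using mpirun's -x option to forward the variable to all ranks (rank count and application name are placeholders):

mpirun -np 16 -x UCX_POSIX_USE_PROC_LINK=n ./my_app
ipcs -l   # the SysV IPC limits requested above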

akesandgren commented 1 year ago

No, not using containers.

From the node I've been running on:

ipcs -l

------ Messages Limits --------
max queues system wide = 32000
max size of message (bytes) = 8192
default max size of queue (bytes) = 16384

------ Shared Memory Limits --------
max number of segments = 4096
max seg size (kbytes) = 18014398509465599
max total shared memory (kbytes) = 18014398509481980
min seg size (bytes) = 1

------ Semaphore Limits --------
max number of arrays = 32000
max semaphores per array = 32000
max semaphores system wide = 1024000000
max ops per semop call = 500
semaphore max value = 32767

UCX_POSIX_USE_PROC_LINK=n does not fix the problem.

UCX 1.11.2 does not seem to have this problem; at least I've been unable to trigger it (still using OpenMPI 4.1.1).

yosefe commented 1 year ago

A similar issue was fixed by https://github.com/open-mpi/ompi/pull/9505, which is part of OpenMPI 4.1.2 and above.
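To check whether an installed Open MPI already contains that fix, its version can be queried; per the above, the fix is in 4.1.2 and newer:

ompi_info --version
mpirun --version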