Open ebolandrtx opened 1 year ago
Seems similar to #8620
@ebolandrtx can you please check if `UCX_TLS=self,sm,ud_v` or `UCX_TLS=self,sm,dc` makes the issue go away?
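One way to try this (a sketch only; the job size and `./app` binary are placeholders for your actual launch line) is to export the variable before launching:

```shell
# Restrict UCX to the suggested transports for a single test run.
export UCX_TLS=self,sm,ud_v     # or: export UCX_TLS=self,sm,dc
mpirun -n 128 ./app             # -n 128 and ./app are placeholders
```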
Hi, I ended up setting `UCX_TLS=rc,ud,sm,self` and that worked. Is there a difference between ud_v and ud that I should be aware of, and should I be using dc instead of rc?
Seeing two new errors now (associated):
```
sys.c:314 UCX WARN could not find address of current library: (null)
module.c:68 UCX ERROR dladdr failed: (null)
```
> Hi, I ended up setting UCX_TLS=rc,ud,sm,self and that worked. Is there a difference between ud_v and ud that I should be aware of, and should I be using dc instead of rc?
`UCX_TLS=rc,ud,sm,self` will use the RC transport. However, it's recommended to use DC with `UCX_TLS=dc,self,sm`.
BTW, UCX would prefer DC when possible, but AFAIR Intel MPI sets UCX_TLS to use ud.
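If Intel MPI is overriding the environment, the variable can be pushed through its launcher instead; a sketch, assuming Intel MPI's `mpirun` (`-genv` exports a variable to all ranks; the rank count and `./app` are placeholders):

```shell
# Force DC-based transport selection onto every rank via Intel MPI's launcher.
mpirun -genv UCX_TLS dc,self,sm -n 1024 ./app   # -n 1024 and ./app are placeholders
```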
"ud_v" is a different (and slightly less performant) implementation, if it works it can help narrow down the issue.
> Seeing two new errors now (associated):
>
> sys.c:314 UCX WARN could not find address of current library: (null)
> module.c:68 UCX ERROR dladdr failed: (null)
Are you using static linking?
Hi,
We are transitioning to the Red Hat-provided UCX library and have noticed that it crashes under Intel MPI once the core count exceeds a certain threshold. We are seeing the following error:
```
HOST:1151067:0:1151067] ud_ep.c:888 Assertion `ctl->type == UCT_UD_PACKET_CREP' failed
==== backtrace (tid:1151067) ====
 0 /usr/lib64/libucs.so.0(ucs_handle_error+0x2dc) [0x7fba21010edc]
 1 /usr/lib64/libucs.so.0(ucs_fatal_error_message+0xb1) [0x7fba2100dd41]
 2 /usr/lib64/libucs.so.0(ucs_fatal_error_format+0x10f) [0x7fba2100de5f]
 3 /usr/lib64/ucx/libuct_ib.so.0(+0x5b890) [0x7fba1f052890]
 4 /usr/lib64/ucx/libuct_ib.so.0(uct_ud_ep_process_rx+0x316) [0x7fba1f052d96]
 5 /usr/lib64/ucx/libuct_ib.so.0(+0x6470d) [0x7fba1f05b70d]
 6 /usr/lib64/libucp.so.0(ucp_worker_progress+0x2a) [0x7fba214bdada]
 7 /apps/intelmpi_2021.6.0.602/mpi/latest/libfabric/lib/prov/libmlx-fi.so(+0xa7a1) [0x7fba217457a1]
 8 /apps/intelmpi_2021.6.0.602/mpi/latest/libfabric/lib/prov/libmlx-fi.so(+0x22b0d) [0x7fba2175db0d]
 9 /apps/intelmpi_2021.6.0.602/mpi/latest/libfabric/lib/prov/libmlx-fi.so(+0x22a97) [0x7fba2175da97]
10 /apps/intelmpi_2021.6.0.602/mpi/latest/lib/release/libmpi.so.12(+0x62b3fe) [0x7fbb704043fe]
11 /apps/intelmpi_2021.6.0.602/mpi/latest/lib/release/libmpi.so.12(+0x1fa7a1) [0x7fbb6ffd37a1]
12 /apps/intelmpi_2021.6.0.602/mpi/latest/lib/release/libmpi.so.12(+0x78eb7e) [0x7fbb70567b7e]
13 /apps/intelmpi_2021.6.0.602/mpi/latest/lib/release/libmpi.so.12(+0x371f43) [0x7fbb7014af43]
14 /apps/intelmpi_2021.6.0.602/mpi/latest/lib/release/libmpi.so.12(+0x263faa) [0x7fbb7003cfaa]
15 /apps/intelmpi_2021.6.0.602/mpi/latest/lib/release/libmpi.so.12(+0x16a7a2) [0x7fbb6ff437a2]
16 /apps/intelmpi_2021.6.0.602/mpi/latest/lib/release/libmpi.so.12(+0x19e9cd) [0x7fbb6ff779cd]
17 /apps/intelmpi_2021.6.0.602/mpi/latest/lib/release/libmpi.so.12(+0x1ba1c7) [0x7fbb6ff931c7]
18 /apps/intelmpi_2021.6.0.602/mpi/latest/lib/release/libmpi.so.12(+0x185bd4) [0x7fbb6ff5ebd4]
19 /apps/intelmpi_2021.6.0.602/mpi/latest/lib/release/libmpi.so.12(+0x165780) [0x7fbb6ff3e780]
20 /apps/intelmpi_2021.6.0.602/mpi/latest/lib/release/libmpi.so.12(+0x2674fd) [0x7fbb700404fd]
21 /apps/intelmpi_2021.6.0.602/mpi/latest/lib/release/libmpi.so.12(MPI_Bcast+0x51f) [0x7fbb6ff26a8f]
```
We are using the native RHEL8 UCX:
```
Version 1.13.0
Git branch '', revision 6765970
Configured with: --build=x86_64-redhat-linux-gnu --host=x86_64-redhat-linux-gnu --program-prefix= --disable-dependency-tracking --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/libexec --localstatedir=/var --sharedstatedir=/var/lib --mandir=/usr/share/man --infodir=/usr/share/info --disable-optimizations --disable-logging --disable-debug --disable-assertions --disable-params-check --without-java --enable-cma --without-cuda --without-gdrcopy --with-verbs --without-cm --without-knem --with-rdmacm --without-rocm --without-xpmem --without-fuse3 --without-ugni
```
Our InfiniBand device info is:

```
hca_id: mlx5_0
    transport: InfiniBand (0)
    fw_ver:    12.24.1000
```
Based on some of the issues I've read, I suspect this is related to RC vs. DC communication at higher core counts, and would appreciate some guidance on a fix.