openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org

UCX Crashing with > 160 cores when using Intel MPI #9071

Open ebolandrtx opened 1 year ago

ebolandrtx commented 1 year ago

Hi

We are transitioning to the Red Hat-provided UCX library and have noticed that it crashes under Intel MPI once the core count exceeds a certain threshold. We are seeing the following error:

```
HOST:1151067:0:1151067] ud_ep.c:888 Assertion `ctl->type == UCT_UD_PACKET_CREP' failed
==== backtrace (tid:1151067) ====
 0 /usr/lib64/libucs.so.0(ucs_handle_error+0x2dc) [0x7fba21010edc]
 1 /usr/lib64/libucs.so.0(ucs_fatal_error_message+0xb1) [0x7fba2100dd41]
 2 /usr/lib64/libucs.so.0(ucs_fatal_error_format+0x10f) [0x7fba2100de5f]
 3 /usr/lib64/ucx/libuct_ib.so.0(+0x5b890) [0x7fba1f052890]
 4 /usr/lib64/ucx/libuct_ib.so.0(uct_ud_ep_process_rx+0x316) [0x7fba1f052d96]
 5 /usr/lib64/ucx/libuct_ib.so.0(+0x6470d) [0x7fba1f05b70d]
 6 /usr/lib64/libucp.so.0(ucp_worker_progress+0x2a) [0x7fba214bdada]
 7 /apps/intelmpi_2021.6.0.602/mpi/latest/libfabric/lib/prov/libmlx-fi.so(+0xa7a1) [0x7fba217457a1]
 8 /apps/intelmpi_2021.6.0.602/mpi/latest/libfabric/lib/prov/libmlx-fi.so(+0x22b0d) [0x7fba2175db0d]
 9 /apps/intelmpi_2021.6.0.602/mpi/latest/libfabric/lib/prov/libmlx-fi.so(+0x22a97) [0x7fba2175da97]
10 /apps/intelmpi_2021.6.0.602/mpi/latest/lib/release/libmpi.so.12(+0x62b3fe) [0x7fbb704043fe]
11 /apps/intelmpi_2021.6.0.602/mpi/latest/lib/release/libmpi.so.12(+0x1fa7a1) [0x7fbb6ffd37a1]
12 /apps/intelmpi_2021.6.0.602/mpi/latest/lib/release/libmpi.so.12(+0x78eb7e) [0x7fbb70567b7e]
13 /apps/intelmpi_2021.6.0.602/mpi/latest/lib/release/libmpi.so.12(+0x371f43) [0x7fbb7014af43]
14 /apps/intelmpi_2021.6.0.602/mpi/latest/lib/release/libmpi.so.12(+0x263faa) [0x7fbb7003cfaa]
15 /apps/intelmpi_2021.6.0.602/mpi/latest/lib/release/libmpi.so.12(+0x16a7a2) [0x7fbb6ff437a2]
16 /apps/intelmpi_2021.6.0.602/mpi/latest/lib/release/libmpi.so.12(+0x19e9cd) [0x7fbb6ff779cd]
17 /apps/intelmpi_2021.6.0.602/mpi/latest/lib/release/libmpi.so.12(+0x1ba1c7) [0x7fbb6ff931c7]
18 /apps/intelmpi_2021.6.0.602/mpi/latest/lib/release/libmpi.so.12(+0x185bd4) [0x7fbb6ff5ebd4]
19 /apps/intelmpi_2021.6.0.602/mpi/latest/lib/release/libmpi.so.12(+0x165780) [0x7fbb6ff3e780]
20 /apps/intelmpi_2021.6.0.602/mpi/latest/lib/release/libmpi.so.12(+0x2674fd) [0x7fbb700404fd]
21 /apps/intelmpi_2021.6.0.602/mpi/latest/lib/release/libmpi.so.12(MPI_Bcast+0x51f) [0x7fbb6ff26a8f]
```

We are using the native RHEL8 UCX:

```
Version 1.13.0
Git branch '', revision 6765970
Configured with: --build=x86_64-redhat-linux-gnu --host=x86_64-redhat-linux-gnu --program-prefix= --disable-dependency-tracking --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/libexec --localstatedir=/var --sharedstatedir=/var/lib --mandir=/usr/share/man --infodir=/usr/share/info --disable-optimizations --disable-logging --disable-debug --disable-assertions --disable-params-check --without-java --enable-cma --without-cuda --without-gdrcopy --with-verbs --without-cm --without-knem --with-rdmacm --without-rocm --without-xpmem --without-fuse3 --without-ugni
```

Our infiniband device info is:

```
hca_id: mlx5_0
    transport: InfiniBand (0)
    fw_ver:    12.24.1000
```
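For reference, the build and device details above can be regenerated with the standard tools (a sketch, assuming `ucx_info` from UCX and `ibv_devinfo` from rdma-core are on PATH):

```sh
ucx_info -v     # prints the UCX version, git revision and configure flags
ibv_devinfo     # prints per-HCA info: hca_id, transport, fw_ver, ...
```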

Based on some of the issues I've read, I suspect this is related to RC vs. DC communication at higher core counts, and I would appreciate some discussion on how to fix it.
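One way to check that suspicion (an illustrative sketch, not something from this thread; `./app` is a placeholder binary) is to list the transports UCX detects on the node and then raise the log level on a small run to see which ones it actually selects:

```sh
# Transports and devices UCX can see on this node (rc, dc, ud, ...)
ucx_info -d | grep -E 'Transport|Device'

# At info level, UCX typically logs the transports selected per endpoint
UCX_LOG_LEVEL=info ./app
```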

yosefe commented 1 year ago

Seems similar to #8620

yosefe commented 1 year ago

@ebolandrtx can you pls check if UCX_TLS=self,sm,ud_v or UCX_TLS=self,sm,dc makes the issue go away?
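For illustration, one way to try those settings under Intel MPI (a sketch; the rank count and `./app` are placeholders, and `-genv` propagates the variable to all ranks):

```sh
mpirun -n 192 -genv UCX_TLS self,sm,ud_v ./app
mpirun -n 192 -genv UCX_TLS self,sm,dc   ./app
```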

ebolandrtx commented 1 year ago

Hi, I ended up setting UCX_TLS=rc,ud,sm,self and that worked. Is there a difference between ud_v and ud that I should be aware of, and should I be using dc instead of rc?

ebolandrtx commented 1 year ago

Seeing two new errors now (associated):

```
sys.c:314   UCX  WARN  could not find address of current library: (null)
module.c:68 UCX  ERROR dladdr failed: (null)
```

yosefe commented 1 year ago

> Hi, I ended up setting UCX_TLS=rc,ud,sm,self and that worked. Is there a difference between ud_v and ud that I should be aware of, and should I be using dc instead of rc?

UCX_TLS=rc,ud,sm,self will use the RC transport. However, it's recommended to use DC with UCX_TLS=dc,self,sm. BTW, UCX would prefer DC when possible, but AFAIR Intel MPI sets UCX_TLS to use ud. "ud_v" is a different (and slightly less performant) implementation of UD; if it works, it can help narrow down the issue.
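A hedged sketch of applying the recommended DC setting with Intel MPI (`./app` and the rank count are placeholders; `-genv` is mpirun's per-variable export flag):

```sh
export UCX_TLS=dc,self,sm             # DC is the transport recommended above
mpirun -n 192 ./app

# or set it for a single run only:
mpirun -n 192 -genv UCX_TLS dc,self,sm ./app
```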

> Seeing two new errors now (associated):

> sys.c:314 UCX WARN could not find address of current library: (null)
> module.c:68 UCX ERROR dladdr failed: (null)

Are you using static link?
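(For reference, one quick way to check that, not from the thread: a dynamically linked build lists the UCX shared libraries, while a statically linked one does not; `./app` is a placeholder.)

```sh
ldd ./app | grep -E 'libuc[pst]'   # libucp/libuct/libucs present => dynamic link
```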