openhpc / ohpc

OpenHPC Integration, Packaging, and Test Repo
http://openhpc.community
Apache License 2.0

ucx update from OpenHPC 3.2 broke existing openmpi5-gnu13 #2059

Open LaHaine opened 2 hours ago

LaHaine commented 2 hours ago

This is on AlmaLinux 9.4. After updating ucx-ohpc to 1.17.0-320.ohpc.1.1.x86_64, all binaries built with openmpi5-gnu13 fail like this:

[pax10-01:548615:0:548615] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
[pax10-01:548616:0:548616] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace (tid: 548615) ====
 0  /opt/ohpc/pub/mpi/ucx-ohpc/1.17.0/lib/libucs.so.0(ucs_handle_error+0x294) [0x7f9a570a4c44]
 1  /opt/ohpc/pub/mpi/ucx-ohpc/1.17.0/lib/libucs.so.0(+0x31dff) [0x7f9a570a4dff]
 2  /opt/ohpc/pub/mpi/ucx-ohpc/1.17.0/lib/libucs.so.0(+0x320c6) [0x7f9a570a50c6]
 3  /lib64/libc.so.6(+0x3e6f0) [0x7f9a56e3e6f0]
 4  /opt/ohpc/pub/mpi/ucx-ohpc/1.17.0/lib/libuct.so.0(uct_md_query+0x1f) [0x7f9a57049d8f]
 5  /opt/ohpc/pub/mpi/openmpi5-gnu13/5.0.3/lib/libopen-pal.so.80(+0xbf85f) [0x7f9a5719b85f]
 6  /opt/ohpc/pub/mpi/openmpi5-gnu13/5.0.3/lib/libopen-pal.so.80(mca_btl_base_select+0x108) [0x7f9a571969b8]
 7  /opt/ohpc/pub/mpi/openmpi5-gnu13/5.0.3/lib/libmpi.so.40(mca_bml_r2_component_init+0x12) [0x7f9a572fc162]
 8  /opt/ohpc/pub/mpi/openmpi5-gnu13/5.0.3/lib/libmpi.so.40(mca_bml_base_init+0x93) [0x7f9a572fa2c3]
 9  /opt/ohpc/pub/mpi/openmpi5-gnu13/5.0.3/lib/libmpi.so.40(+0x28eafa) [0x7f9a5748eafa]
10  /opt/ohpc/pub/mpi/openmpi5-gnu13/5.0.3/lib/libmpi.so.40(mca_pml_base_select+0x466) [0x7f9a57485876]
11  /opt/ohpc/pub/mpi/openmpi5-gnu13/5.0.3/lib/libmpi.so.40(+0xa641d) [0x7f9a572a641d]
12  /opt/ohpc/pub/mpi/openmpi5-gnu13/5.0.3/lib/libmpi.so.40(ompi_mpi_instance_init+0x64) [0x7f9a572a6bc4]
13  /opt/ohpc/pub/mpi/openmpi5-gnu13/5.0.3/lib/libmpi.so.40(ompi_mpi_init+0xaa) [0x7f9a5729965a]
14  /opt/ohpc/pub/mpi/openmpi5-gnu13/5.0.3/lib/libmpi.so.40(MPI_Init+0x6d) [0x7f9a572cdf1d]
15  cpi() [0x40123e]
16  /lib64/libc.so.6(+0x29590) [0x7f9a56e29590]
17  /lib64/libc.so.6(__libc_start_main+0x80) [0x7f9a56e29640]
18  cpi() [0x4010f5]
=================================
==== backtrace (tid: 548616) ====
 0  /opt/ohpc/pub/mpi/ucx-ohpc/1.17.0/lib/libucs.so.0(ucs_handle_error+0x294) [0x7f80766a4c44]
 1  /opt/ohpc/pub/mpi/ucx-ohpc/1.17.0/lib/libucs.so.0(+0x31dff) [0x7f80766a4dff]
 2  /opt/ohpc/pub/mpi/ucx-ohpc/1.17.0/lib/libucs.so.0(+0x320c6) [0x7f80766a50c6]
 3  /lib64/libc.so.6(+0x3e6f0) [0x7f807643e6f0]
 4  /opt/ohpc/pub/mpi/ucx-ohpc/1.17.0/lib/libuct.so.0(uct_md_query+0x1f) [0x7f80
 5  /opt/ohpc/pub/mpi/openmpi5-gnu13/5.0.3/lib/libopen-pal.so.80(+0xbf85f) [0x7f807679b85f]
 6  /opt/ohpc/pub/mpi/openmpi5-gnu13/5.0.3/lib/libopen-pal.so.80(mca_btl_base_select+0x108) [0x7f80767969b8]
 7  /opt/ohpc/pub/mpi/openmpi5-gnu13/5.0.3/lib/libmpi.so.40(mca_bml_r2_component_init+0x12) [0x7f80768fc162]
 8  /opt/ohpc/pub/mpi/openmpi5-gnu13/5.0.3/lib/libmpi.so.40(mca_bml_base_init+0x93) [0x7f80768fa2c3]
 9  /opt/ohpc/pub/mpi/openmpi5-gnu13/5.0.3/lib/libmpi.so.40(+0x28eafa) [0x7f8076a8eafa]
10  /opt/ohpc/pub/mpi/openmpi5-gnu13/5.0.3/lib/libmpi.so.40(mca_pml_base_select+0x466) [0x7f8076a85876]
11  /opt/ohpc/pub/mpi/openmpi5-gnu13/5.0.3/lib/libmpi.so.40(+0xa641d) [0x7f80768a641d]
12  /opt/ohpc/pub/mpi/openmpi5-gnu13/5.0.3/lib/libmpi.so.40(ompi_mpi_instance_init+0x64) [0x7f80768a6bc4]
13  /opt/ohpc/pub/mpi/openmpi5-gnu13/5.0.3/lib/libmpi.so.40(ompi_mpi_init+0xaa) [0x7f807689965a]
14  /opt/ohpc/pub/mpi/openmpi5-gnu13/5.0.3/lib/libmpi.so.40(MPI_Init+0x6d) [0x7f80768cdf1d]
15  cpi() [0x40123e]
16  /lib64/libc.so.6(+0x29590) [0x7f8076429590]
17  /lib64/libc.so.6(__libc_start_main+0x80) [0x7f8076429640]
18  cpi() [0x4010f5]
=================================
--------------------------------------------------------------------------
prterun noticed that process rank 0 with PID 548615 on node pax10-01 exited on
signal 11 (Segmentation fault).
--------------------------------------------------------------------------

The same binary runs fine with openmpi5-gnu14-ohpc-5.0.5-320.ohpc.2.1.x86_64.

LaHaine commented 1 hour ago

Rebuilding openmpi5 against ucx 1.17.0 doesn't change this.

adrianreber commented 1 hour ago

I just tried it and cannot reproduce it. Can you share a minimal test case? How are you starting it? What backend does Open MPI use? I tried it on two nodes connected over Ethernet. How many nodes are you using?

Can you try to downgrade UCX to 1.15 from the update.3.1 directory?

I don't see anything about incompatibilities between 1.15 and 1.17 on the UCX release page.
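
While waiting for a reproducer, the backtrace already posted can be summarized mechanically to see which library faulted and which named function led there. The following is only a rough sketch: `parse_frames`, `faulting_frame`, and `nearest_named_caller` are hypothetical helper names, the sample frames are copied from the first backtrace above (tid 548615), and the rule "skip leading frames in libucs and libc" is an assumption about how the UCX error handler shows up at the top of the trace.

```python
import re
from pathlib import PurePosixPath

# A few frames copied from the first backtrace in the report (tid 548615).
BACKTRACE = """\
 0  /opt/ohpc/pub/mpi/ucx-ohpc/1.17.0/lib/libucs.so.0(ucs_handle_error+0x294) [0x7f9a570a4c44]
 3  /lib64/libc.so.6(+0x3e6f0) [0x7f9a56e3e6f0]
 4  /opt/ohpc/pub/mpi/ucx-ohpc/1.17.0/lib/libuct.so.0(uct_md_query+0x1f) [0x7f9a57049d8f]
 5  /opt/ohpc/pub/mpi/openmpi5-gnu13/5.0.3/lib/libopen-pal.so.80(+0xbf85f) [0x7f9a5719b85f]
 6  /opt/ohpc/pub/mpi/openmpi5-gnu13/5.0.3/lib/libopen-pal.so.80(mca_btl_base_select+0x108) [0x7f9a571969b8]
14  /opt/ohpc/pub/mpi/openmpi5-gnu13/5.0.3/lib/libmpi.so.40(MPI_Init+0x6d) [0x7f9a572cdf1d]
15  cpi() [0x40123e]
"""

# Matches "<index>  <path>(<symbol>+0x<offset>) [0x<address>]"; the symbol
# and the offset may each be absent, as in "(+0x3e6f0)" or "cpi()".
FRAME_RE = re.compile(
    r"^\s*(?P<idx>\d+)\s+(?P<path>\S+?)"
    r"\((?P<sym>[^+)]*)(?:\+0x[0-9a-f]+)?\)"
    r"\s+\[0x[0-9a-f]+\]\s*$"
)


def parse_frames(text):
    """Return (index, library file name, symbol or None) for each frame line."""
    frames = []
    for line in text.splitlines():
        m = FRAME_RE.match(line)
        if m:
            lib = PurePosixPath(m["path"]).name
            frames.append((int(m["idx"]), lib, m["sym"] or None))
    return frames


def faulting_frame(frames):
    """Heuristic (assumption): the first frame outside libucs/libc is the
    one that was running when the signal arrived."""
    for frame in frames:
        if not frame[1].startswith(("libucs.so", "libc.so")):
            return frame
    return None


def nearest_named_caller(frames, fault):
    """First frame above the faulting one that carries a symbol name."""
    for frame in frames:
        if frame[0] > fault[0] and frame[2] is not None:
            return frame
    return None


if __name__ == "__main__":
    frames = parse_frames(BACKTRACE)
    fault = faulting_frame(frames)
    print("faulting frame:", fault)
    print("nearest named caller:", nearest_named_caller(frames, fault))
```

On the posted trace this points at `uct_md_query` in libuct, reached through `mca_btl_base_select` in libopen-pal, so the crash appears to happen while Open MPI is selecting BTL components. That is a reading of the trace, not a confirmed diagnosis.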