open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

HCOLL fails to save coll handlers and ends in Segfault when used with HAN #10718

Open devreal opened 2 years ago

devreal commented 2 years ago

I'm trying to run coll/han with coll/hcoll as a backend but see the following issue on both main and 5.0.x on Hawk (ConnectX-6 fabric):

mpirun -N 4 -n 16 --mca coll_han_priority 100 --mca coll_adapt_priority 0 --mca coll_hcoll_enable 1 --mca coll_tuned_priority 10 --mca coll_hcoll_priority 80 ~/src/osu-benchmarks/osu-micro-benchmarks-5.6.2/build/mpi/collective/osu_reduce

# OSU MPI Reduce Latency Test v5.6.2
# Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations
[r37c4t7n4:44848] Error: ../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:241 - mca_coll_hcoll_module_enable() coll_hcol: mca_coll_hcoll_save_coll_handlers failed
[r37c4t7n4:44846] Error: ../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:241 - mca_coll_hcoll_module_enable() coll_hcol: mca_coll_hcoll_save_coll_handlers failed
[r37c4t7n4:44845] Error: ../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:241 - mca_coll_hcoll_module_enable() coll_hcol: mca_coll_hcoll_save_coll_handlers failed
[r37c4t7n4:44847] Error: ../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:241 - mca_coll_hcoll_module_enable() coll_hcol: mca_coll_hcoll_save_coll_handlers failed
[r37c4t8n3:52753] Error: ../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:241 - mca_coll_hcoll_module_enable() coll_hcol: mca_coll_hcoll_save_coll_handlers failed
[r37c4t8n4:64546] Error: ../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:241 - mca_coll_hcoll_module_enable() coll_hcol: mca_coll_hcoll_save_coll_handlers failed
[r37c4t8n2:136568] Error: ../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:241 - mca_coll_hcoll_module_enable() coll_hcol: mca_coll_hcoll_save_coll_handlers failed
[r37c4t8n3:52752] Error: ../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:241 - mca_coll_hcoll_module_enable() coll_hcol: mca_coll_hcoll_save_coll_handlers failed
[r37c4t8n4:64544] Error: ../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:241 - mca_coll_hcoll_module_enable() coll_hcol: mca_coll_hcoll_save_coll_handlers failed
[r37c4t8n3:52751] Error: ../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:241 - mca_coll_hcoll_module_enable() coll_hcol: mca_coll_hcoll_save_coll_handlers failed
[r37c4t8n4:64547] Error: ../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:241 - mca_coll_hcoll_module_enable() coll_hcol: mca_coll_hcoll_save_coll_handlers failed
[r37c4t8n3:52754] Error: ../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:241 - mca_coll_hcoll_module_enable() coll_hcol: mca_coll_hcoll_save_coll_handlers failed
[r37c4t8n4:64545] Error: ../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:241 - mca_coll_hcoll_module_enable() coll_hcol: mca_coll_hcoll_save_coll_handlers failed
[r37c4t8n2:136569] Error: ../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:241 - mca_coll_hcoll_module_enable() coll_hcol: mca_coll_hcoll_save_coll_handlers failed
[r37c4t8n2:136567] Error: ../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:241 - mca_coll_hcoll_module_enable() coll_hcol: mca_coll_hcoll_save_coll_handlers failed
[r37c4t8n2:136566] Error: ../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:241 - mca_coll_hcoll_module_enable() coll_hcol: mca_coll_hcoll_save_coll_handlers failed
4                       6.28              0.53             51.10        1000
8                       6.13              0.54             48.90        1000
16                     12.88              0.56             74.71        1000
32                     12.92              0.58             74.65        1000
64                     12.93              0.61             74.62        1000
128                    13.18              0.65             75.69        1000
256                    13.42              0.62             77.71        1000
512                    14.28              0.67             83.06        1000
1024                   14.93              0.81             85.44        1000
2048                   17.71              0.94            100.48        1000
4096                   39.10              4.10            227.27        1000
8192                   50.44              4.67            294.51        1000
16384                  64.91              6.21            375.91        1000
32768                 100.64              9.80            583.28        1000
65536                 171.64             22.55            966.53         100
131072                199.31             34.74            997.37         100
262144                259.61             68.76            927.96         100
524288                588.19            144.79           1817.62         100
1048576              1988.30           1510.59           3094.54         100

At the end of the run, during MPI finalization, I get a segfault:

==== backtrace (tid: 136568) ====
 0  libucs.so.0(ucs_handle_error+0x254) [0x7fe976256594]
 1  libucs.so.0(+0x2d777) [0x7fe976256777]
 2  libucs.so.0(+0x2da4e) [0x7fe976256a4e]
 3  /lib64/libpthread.so.0(+0x12b20) [0x7fe977b53b20]
 4  /opt/mellanox/hcoll/lib/libhcoll.so.1(hcoll_update_context_cache_on_group_destruction+0x9e) [0x7fe9774ff84e]
 5  /opt/mellanox/hcoll/lib/libhcoll.so.1(hcoll_context_free+0x148) [0x7fe9774fd4c8]
 6  libmpi.so.80(+0x1102d3) [0x7fe9787a02d3]
 7  libmpi.so.80(+0x678f9) [0x7fe9786f78f9]
 8  libmpi.so.80(ompi_attr_delete_all+0x173) [0x7fe9786f9293]
 9  libmpi.so.80(ompi_comm_free+0x3c) [0x7fe9786fc5ec]
10  libmpi.so.80(+0x150246) [0x7fe9787e0246]
11  libmpi.so.80(mca_coll_base_comm_unselect+0x1d79) [0x7fe97878a379]
12  libmpi.so.80(+0x69cbc) [0x7fe9786f9cbc]
13  libmpi.so.80(+0x6a2c9) [0x7fe9786fa2c9]
14  libopen-pal.so.80(opal_finalize_cleanup_domain+0x4a) [0x7fe978b1292a]
15  libopen-pal.so.80(opal_finalize+0x3f) [0x7fe978b12a9f]
16  libmpi.so.80(ompi_rte_finalize+0x13a) [0x7fe9787281fa]
17  libmpi.so.80(+0x9df6c) [0x7fe97872df6c]
18  libmpi.so.80(ompi_mpi_instance_finalize+0xc5) [0x7fe97872f595]
19  libmpi.so.80(ompi_mpi_finalize+0x163) [0x7fe978723f73]
20  osu_reduce() [0x402716]
21  /lib64/libc.so.6(__libc_start_main+0xf3) [0x7fe97779f493]
22  osu_reduce() [0x40294e]
=================================

It looks like coll/hcoll passes a corrupted or invalid context to hcoll_context_free.

gkatev commented 2 years ago

Looks like the same cause as #9885 (?)