Open devreal opened 2 years ago
I'm trying to run `coll/han` with `coll/hcoll` as a backend but see the following issue on both `main` and `5.0.x` on Hawk (ConnectX-6 fabric):

    mpirun -N 4 -n 16 --mca coll_han_priority 100 --mca coll_adapt_priority 0 --mca coll_hcoll_enable 1 --mca coll_tuned_priority 10 --mca coll_hcoll_priority 80 ~/src/osu-benchmarks/osu-micro-benchmarks-5.6.2/build/mpi/collective/osu_reduce

    # OSU MPI Reduce Latency Test v5.6.2
    # Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations
    [r37c4t7n4:44848] Error: ../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:241 - mca_coll_hcoll_module_enable() coll_hcol: mca_coll_hcoll_save_coll_handlers failed
    [r37c4t7n4:44846] Error: ../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:241 - mca_coll_hcoll_module_enable() coll_hcol: mca_coll_hcoll_save_coll_handlers failed
    [r37c4t7n4:44845] Error: ../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:241 - mca_coll_hcoll_module_enable() coll_hcol: mca_coll_hcoll_save_coll_handlers failed
    [r37c4t7n4:44847] Error: ../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:241 - mca_coll_hcoll_module_enable() coll_hcol: mca_coll_hcoll_save_coll_handlers failed
    [r37c4t8n3:52753] Error: ../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:241 - mca_coll_hcoll_module_enable() coll_hcol: mca_coll_hcoll_save_coll_handlers failed
    [r37c4t8n4:64546] Error: ../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:241 - mca_coll_hcoll_module_enable() coll_hcol: mca_coll_hcoll_save_coll_handlers failed
    [r37c4t8n2:136568] Error: ../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:241 - mca_coll_hcoll_module_enable() coll_hcol: mca_coll_hcoll_save_coll_handlers failed
    [r37c4t8n3:52752] Error: ../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:241 - mca_coll_hcoll_module_enable() coll_hcol: mca_coll_hcoll_save_coll_handlers failed
    [r37c4t8n4:64544] Error: ../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:241 - mca_coll_hcoll_module_enable() coll_hcol: mca_coll_hcoll_save_coll_handlers failed
    [r37c4t8n3:52751] Error: ../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:241 - mca_coll_hcoll_module_enable() coll_hcol: mca_coll_hcoll_save_coll_handlers failed
    [r37c4t8n4:64547] Error: ../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:241 - mca_coll_hcoll_module_enable() coll_hcol: mca_coll_hcoll_save_coll_handlers failed
    [r37c4t8n3:52754] Error: ../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:241 - mca_coll_hcoll_module_enable() coll_hcol: mca_coll_hcoll_save_coll_handlers failed
    [r37c4t8n4:64545] Error: ../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:241 - mca_coll_hcoll_module_enable() coll_hcol: mca_coll_hcoll_save_coll_handlers failed
    [r37c4t8n2:136569] Error: ../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:241 - mca_coll_hcoll_module_enable() coll_hcol: mca_coll_hcoll_save_coll_handlers failed
    [r37c4t8n2:136567] Error: ../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:241 - mca_coll_hcoll_module_enable() coll_hcol: mca_coll_hcoll_save_coll_handlers failed
    [r37c4t8n2:136566] Error: ../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:241 - mca_coll_hcoll_module_enable() coll_hcol: mca_coll_hcoll_save_coll_handlers failed
    4                       6.28              0.53             51.10        1000
    8                       6.13              0.54             48.90        1000
    16                     12.88              0.56             74.71        1000
    32                     12.92              0.58             74.65        1000
    64                     12.93              0.61             74.62        1000
    128                    13.18              0.65             75.69        1000
    256                    13.42              0.62             77.71        1000
    512                    14.28              0.67             83.06        1000
    1024                   14.93              0.81             85.44        1000
    2048                   17.71              0.94            100.48        1000
    4096                   39.10              4.10            227.27        1000
    8192                   50.44              4.67            294.51        1000
    16384                  64.91              6.21            375.91        1000
    32768                 100.64              9.80            583.28        1000
    65536                 171.64             22.55            966.53         100
    131072                199.31             34.74            997.37         100
    262144                259.61             68.76            927.96         100
    524288                588.19            144.79           1817.62         100
    1048576              1988.30           1510.59           3094.54         100

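For anyone reproducing this, the same settings can probably be supplied through the environment instead of the command line. This is only a sketch: it relies on Open MPI's `OMPI_MCA_<param>` convention for MCA parameters and assumes the launcher forwards these variables to the remote nodes. I have not checked that this route triggers the same failure.

```shell
# Same MCA settings as the mpirun command line above, set through the environment.
# Open MPI treats a variable named OMPI_MCA_<param> as the MCA parameter <param>.
export OMPI_MCA_coll_han_priority=100
export OMPI_MCA_coll_adapt_priority=0
export OMPI_MCA_coll_hcoll_enable=1
export OMPI_MCA_coll_tuned_priority=10
export OMPI_MCA_coll_hcoll_priority=80

# The launch would then reduce to (not executed here):
# mpirun -N 4 -n 16 ~/src/osu-benchmarks/osu-micro-benchmarks-5.6.2/build/mpi/collective/osu_reduce
```

Values given with `--mca` on the command line take precedence over the environment, so the two forms should not be mixed when comparing runs.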
At the end of the run I get a Segfault:

    ==== backtrace (tid: 136568) ====
     0 libucs.so.0(ucs_handle_error+0x254) [0x7fe976256594]
     1 libucs.so.0(+0x2d777) [0x7fe976256777]
     2 libucs.so.0(+0x2da4e) [0x7fe976256a4e]
     3 /lib64/libpthread.so.0(+0x12b20) [0x7fe977b53b20]
     4 /opt/mellanox/hcoll/lib/libhcoll.so.1(hcoll_update_context_cache_on_group_destruction+0x9e) [0x7fe9774ff84e]
     5 /opt/mellanox/hcoll/lib/libhcoll.so.1(hcoll_context_free+0x148) [0x7fe9774fd4c8]
     6 libmpi.so.80(+0x1102d3) [0x7fe9787a02d3]
     7 libmpi.so.80(+0x678f9) [0x7fe9786f78f9]
     8 libmpi.so.80(ompi_attr_delete_all+0x173) [0x7fe9786f9293]
     9 libmpi.so.80(ompi_comm_free+0x3c) [0x7fe9786fc5ec]
    10 libmpi.so.80(+0x150246) [0x7fe9787e0246]
    11 libmpi.so.80(mca_coll_base_comm_unselect+0x1d79) [0x7fe97878a379]
    12 libmpi.so.80(+0x69cbc) [0x7fe9786f9cbc]
    13 libmpi.so.80(+0x6a2c9) [0x7fe9786fa2c9]
    14 libopen-pal.so.80(opal_finalize_cleanup_domain+0x4a) [0x7fe978b1292a]
    15 libopen-pal.so.80(opal_finalize+0x3f) [0x7fe978b12a9f]
    16 libmpi.so.80(ompi_rte_finalize+0x13a) [0x7fe9787281fa]
    17 libmpi.so.80(+0x9df6c) [0x7fe97872df6c]
    18 libmpi.so.80(ompi_mpi_instance_finalize+0xc5) [0x7fe97872f595]
    19 libmpi.so.80(ompi_mpi_finalize+0x163) [0x7fe978723f73]
    20 osu_reduce() [0x402716]
    21 /lib64/libc.so.6(__libc_start_main+0xf3) [0x7fe97779f493]
    22 osu_reduce() [0x40294e]
    =================================

It looks like `coll/hcoll` passes a corrupted/invalid context to `hcoll_context_free`.
Looks like the same cause as #9885 (?)