Hi, I am compiling Open MPI 4.1.6 against the UCC and UCX shipped with HPCX 2.17.1. Running any OSU collective benchmark (e.g. osu_alltoall) results in a segfault in UCC when I start more than 17 ranks. This is what I get:
I also added verbosity to the ucc MCA component and see the following messages when Open MPI initializes (not sure if this is relevant):
Does anyone have ideas about what the problem might be?
Thanks!
mca_coll_ucc_save_coll_handlers fails only if there is no other collective component to fall back to. Can you please share the command line you use to run this test?
This was tested with:
mpirun -mca coll_hcoll_enable 0 -mca coll_ucc_enable 1 -mca coll_ucc_verbose 100 ./osu_alltoall
but apart from hcoll, the standard tuned collectives are still available, so I think there is a fallback. Also, the test works when I run on fewer than 18 ranks.
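To double-check which collective component is actually selected at runtime, the coll framework's own verbosity should log the per-communicator selection; a minimal sketch, assuming coll_base_verbose behaves as in stock Open MPI builds:
mpirun -mca coll_base_verbose 10 -mca coll_hcoll_enable 0 -mca coll_ucc_enable 1 ./osu_alltoall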
Can you please rerun adding --mca coll_ucc_priority 100? It might be related to this issue: https://github.com/open-mpi/ompi/issues/9885
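For example, the command above with the priority flag added (a sketch, assuming the same setup):
mpirun -mca coll_hcoll_enable 0 -mca coll_ucc_enable 1 -mca coll_ucc_priority 100 -mca coll_ucc_verbose 100 ./osu_alltoall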
Well, that seems to do the trick for me! Here are some sample results for osu_allreduce on 512 ranks / 4 compute nodes / HDR100 / HPCX 2.17.1. It seems that UCC is still quite a bit slower than HCOLL on this system. Is there anything I could tune here to get better results? Note that I set UCX_PROTO_ENABLE=n, because otherwise HCOLL suffers (https://github.com/openucx/ucx/issues/9914), while this setting seems to have little impact on UCC.
HCOLL
mpirun -x UCX_PROTO_ENABLE=n -H b3115:128,b3142:128,b3331:128,b4132:128 -mca coll_ucc_priority 100 -mca coll_ucc_enable 0 -mca coll_hcoll_enable 1 ./osu_allreduce
# OSU MPI Allreduce Latency Test v7.4
# Datatype: MPI_CHAR.
# Size Avg Latency(us)
1 7.98
2 7.40
4 7.66
8 7.70
16 7.90
32 8.93
64 7.77
128 8.10
256 10.13
512 10.02
1024 11.73
2048 14.36
4096 19.15
8192 28.63
16384 47.27
32768 498.13
65536 152.94
131072 235.62
262144 443.80
524288 752.25
1048576 1519.43
UCC
mpirun -x UCX_PROTO_ENABLE=n -H b3115:128,b3142:128,b3331:128,b4132:128 -mca coll_ucc_priority 100 -mca coll_ucc_enable 1 -mca coll_hcoll_enable 0 ./osu_allreduce
# OSU MPI Allreduce Latency Test v7.4
# Datatype: MPI_CHAR.
# Size Avg Latency(us)
1 22.92
2 20.61
4 20.22
8 20.19
16 20.18
32 20.80
64 21.87
128 25.90
256 28.11
512 36.36
1024 56.27
2048 104.75
4096 48.22
8192 58.38
16384 79.96
32768 155.20
65536 265.00
131072 419.92
262144 771.05
524288 1541.54
1048576 3233.63
There was an issue with the MPI_CHAR datatype and reductions in HPCX 2.17; I think the fix was included in HPCX 2.18 (https://github.com/openucx/ucc/pull/918). For other cases, @nsarka is working on allreduce performance improvements.
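If you want to confirm that the MPI_CHAR issue is what you are hitting, rerunning with a different reduction datatype should narrow it down; a sketch, assuming your OSU Micro-Benchmarks build supports the -T datatype option:
mpirun -x UCX_PROTO_ENABLE=n -H b3115:128,b3142:128,b3331:128,b4132:128 -mca coll_ucc_priority 100 -mca coll_ucc_enable 1 -mca coll_hcoll_enable 0 ./osu_allreduce -T mpi_float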
Thanks a lot! I'll run more perf tests.