Closed x41lakazam closed 3 months ago
Related bug: https://redmine.mellanox.com/issues/3706049
Set rcache alignment back from ucc_get_page_size() to UCS_PGT_ADDR_ALIGN Re-activate tl/mlx5 alltoall
ucc_get_page_size()
UCS_PGT_ADDR_ALIGN
This bug reproduces only when using ucx anterior to https://github.com/openucx/ucx/commit/85d2d9d0f, which introduced dynamic rcache alignment. https://github.com/openucx/ucc/pull/877 (specifically https://github.com/openucx/ucc/commit/b13b87d0a32a326bf60a322525979e5e12533807) sets the alignment to ucc_get_page_size() whereas it was UCS_PGT_ADDR_ALIGN before. Setting the alignment back to UCS_PGT_ADDR_ALIGN solves the bug.
The reason is yet to be found out.
Below is a comparison of the performances with tl/ucp and hcoll
TL/UCP: $ mpirun \ --mca coll_ucc_cts alltoall \ --mca coll_ucc_enable 1 \ --mca coll_hcoll_enable 0 \ -x UCC_COLL_TRACE=info \ -x LD_PRELOAD=$PWD/install/lib/libucc.so \ -x UCX_NET_DEVICES=mlx5_2:1 \ -x UCC_TL_UCP_TUNE=inf \ -np 736 \ --map-by ppr:32:node \ $HPCX_OSU_DIR/osu_alltoall -m 1:128 -f # Size Avg Latency(us) Min Latency(us) Max Latency(us) Iterations 1 8758.46 7628.46 9876.10 1000 2 8769.55 7674.58 9880.20 1000 4 8789.50 7658.54 9896.35 1000 8 8781.75 7664.65 9883.95 1000 16 8799.08 7647.83 9899.20 1000 32 8921.46 7744.85 10028.75 1000 64 8585.21 7491.80 9682.50 1000 128 8458.66 7740.09 9893.40 1000 TL/MLX5: $ mpirun \ --mca coll_ucc_cts alltoall \ --mca coll_ucc_enable 1 \ --mca coll_hcoll_enable 0 \ -x UCC_COLL_TRACE=info \ -x LD_PRELOAD=$PWD/install/lib/libucc.so \ -x UCC_TL_MLXS_NET_DEVICES=mlx5_0:1 \ -x UCX_NET_DEVICES=mlx5_2:1 \ -x UCC_TL_MLX5_TUNE=inf \ -x UCC_TL_SHARP_TUNE=0 \ -x UCC_RC_MLX5_DM_COUNT=0 \ -x UCC_DC_MLX5_DM_COUNT=0 \ -np 736 \ --map-by ppr:32:node \ $HPCX_OSU_DIR/osu_alltoall -m 1:128 -f # Size Avg Latency(us) Min Latency(us) Max Latency(us) Iterations 1 95.85 60.23 114.01 1000 2 95.76 60.02 114.46 1000 4 100.96 68.69 119.43 1000 8 112.16 76.55 130.74 1000 16 178.82 132.88 210.77 1000 32 201.71 156.84 233.80 1000 64 336.13 270.47 431.30 1000 128 643.67 505.20 762.88 1000 HCOLL: $ mpirun \ --mca coll_ucc_enable 0 \ --mca coll_hcoll_enable 1 \ -x UCC_COLL_TRACE=info \ -np $np \ --map-by ppr:$ppn:node \ $HPCX_OSU_DIR/osu_alltoall -m 1:128 -f # OSU MPI All-to-All Personalized Exchange Latency Test v7.2 # Datatype: MPI_CHAR. # Size Avg Latency(us) Min Latency(us) Max Latency(us) Iterations 1 112.15 73.49 148.97 1000 2 118.68 75.98 167.90 1000 4 139.34 79.86 189.09 1000 8 301.91 266.75 329.19 1000 16 119.70 88.27 145.37 1000 32 235.14 183.34 277.67 1000 64 379.98 301.66 454.02 1000 128 578.52 438.08 713.28 1000
Related bug: https://redmine.mellanox.com/issues/3706049
What
Set rcache alignment back from
ucc_get_page_size()
toUCS_PGT_ADDR_ALIGN
Re-activate tl/mlx5 alltoallWhy ?
This bug reproduces only when using ucx anterior to https://github.com/openucx/ucx/commit/85d2d9d0f, which introduced dynamic rcache alignment. https://github.com/openucx/ucc/pull/877 (specifically https://github.com/openucx/ucc/commit/b13b87d0a32a326bf60a322525979e5e12533807) sets the alignment to
ucc_get_page_size()
whereas it wasUCS_PGT_ADDR_ALIGN
before. Setting the alignment back toUCS_PGT_ADDR_ALIGN
solves the bug.The reason is yet to be found out.
Performance tests
Below is a comparison of the performances with tl/ucp and hcoll