openucx / ucc

Unified Collective Communication Library
https://openucx.github.io/ucc/
BSD 3-Clause "New" or "Revised" License

TL/MLX5: Fix segmentation fault in a2a mpi test #996

Closed x41lakazam closed 3 months ago

x41lakazam commented 4 months ago

Related bug: https://redmine.mellanox.com/issues/3706049

What

- Set the rcache alignment back from ucc_get_page_size() to UCS_PGT_ADDR_ALIGN
- Re-activate the tl/mlx5 alltoall

Why?

This bug reproduces only when using a ucx version older than https://github.com/openucx/ucx/commit/85d2d9d0f, which introduced dynamic rcache alignment. https://github.com/openucx/ucc/pull/877 (specifically https://github.com/openucx/ucc/commit/b13b87d0a32a326bf60a322525979e5e12533807) changed the alignment from UCS_PGT_ADDR_ALIGN to ucc_get_page_size(). Setting the alignment back to UCS_PGT_ADDR_ALIGN fixes the segfault.

The root cause has not yet been identified.
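
For reference, a minimal sketch in C of the kind of change this PR makes, assuming a UCX-style ucs_rcache_params_t is filled in when tl/mlx5 creates its registration cache; the helper name here is hypothetical, and the actual call site and field names in tl_mlx5 may differ:

#include <ucs/memory/rcache.h>      /* ucs_rcache_params_t (assumed field: alignment) */
#include <ucs/datastruct/pgtable.h> /* UCS_PGT_ADDR_ALIGN */

/* Hypothetical helper, not the actual tl/mlx5 code: restore the rcache
 * region alignment that PR #877 had changed to the system page size. */
static void tl_mlx5_set_rcache_alignment(ucs_rcache_params_t *params)
{
    params->alignment = UCS_PGT_ADDR_ALIGN; /* was: ucc_get_page_size() */
}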

Performance tests

Below is a performance comparison between tl/ucp, tl/mlx5, and hcoll.

TL/UCP:

$ mpirun \
--mca coll_ucc_cts alltoall \
--mca coll_ucc_enable 1 \
--mca coll_hcoll_enable 0 \
-x UCC_COLL_TRACE=info \
-x LD_PRELOAD=$PWD/install/lib/libucc.so \
-x UCX_NET_DEVICES=mlx5_2:1 \
-x UCC_TL_UCP_TUNE=inf \
-np 736 \
--map-by ppr:32:node \
$HPCX_OSU_DIR/osu_alltoall -m 1:128 -f

# Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations
     1               8758.46           7628.46           9876.10        1000
     2               8769.55           7674.58           9880.20        1000
     4               8789.50           7658.54           9896.35        1000
     8               8781.75           7664.65           9883.95        1000
    16               8799.08           7647.83           9899.20        1000
    32               8921.46           7744.85          10028.75        1000
    64               8585.21           7491.80           9682.50        1000
   128               8458.66           7740.09           9893.40        1000

TL/MLX5:

$ mpirun \
--mca coll_ucc_cts alltoall \
--mca coll_ucc_enable 1 \
--mca coll_hcoll_enable 0 \
-x UCC_COLL_TRACE=info \
-x LD_PRELOAD=$PWD/install/lib/libucc.so \
-x UCC_TL_MLX5_NET_DEVICES=mlx5_0:1 \
-x UCX_NET_DEVICES=mlx5_2:1 \
-x UCC_TL_MLX5_TUNE=inf \
-x UCC_TL_SHARP_TUNE=0 \
-x UCX_RC_MLX5_DM_COUNT=0 \
-x UCX_DC_MLX5_DM_COUNT=0 \
-np 736 \
--map-by ppr:32:node \
$HPCX_OSU_DIR/osu_alltoall -m 1:128 -f

# Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations
     1                 95.85             60.23            114.01        1000
     2                 95.76             60.02            114.46        1000
     4                100.96             68.69            119.43        1000
     8                112.16             76.55            130.74        1000
    16                178.82            132.88            210.77        1000
    32                201.71            156.84            233.80        1000
    64                336.13            270.47            431.30        1000
   128                643.67            505.20            762.88        1000

HCOLL:

$ mpirun \
  --mca coll_ucc_enable 0 \
  --mca coll_hcoll_enable 1 \
  -x UCC_COLL_TRACE=info \
  -np $np \
  --map-by ppr:$ppn:node \
  $HPCX_OSU_DIR/osu_alltoall -m 1:128 -f

# OSU MPI All-to-All Personalized Exchange Latency Test v7.2
# Datatype: MPI_CHAR.
# Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations
     1                112.15             73.49            148.97        1000
     2                118.68             75.98            167.90        1000
     4                139.34             79.86            189.09        1000
     8                301.91            266.75            329.19        1000
    16                119.70             88.27            145.37        1000
    32                235.14            183.34            277.67        1000
    64                379.98            301.66            454.02        1000
   128                578.52            438.08            713.28        1000