openucx / ucc

Unified Collective Communication Library
https://openucx.github.io/ucc/
BSD 3-Clause "New" or "Revised" License

TL/MLX5: Disable mlx5 team by default #966

Closed janjust closed 2 months ago

janjust commented 2 months ago

This PR raises the default minimum team size supported by tl_mlx5 to uint32_max, which disables the tl_mlx5 team unless users override the minimum size via an environment variable.

We have noticed several corner-case bugs in the tl_mlx5 algorithms. Until the features mature, we therefore disable tl_mlx5 at team creation time rather than not building the TL at all.
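To illustrate the idea, here is a minimal sketch of a size gate whose default threshold is unreachable. It is not the UCC implementation: the names `SKETCH_DEFAULT_MIN_TEAM_SIZE`, `sketch_min_team_size` and `sketch_team_size_ok` are hypothetical, and the override is assumed to arrive as a plain decimal string, whereas UCC reads its settings through its own configuration layer.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical default: a threshold no real team can reach,
 * so the gate rejects every team unless it is overridden. */
#define SKETCH_DEFAULT_MIN_TEAM_SIZE UINT32_MAX

/* Parse an override string (for example, the value of an environment
 * variable). Falls back to the default when the string is unset,
 * empty, malformed, or out of range for uint32_t. */
uint32_t sketch_min_team_size(const char *value)
{
    char *end = NULL;
    unsigned long long parsed;

    if (value == NULL || *value == '\0') {
        return SKETCH_DEFAULT_MIN_TEAM_SIZE;
    }
    errno  = 0;
    parsed = strtoull(value, &end, 10);
    if (errno != 0 || *end != '\0' || parsed > UINT32_MAX) {
        return SKETCH_DEFAULT_MIN_TEAM_SIZE;
    }
    return (uint32_t)parsed;
}

/* A team passes the gate only if it has at least min_size ranks. */
int sketch_team_size_ok(uint32_t team_size, uint32_t min_size)
{
    return team_size >= min_size;
}
```

With no override, a 16-rank team fails the gate; with an override of "2", the same team passes. That matches the behaviour described in this PR, where the TL stays compiled in but its team is not created by default.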

janjust commented 2 months ago

Without setting -x UCC_TL_MLX5_MIN_TEAM_SIZE=

[1713901146.159619] [cascade1:17008:0] ucc_tl.c:293 TL_MLX5 DEBUG team size 16 is too small, min supported -2

And when a user wants to enable tl_mlx5, they need to set its min size as well, e.g.: -x UCC_TL_MLX5_MIN_TEAM_SIZE=2

[1713901217.005463] [cascade1:17191:0] tl_mlx5_team.c:99 TL_MLX5 DEBUG finalizing tl team: 0x329b700

samnordmann commented 2 months ago

Why not use a flag to deactivate the TL, then? "Min team size" has another meaning, and using it for this purpose sounds like a hack. We should be able to deactivate the TL at the core level.

Since MCAST_ENABLE is disabled by default, should I understand that the problem comes from a2a team creation?

janjust commented 2 months ago

We decided to take a different approach, so I'm closing this.