janjust closed this 2 months ago
Without setting -x UCC_TL_MLX5_MIN_TEAM_SIZE=
[1713901146.159619] [cascade1:17008:0] ucc_tl.c:293 TL_MLX5 DEBUG team size 16 is too small, min supported -2
And when a user wants to enable tl_mlx5, they need to set its min team size explicitly as well.
eg: -x UCC_TL_MLX5_MIN_TEAM_SIZE=2
[1713901217.005463] [cascade1:17191:0] tl_mlx5_team.c:99 TL_MLX5 DEBUG finalizing tl team: 0x329b700
Why not use a flag to deactivate the tl then? "Min team size" has another meaning, and using it for that purpose sounds like a hack. We should be able to deactivate the tl at the core level.
Since MCAST_ENABLE is disabled by default, should I understand that the problem comes from a2a team creation?
We decided to take a different approach, so closing this.
This PR increases the default min supported team size for tl_mlx5 to uint32_max, thus disabling the tl_mlx5 team unless users override the min size via the environment variable.
We have noticed several corner-case bugs in tl_mlx5 algorithms. Therefore, until the features mature, we disable tl_mlx5 when the team is built rather than not building the team at all.
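For illustration, the gating described above could look roughly like the sketch below. This is not the actual tl_mlx5 code: the `sketch_*` names and the parsing logic are hypothetical, and the only details taken from this PR are the uint32_max default and the idea of overriding it through an environment variable.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical default: no realistic team reaches this size,
 * so the TL is effectively disabled unless the user overrides it. */
#define SKETCH_DEFAULT_MIN_TEAM_SIZE UINT32_MAX

/* Read the min team size from the environment variable env_name.
 * Falls back to the default when the variable is unset, empty,
 * not a plain decimal number, or out of the uint32_t range. */
static uint32_t sketch_min_team_size(const char *env_name)
{
    const char    *val;
    char          *end;
    unsigned long  parsed;

    assert(env_name != NULL);
    val = getenv(env_name);
    if (val == NULL || *val == '\0') {
        return SKETCH_DEFAULT_MIN_TEAM_SIZE;
    }
    parsed = strtoul(val, &end, 10);
    if (*end != '\0' || parsed > UINT32_MAX) {
        return SKETCH_DEFAULT_MIN_TEAM_SIZE;
    }
    return (uint32_t)parsed;
}

/* A team is accepted only when it has at least min_team_size ranks. */
static int sketch_team_supported(uint32_t team_size, uint32_t min_team_size)
{
    return team_size >= min_team_size;
}
```

Under these assumptions, a 16-rank team is rejected with the default and accepted once the variable is set to 2, which matches the two log excerpts in this thread.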