openucx / ucc

Unified Collective Communication Library
https://openucx.github.io/ucc/
BSD 3-Clause "New" or "Revised" License

Error and hang on v100 #1041

Open samnordmann opened 4 weeks ago

samnordmann commented 4 weeks ago

I just figured out that adding -x UCC_CL_BASIC_TLS=^mlx5 solves this bug. In the debug log, the non-master rank 1 prints

[1730120424.285612] [dgx1v-loki-23:26352:0] ucc_context.c:817 UCC DEBUG ctx create epilog for mlx5 failed: Unhandled error

and then enters context cleanup, which contains a barrier. Rank 0, meanwhile, initializes mlx5 successfully, so it presumably never reaches that barrier and rank 1 waits on it forever, hence the hang.
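To make the suspected mechanism concrete, here is a toy model of it using Python threads. It is only a sketch of the mismatched-barrier pattern described above: `create_context`, `run`, and the timeout are hypothetical names invented for illustration, not UCC APIs, and the timeout exists only so the demo terminates instead of hanging.

```python
import threading

NUM_RANKS = 2  # assumption: two ranks, as in the log above


def create_context(rank, tl_init_ok, barrier, results, timeout=0.5):
    """Toy model of one rank creating a context (not real UCC code)."""
    if tl_init_ok:
        # Successful init: this rank never enters the cleanup barrier.
        results[rank] = "ok"
        return
    try:
        # Failed init: cleanup waits for every rank to arrive.
        barrier.wait(timeout=timeout)
        results[rank] = "cleaned up"
    except threading.BrokenBarrierError:
        # The peer never arrived. Without a timeout this would block forever.
        results[rank] = "hang"


def run(init_ok_per_rank):
    """Run one thread per rank and return each rank's outcome."""
    barrier = threading.Barrier(NUM_RANKS)
    results = {}
    threads = [
        threading.Thread(target=create_context, args=(rank, ok, barrier, results))
        for rank, ok in enumerate(init_ok_per_rank)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(run([True, False]))   # rank 0 succeeds, rank 1 fails
    print(run([False, False]))  # both ranks fail
```

In this model the cleanup barrier only completes when every rank takes the failure path. When just one rank fails, that rank is left waiting alone, which matches the behavior seen with mlx5 on rank 1.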