Open samnordmann opened 4 weeks ago
I just figured out that adding -x UCC_CL_BASIC_TLS=^mlx5 solves that bug. In the debug log we see that the non-master rank 1 prints
[1730120424.285612] [dgx1v-loki-23:26352:0] ucc_context.c:817 UCC DEBUG ctx create epilog for mlx5 failed: Unhandled error
and then enters context cleanup, which contains a barrier. Rank 0, meanwhile, initializes mlx5 successfully and so never reaches that barrier, hence the hang.
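To make the failure mode concrete, here is a toy model of the hang in plain Python threads. It is not UCC code: the function and variable names are hypothetical, the two threads stand in for ranks 0 and 1, and the timeout is only there so the sketch terminates instead of blocking forever the way the real job does.

```python
import threading


def simulate(timeout=0.5):
    """Toy model: only the rank whose mlx5 init failed reaches the cleanup barrier."""
    cleanup_barrier = threading.Barrier(2)  # the barrier expects both ranks
    outcome = {}

    def rank(rank_id, mlx5_init_ok):
        if mlx5_init_ok:
            # Successful init: this rank moves on and never enters cleanup.
            outcome[rank_id] = "initialized"
            return
        try:
            # Failed init: cleanup waits for a peer that never arrives.
            cleanup_barrier.wait(timeout=timeout)
            outcome[rank_id] = "cleanup finished"
        except threading.BrokenBarrierError:
            # The timeout stands in for the indefinite hang.
            outcome[rank_id] = "stuck in cleanup barrier"

    threads = [
        threading.Thread(target=rank, args=(0, True)),   # rank 0: init succeeds
        threading.Thread(target=rank, args=(1, False)),  # rank 1: init fails
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outcome


if __name__ == "__main__":
    print(simulate())
```

The point of the sketch is that a barrier inside an error path is only safe if every rank takes that path. As soon as the ranks disagree on whether init succeeded, the failing rank waits alone.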
[1730117228.178968] [dgx1v-loki-23:3000 :0] tl_cuda_cache.c:231 UCC ERROR ipc-cache: failed to open ipc mem handle. addr:0x7f65a8000000 len:16777216 err:201
# API headers version: 1.18.0, Git branch 'master', revision 9da106a