openucx / ucc

Unified Collective Communication Library
https://openucx.github.io/ucc/
BSD 3-Clause "New" or "Revised" License

TL/MLX5: fix context create hang #887

Closed Sergei-Lebedev closed 10 months ago

Sergei-Lebedev commented 10 months ago

What

Fix a hang in TL/MLX5 context creation.

How?

If no IB devices are found, PD_OWNER_RANK doesn't start the service bcast, so the other ranks hang in sbcast waiting for PD_OWNER_RANK.
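
The hang pattern can be illustrated with a small thread-based simulation. This is only a sketch, not UCC code: `ServiceBcast`, `context_create`, `run`, the status codes, and the choice of rank 0 as `PD_OWNER_RANK` are all hypothetical names and values introduced for illustration. It assumes one common way to avoid this kind of hang, namely that the owner rank always enters the broadcast and sends a status, even when device discovery fails.

```python
import queue
import threading

PD_OWNER_RANK = 0       # name from the PR; the value 0 is an assumption
STATUS_OK = 0           # hypothetical status codes
STATUS_NO_DEVICE = -1


class ServiceBcast:
    """Toy stand-in for a service broadcast: the root posts one value per peer."""

    def __init__(self, n_ranks, root):
        self.root = root
        self._queues = {r: queue.Queue() for r in range(n_ranks) if r != root}

    def bcast(self, rank, value=None, timeout=1.0):
        if rank == self.root:
            for q in self._queues.values():
                q.put(value)
            return value
        # Peers block here until the root posts. The timeout turns a would-be
        # hang into a queue.Empty exception so it is visible in a test.
        return self._queues[rank].get(timeout=timeout)


def context_create(rank, bcast, ib_devices, results):
    if rank == PD_OWNER_RANK:
        status = STATUS_OK if ib_devices else STATUS_NO_DEVICE
        # Always enter the broadcast, even on failure, so peers are released.
        bcast.bcast(rank, status)
    else:
        status = bcast.bcast(rank)
    results[rank] = status


def run(n_ranks, ib_devices):
    """Run context_create on n_ranks simulated ranks and collect statuses."""
    bcast = ServiceBcast(n_ranks, PD_OWNER_RANK)
    results = {}
    threads = [
        threading.Thread(target=context_create,
                         args=(r, bcast, ib_devices, results))
        for r in range(n_ranks)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

If the owner rank returned early instead of calling `bcast` when `ib_devices` is empty, every peer would block in `get` until the timeout fires, which mirrors the hang described above. With a single rank the simulated broadcast has no peers and returns immediately.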

samnordmann commented 10 months ago

Is it valid to do sbcast if ppn=1?

Never mind, I missed the part that handles it.