openucx / ucc

Unified Collective Communication Library
https://openucx.github.io/ucc/
BSD 3-Clause "New" or "Revised" License

TL/MLX5: adding mcast allgather staging based algo #994

Open MamziB opened 3 days ago

MamziB commented 3 days ago

What

Add a staging-based algorithm for multicast (mcast) allgather to TL/MLX5.

Why

Better scalability and performance than the software-based allgather.

How

The allgather is realized as N concurrent bcast operations, where N is the team size and every process acts as the root of one bcast. For reliability, it reuses the one-sided design that was merged earlier.
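For reviewers who want the data movement spelled out, here is a minimal in-process sketch of the decomposition described above. It is not the TL/MLX5 code: the names `bcast` and `allgather_via_bcast` are hypothetical, ranks are simulated as Python lists, and the broadcasts run one after another instead of concurrently over hardware multicast. Staging and the one-sided reliability protocol are not modeled.

```python
def bcast(buffers, root, block, block_size):
    """Simulate one broadcast: every rank stores the root's block in slot `root`."""
    offset = root * block_size
    for recv in buffers:
        recv[offset:offset + block_size] = block


def allgather_via_bcast(send_blocks):
    """Allgather expressed as N broadcasts, one rooted at each rank.

    send_blocks[r] is the block contributed by rank r; all blocks are
    assumed to have the same length.
    """
    n = len(send_blocks)               # team size
    block_size = len(send_blocks[0])
    # One receive buffer per simulated rank, large enough for N blocks.
    buffers = [[None] * (n * block_size) for _ in range(n)]
    for root in range(n):              # every rank takes a turn as root
        bcast(buffers, root, send_blocks[root], block_size)
    return buffers


result = allgather_via_bcast([[0, 1], [10, 11], [20, 21]])
print(result[0])  # [0, 1, 10, 11, 20, 21]
```

Because each broadcast writes to a disjoint slot of the receive buffer, the N broadcasts do not depend on one another, which is what allows them to be issued concurrently.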