openucx / ucc

Unified Collective Communication Library
https://openucx.github.io/ucc/
BSD 3-Clause "New" or "Revised" License

CL/HIER: check number of TLs per SBGP #919

Closed Sergei-Lebedev closed 7 months ago

Sergei-Lebedev commented 7 months ago

What

Add a check to CL/HIER to prevent the SBGP TL list from overflowing.

Why ?

On some systems CL/HIER might try to use more than 4 TLs per SBGP, but 4 is a compile-time constant limit on the list size. Fixes bug https://redmine.mellanox.com/issues/3767158

How ?

Ignore further TLs if the list is already full. Filter out a TL up front if we know in advance that it does not support the SBGP size.
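
For illustration only, here is a minimal sketch of the two rules above: skip a TL that cannot serve the SBGP size, and stop appending once the fixed-size list is full. It is not the actual CL/HIER code. All names (`MAX_TLS_PER_SBGP`, `tl_desc_t`, `sbgp_tl_list_t`, `sbgp_select_tls`) and the min/max size fields are hypothetical and chosen for this example; only the limit of 4 comes from the description.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical name for the compile-time cap; the PR only states the value 4. */
#define MAX_TLS_PER_SBGP 4

/* Hypothetical TL descriptor: a name plus the SBGP size range it supports. */
typedef struct {
    const char *name;
    size_t      min_sbgp_size;
    size_t      max_sbgp_size;
} tl_desc_t;

/* Fixed-capacity list of TLs selected for one SBGP. */
typedef struct {
    const tl_desc_t *tls[MAX_TLS_PER_SBGP];
    size_t           n_tls;
} sbgp_tl_list_t;

/* Returns 1 if the TL can serve an SBGP of the given size, 0 otherwise. */
int tl_supports_sbgp_size(const tl_desc_t *tl, size_t sbgp_size)
{
    return sbgp_size >= tl->min_sbgp_size && sbgp_size <= tl->max_sbgp_size;
}

/*
 * Fills `list` from `cands` in order and returns how many candidates were
 * skipped, either because they do not support `sbgp_size` or because the
 * list was already full.
 */
size_t sbgp_select_tls(sbgp_tl_list_t *list, const tl_desc_t *cands,
                       size_t n_cands, size_t sbgp_size)
{
    size_t i, skipped = 0;

    assert(list != NULL);
    list->n_tls = 0;
    for (i = 0; i < n_cands; i++) {
        /* Filter first, so unsupported TLs never take a slot. */
        if (!tl_supports_sbgp_size(&cands[i], sbgp_size)) {
            skipped++;
            continue;
        }
        /* Bounds check before the write: ignore TLs once the list is full. */
        if (list->n_tls == MAX_TLS_PER_SBGP) {
            skipped++;
            continue;
        }
        list->tls[list->n_tls++] = &cands[i];
    }
    return skipped;
}
```

Filtering before the capacity check matters in this sketch: a TL that cannot be used never consumes one of the four slots, so usable TLs later in the candidate list are less likely to be dropped.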

artemry-nv commented 7 months ago

bot:retest