openucx / ucc

Unified Collective Communication Library
https://openucx.github.io/ucc/
BSD 3-Clause "New" or "Revised" License

TL/MLX5: fix team init error handling flow #953

Closed: samnordmann closed this 2 months ago

samnordmann commented 2 months ago

What

Fix the error handling to avoid a segfault when tl/mlx5/a2a team creation fails.
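
For illustration only, here is a minimal sketch of the general failure mode this kind of fix targets: an init routine that fails partway and then has to clean up a partially built object. This is not the actual tl/mlx5 code or the change in this PR. Every name in it (`demo_team_t`, `demo_team_create`, `demo_team_cleanup`, the `fail_stage` parameter) is hypothetical and exists only to show the pattern.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical team object; NOT the real tl/mlx5 data structure. */
typedef struct demo_team {
    int *a2a_buf;  /* resource acquired in stage 1 */
    int *ctrl_buf; /* resource acquired in stage 2 */
} demo_team_t;

/* Release whatever has been acquired so far. This is safe on a partially
 * built team: fields start out NULL (calloc) and free(NULL) is a no-op. */
static void demo_team_cleanup(demo_team_t *team)
{
    if (team == NULL) {
        return;
    }
    free(team->ctrl_buf);
    free(team->a2a_buf);
    free(team);
}

/* fail_stage: 0 = succeed, 1 = simulate a stage 1 failure,
 * 2 = simulate a stage 2 failure. Returns 0 on success, -1 on error.
 * On error, *out is NULL and nothing is leaked. */
static int demo_team_create(int fail_stage, demo_team_t **out)
{
    demo_team_t *team;

    assert(out != NULL);
    *out = NULL;

    team = calloc(1, sizeof(*team));
    if (team == NULL) {
        return -1;
    }

    team->a2a_buf = (fail_stage == 1) ? NULL : malloc(16 * sizeof(int));
    if (team->a2a_buf == NULL) {
        goto err;
    }

    team->ctrl_buf = (fail_stage == 2) ? NULL : malloc(16 * sizeof(int));
    if (team->ctrl_buf == NULL) {
        goto err;
    }

    *out = team;
    return 0;

err:
    /* Single exit path: never touches fields that were not initialized. */
    demo_team_cleanup(team);
    return -1;
}
```

In this sketch, `demo_team_create(2, &t)` returns -1 and leaves `t` set to NULL, so a caller that checks the return code never dereferences a half-initialized team.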

Sergei-Lebedev commented 2 months ago

Is this change already addressed by https://github.com/openucx/ucc/pull/946, or is it an additional fix?

samnordmann commented 2 months ago

> Is this change already addressed by #946, or is it an additional fix?

It's an additional fix.

MamziB commented 2 months ago

@manjugv can you please take a look? This is critical: MTT is failing because of it, and Daria has reported it to us.