openucx / ucc

Unified Collective Communication Library
https://openucx.github.io/ucc/
BSD 3-Clause "New" or "Revised" License
177 stars 85 forks source link

TL/MLX5: adding mcast helper functions #880

Closed MamziB closed 5 months ago

MamziB commented 7 months ago

TL/MLX5: adding mcast helper functions

MamziB commented 7 months ago

@Sergei-Lebedev Thanks for the comments, I appreciate your time. I incorporated your comments, and updated the commit. @samnordmann Please kindly take a look at this PR and let me know your thoughts. Your constructive comments are much appreciated.
Please note that HPC_SDK fails somewhere out of this PR scope:

"/__w/ucc/ucc/src/coll_patterns/double_binary_tree.h", line 55: error #68: integer conversion resulted in a change of sign [integer_sign_change]
          return -1;
MamziB commented 7 months ago

@samnordmann Thanks for all the comments I appreciate it. @Sergei-Lebedev Please let me know if you have more comment otherwise we can finalize this PR

MamziB commented 7 months ago

Thank you @Sergei-Lebedev for the latest comments, I have updated the commit. @samnordmann Please kindly let me know if you have further thoughts about the PR otherwise we can finalize it

MamziB commented 7 months ago

@Sergei-Lebedev build UCC is failing in this PR because of some issues unrelated to it:

"/__w/ucc/ucc/src/coll_patterns/double_binary_tree.h", line 139: error #68: integer conversion resulted in a change of sign [integer_sign_change]
Sergei-Lebedev commented 6 months ago

@Sergei-Lebedev build UCC is failing in this PR because of some issues unrelated to it:

"/__w/ucc/ucc/src/coll_patterns/double_binary_tree.h", line 139: error #68: integer conversion resulted in a change of sign [integer_sign_change]

this issue is fixed, but now CI fails due to another issue that seems to be relevant to this PR

swx-clx02:rank4.lt-ucc_test_mpi: Failed to modify UD QP to INIT on mlx5_0: Operation not permitted
MamziB commented 6 months ago

@Sergei-Lebedev This error is not related to this PR. If it was, it should have printed this:

failed to move mcast qp to INIT
MamziB commented 6 months ago

@Sergei-Lebedev I just updated the commit. Please let me know if you have further comments. Thanks

Sergei-Lebedev commented 6 months ago

@MamziB please fix CI failures