pavanbalaji closed this PR 3 years ago.
Sorry, something seems to have gone wrong in my code import. I'm looking into it.
It looks like the hierarchical team tries to use it. If that is not the expected behavior, you need to disable TL_MPOD in the hier team.
Fixed the build issue. I also took the opportunity to rebase onto the latest master.
Any comments on this PR?
@pavanbalaji Overall it looks fine. It is acceptable here because XCCL is an experimental codebase that is going to be replaced by UCC. When you port this into UCC, it will need broader community acceptance. You are welcome to bring this PR to one of the UCC weekly meetings (https://github.com/openucx/ucc/wiki/UCF-Collectives-Working-Group) and discuss it there.
Thank you, @bureddy. For now, I think it's good to get it into XCCL first. If it's acceptable, can it be merged in?
@pavanbalaji Do you want this in the master branch or the torch branch?
I think Pallab/Srinivas are using the "torch" branch for FB POCs.
Hi @bureddy: I was planning to do it in both. Once it gets into master, I was planning to submit another PR for the torch branch.
Hi @bureddy: Any more comments on this or is this ready to be merged?
This PR provides an initial draft of multi-cluster collectives. It adds a new team that wraps UCX and NCCL teams and allows them to be used in a hierarchical fashion.
It also imports the "uthash" library. I wonder if that is acceptable or if a different hash library is preferred.