openucx / xccl


Multi-cluster collectives #129

Closed pavanbalaji closed 3 years ago

pavanbalaji commented 3 years ago

This PR provides an initial code draft for multi-cluster collectives. It provides a new team that wraps around UCX and NCCL teams and allows them to be used in a hierarchical fashion.

It also imports the "uthash" library. I wonder if that is acceptable or if a different hash library is preferred.
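
As an illustration of the hierarchical idea described above (this is not code from the PR), the sketch below simulates a three-phase allreduce in plain C: reduce inside each cluster, combine across cluster leaders, then broadcast back inside each cluster. The function name `hier_allreduce_sum`, the flat array standing in for ranks, and the fixed-size contiguous clusters are all assumptions made for the example. The thread does not say which of the wrapped teams (UCX or NCCL) would serve which level, so the sketch leaves that open.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical illustration only: each array slot stands in for one rank,
 * and a "cluster" is a contiguous group of ranks_per_cluster slots whose
 * first slot acts as the cluster leader. No XCCL, UCX, or NCCL API is used. */
void hier_allreduce_sum(double *vals, size_t n_clusters,
                        size_t ranks_per_cluster)
{
    size_t c, r;
    double total = 0.0;

    assert(vals != NULL && n_clusters > 0 && ranks_per_cluster > 0);

    /* Phase 1: reduce within each cluster onto its leader. */
    for (c = 0; c < n_clusters; c++) {
        double *leader = &vals[c * ranks_per_cluster];
        for (r = 1; r < ranks_per_cluster; r++) {
            *leader += leader[r];
        }
    }

    /* Phase 2: allreduce across the cluster leaders only. */
    for (c = 0; c < n_clusters; c++) {
        total += vals[c * ranks_per_cluster];
    }
    for (c = 0; c < n_clusters; c++) {
        vals[c * ranks_per_cluster] = total;
    }

    /* Phase 3: broadcast the result from each leader to its cluster. */
    for (c = 0; c < n_clusters; c++) {
        double *leader = &vals[c * ranks_per_cluster];
        for (r = 1; r < ranks_per_cluster; r++) {
            leader[r] = *leader;
        }
    }
}
```

Calling it on six simulated ranks split into two clusters of three leaves every slot holding the global sum. In a real hierarchical team, each phase would presumably be delegated to one of the wrapped teams rather than performed with loops.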

pavanbalaji commented 3 years ago

Sorry, something seems to have gone wrong in my code import. I'm looking into it.

Sergei-Lebedev commented 3 years ago

It looks like the hierarchical team tries to use it. If that is not the expected behavior, you need to disable TL_MPOD in the hier team.

pavanbalaji commented 3 years ago

Fixed the build issue. Also took the opportunity to rebase onto the latest master.

pavanbalaji commented 3 years ago

Any comments on this PR?

bureddy commented 3 years ago

@pavanbalaji Overall it looks fine. It is acceptable here, as XCCL is an experimental codebase and is going to be replaced by UCC. It will need broader community acceptance when you port this to UCC. You are welcome to bring this PR to one of UCC's weekly meetings (https://github.com/openucx/ucc/wiki/UCF-Collectives-Working-Group) and discuss it there.

pavanbalaji commented 3 years ago

Thank you, @bureddy. For now, I think it's good to get it into XCCL first. If it's acceptable, can it be merged in?

bureddy commented 3 years ago

@pavanbalaji Do you want this in the master branch or the torch branch?
I think Pallab and Srinivas are using the "torch" branch for FB POCs.

pavanbalaji commented 3 years ago

Hi @bureddy: I was planning to do it in both. Once it gets into master, I was planning to submit another PR for the torch branch.

pavanbalaji commented 3 years ago

Hi @bureddy: Any more comments on this or is this ready to be merged?