Open YangZhou1997 opened 2 days ago
Thank you, Sergey, for your quick response! That's super helpful. Could you tell me more about the deadlock, or are there any materials I can read through?
Best, Yang
On Thu, Oct 17, 2024 at 4:25 PM Sergey Lebedev @.***> wrote:
Hi @YangZhou1997
UCC can theoretically support collective communication across Nvidia and AMD GPUs in a single workload, but with key restrictions:
- It will only work with TL UCP and TL SHARP. Other transports aren’t compatible due to non-homogeneous memory, which can cause deadlocks.
- For reduction collectives, the local source and destination buffers on each rank must have the same memory type.
- Deadlocks from memory mismatches could be avoided by running a small allreduce before each collective.
While possible, this setup hasn’t been tested and would require careful handling to ensure stability.
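One way to read the "small allreduce before each collective" suggestion is as an agreement step: every rank contributes its local memory type, and all ranks then see the same combined view before picking a transport. The sketch below only simulates that idea in plain Python. The flag values, `simulated_allreduce_bor`, and `agree_on_transport` are hypothetical names for illustration, not UCC APIs, and the fallback to UCP for mixed memory is an assumption based on the restriction list above.

```python
from functools import reduce

# Hypothetical bit flags for memory types; these are not UCC's real enum values.
MEM_HOST, MEM_CUDA, MEM_ROCM = 1, 2, 4


def simulated_allreduce_bor(contributions):
    """Stand-in for a small bitwise-OR allreduce: every rank gets the same result."""
    combined = reduce(lambda a, b: a | b, contributions, 0)
    return [combined] * len(contributions)


def agree_on_transport(local_mem_types):
    """Return the transport each rank would pick after the agreement step."""
    seen = simulated_allreduce_bor(local_mem_types)
    choices = []
    for mask in seen:
        if mask == MEM_CUDA:
            choices.append("NCCL")  # every rank reported CUDA memory
        elif mask == MEM_ROCM:
            choices.append("RCCL")  # every rank reported ROCm memory
        else:
            choices.append("UCP")   # host or mixed memory: assumed common fallback
    return choices


# Half the ranks hold CUDA buffers and half hold ROCm buffers.
print(agree_on_transport([MEM_CUDA, MEM_ROCM, MEM_CUDA, MEM_ROCM]))
```

Because every rank decides from the same reduced value, the ranks cannot diverge in their choice, at the cost of one extra small collective per operation.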
Sure. The deadlock in this case is similar to the one we fixed in this PR for weak asymmetric memory: https://github.com/openucx/ucc/pull/1000. Basically, UCC tries to choose the best transport (UCP, SHM, CUDA, NCCL, RCCL, etc.) based on several factors, including the memory type of the collective. So one rank might select NCCL for an allreduce because it sees CUDA memory, while another chooses RCCL because it sees ROCm memory. This transport selection mismatch results in a deadlock.
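A toy model of the mismatch described above, assuming each rank picks a transport purely from its local memory type. The `PREFERRED` table, `select_transport`, and `check_selection` are hypothetical and far simpler than UCC's real selection logic, which weighs several factors.

```python
# Hypothetical preference table; UCC's real selection considers more factors.
PREFERRED = {"cuda": "NCCL", "rocm": "RCCL", "host": "UCP"}


def select_transport(mem_type):
    """Pick a transport from the local memory type only, with no global view."""
    return PREFERRED[mem_type]


def check_selection(rank_mem_types):
    """Return each rank's choice and whether all ranks chose the same transport."""
    choices = [select_transport(m) for m in rank_mem_types]
    return choices, len(set(choices)) == 1


# Two ranks see CUDA memory and two see ROCm memory.
choices, consistent = check_selection(["cuda", "cuda", "rocm", "rocm"])
print(choices, "consistent" if consistent else "mismatch: collective would hang")
```

With a mixed set of ranks the choices split between NCCL and RCCL, which is the situation that leads to the deadlock.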
Hi UCC maintainers,
I wonder whether UCC could support collective communication among Nvidia and AMD GPUs in one ML workload. For example, the collective ring might consist of half Nvidia and half AMD GPUs.
Best, Yang