openucx / ucc

Unified Collective Communication Library
https://openucx.github.io/ucc/
BSD 3-Clause "New" or "Revised" License
200 stars, 97 forks

Question about communication between Nvidia and AMD GPUs #1039

Open YangZhou1997 opened 2 days ago

YangZhou1997 commented 2 days ago

Hi ucc maintainer,

I just wonder if ucc could support collective communications among Nvidia and AMD GPUs in one ML workload. Say the collective ring has half Nvidia and half AMD GPUs.

Best, Yang

Sergei-Lebedev commented 1 day ago

Hi @YangZhou1997

UCC can theoretically support collective communication across Nvidia and AMD GPUs in a single workload, but with some key restrictions:

  1. It will only work with TL UCP and TL SHARP. Other transports aren’t compatible due to non-homogeneous memory, which can cause deadlocks.
  2. For reduction collectives, the local source and destination buffers on each rank must have the same memory type.
  3. Deadlocks from memory mismatches could be avoided by running a small allreduce before each collective.

While possible, this setup hasn’t been tested and would require careful handling to ensure stability.
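
To illustrate restriction 3, here is a minimal sketch of the idea, not UCC code. It simulates a small allreduce (bitwise OR) over per-rank memory-type flags so that every rank sees the same combined mask and can therefore make the same transport choice. The flag values, `simulated_allreduce_bor`, and `agreed_transport` are hypothetical names made up for this example, and the fallback policy is an assumption.

```python
from functools import reduce

# Hypothetical bit flags for memory types; these are NOT UCC constants.
MEM_HOST, MEM_CUDA, MEM_ROCM = 1, 2, 4


def simulated_allreduce_bor(values):
    """Stand-in for a tiny bitwise-OR allreduce: every rank gets the same result."""
    combined = reduce(lambda a, b: a | b, values)
    return [combined] * len(values)


def agreed_transport(mem_mask):
    """Map the agreed mask to one transport name (illustrative policy only)."""
    if mem_mask == MEM_CUDA:
        return "nccl"
    if mem_mask == MEM_ROCM:
        return "rccl"
    # Mixed or host memory: fall back to a transport every rank can use.
    return "ucp"


# A ring with half CUDA ranks and half ROCm ranks, as in the question.
local_mem = [MEM_CUDA, MEM_CUDA, MEM_ROCM, MEM_ROCM]
masks = simulated_allreduce_bor(local_mem)
choices = [agreed_transport(m) for m in masks]
print(choices)
```

Because the decision is made from the reduced mask rather than from each rank's local view, all ranks end up with the same choice.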

YangZhou1997 commented 1 day ago

Thank you Sergey for your quick response! That's super helpful. Could you tell me more about the deadlock, or are there any materials I can read through?

Best, Yang

Sergei-Lebedev commented 1 day ago

Sure. The deadlock in this case is similar to the one we fixed for weak asymmetric memory in this PR: https://github.com/openucx/ucc/pull/1000. Basically, UCC tries to choose the best transport (UCP, SHM, CUDA, NCCL, RCCL, etc.) based on several factors, including the memory type of the collective. So one rank might select NCCL for an allreduce because it sees CUDA memory, while another chooses RCCL because it sees ROCm memory. This transport selection mismatch results in a deadlock.
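
As a rough illustration of that mismatch, the toy model below lets each rank pick a transport from its local memory type alone and then checks whether the ranks agree. It is only a sketch: the `PREFERRED` table, `select_transport`, and `find_mismatch` are hypothetical and do not reflect UCC's actual scoring or selection code.

```python
# Hypothetical per-rank preference table; NOT UCC's real selection logic.
PREFERRED = {"cuda": "nccl", "rocm": "rccl", "host": "ucp"}


def select_transport(mem_type):
    """Each rank decides using only the memory type it sees locally."""
    return PREFERRED[mem_type]


def find_mismatch(rank_mem_types):
    """Return per-rank choices and whether the ranks disagree."""
    choices = {rank: select_transport(m) for rank, m in enumerate(rank_mem_types)}
    mismatched = len(set(choices.values())) > 1
    return choices, mismatched


# Half of the ring sees CUDA memory, the other half sees ROCm memory.
ring = ["cuda", "cuda", "rocm", "rocm"]
choices, mismatched = find_mismatch(ring)
print(choices, mismatched)
```

In a real run the disagreeing ranks would each wait inside a different library for peers that never join, which is why the collective hangs instead of reporting an error.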