openucx / ucc

Unified Collective Communication Library
https://openucx.github.io/ucc/
BSD 3-Clause "New" or "Revised" License

Failing PyTorch Collective Operations on Non-Homogeneous Ring (NVIDIA and AMD GPUs) #1045

Open RafalSiwek opened 3 hours ago

RafalSiwek commented 3 hours ago

Hi UCC Team,

Following the resolution of my initial issue #1034, I am now extending the proof-of-concept (PoC) to test distributed ML workflows using PyTorch with a heterogeneous setup: g4ad.xlarge (AMD ROCm) and g4dn.xlarge (NVIDIA CUDA) instances.

(All relevant code, log outputs, and observations providing additional context are available here: https://github.com/RafalSiwek/troubleshoot-heterogenous-distributed-operations.)

Summary of Setup

Software and Hardware Configuration

Observed Behavior

  1. Bi-Directional Communication (test code here): Running a simple bidirectional send_recv test in PyTorch succeeded, confirming basic point-to-point communication between the GPUs (logs available here). A minimal sketch of the shape of both tests follows this list.

  2. Allreduce Operation (test code here): Running the allreduce test failed, with the error surfacing in ucp_tag_send_nbx (logs and stack traces are in the linked repository).
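
For context, the rough shape of these two tests is sketched below (a minimal sketch only; the real test code is in the linked repository). It assumes a UCC-enabled PyTorch build and a launcher that sets MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE:

```python
import torch
import torch.distributed as dist


def main():
    # RANK / WORLD_SIZE / MASTER_ADDR / MASTER_PORT are assumed to come from the launcher.
    dist.init_process_group(backend="ucc")
    rank = dist.get_rank()
    peer = 1 - rank  # two ranks: one CUDA node, one ROCm node
    # ROCm builds of PyTorch also expose the GPU through the "cuda" device namespace.
    device = torch.device("cuda", 0)

    # 1. Bi-directional send_recv: each rank sends its own buffer and receives the peer's.
    send_buf = torch.full((4,), float(rank), device=device)
    recv_buf = torch.empty(4, device=device)
    reqs = [dist.isend(send_buf, dst=peer), dist.irecv(recv_buf, src=peer)]
    for req in reqs:
        req.wait()
    print(f"rank {rank} received {recv_buf.tolist()} from rank {peer}")

    # 2. Allreduce: this is the step where the failure in ucp_tag_send_nbx shows up.
    x = torch.ones(4, device=device) * (rank + 1)
    dist.all_reduce(x, op=dist.ReduceOp.SUM)
    print(f"rank {rank} allreduce result: {x.tolist()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```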

Request for Technical Insights

To better understand and resolve the failure in ucp_tag_send_nbx during the allreduce operation, I would appreciate guidance on the following:

  1. Potential Causes of ucp_tag_send_nbx Failures: Could you provide technical insights into why the ucp_tag_send_nbx operation might fail within a mixed GPU environment (CUDA and ROCm) under UCC? I have examined logs and stack traces, but a deeper understanding of specific communication or memory operations that might impact ucp_tag_send_nbx in heterogeneous setups would be helpful.

  2. Additional Diagnostic Tests: If there are specific configurations, environment variables, or diagnostic flags that could help reveal more details about the UCX and UCC behaviors in this setup, I would be glad to run further tests.
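
For example, something along these lines (a sketch only; I am assuming UCX_LOG_LEVEL, UCC_LOG_LEVEL, and UCX_TLS are the relevant knobs and that they are picked up as long as they are set before the process group is created):

```python
# Sketch of the diagnostic settings I have in mind (assumptions noted above).
import os

# Increase logging verbosity for UCX and UCC; "debug"/"trace" may require a
# UCX build with logging enabled, so "info" is used here as a safer default.
os.environ.setdefault("UCX_LOG_LEVEL", "info")
os.environ.setdefault("UCC_LOG_LEVEL", "info")

# Optionally restrict UCX transports to narrow down which path triggers the
# failure (the transport list here is illustrative, not a recommendation):
# os.environ.setdefault("UCX_TLS", "tcp,cuda_copy,rocm_copy")

import torch.distributed as dist

dist.init_process_group(backend="ucc")  # settings above must be in place before this call
```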

Thank you for your continued support and assistance with this project!

edgargabriel commented 2 hours ago

@RafalSiwek let me comment on the part that I am confident about: I don't think the MPI collective without UCC can work (at least not for reductions); you might see different components being selected for the process running on the cuda-ip node vs. the process running on the rocm-ip node. In my opinion, UCC using tl/ucp is your best (only?) choice at the moment for this configuration.

Regarding the ucp_tag_send_nbx failure: I am not entirely sure, since the simple send-recv test worked. I would recommend trying to run something like the osu_latency or osu_bw benchmark across the two nodes; that would probably stress the system/UCX a bit more than just a single message in both directions. If the osu_latency benchmark using device memory works, there is a reasonable chance that the UCX side of the software stack is working.