RafalSiwek opened this issue 3 hours ago
@RafalSiwek let me comment on the part that I am confident about: I don't think the MPI collectives without UCC can work (at least not for reductions): you might see different components being selected for the process running on the cuda-ip node vs. the process running on the rocm-ip node. In my opinion, UCC using tl/ucp is your best (only?) choice at the moment for this configuration.
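For reference, a minimal sketch of what pinning the collectives to UCC's tl/ucp could look like from the PyTorch side is below. This is an assumption-laden starting point, not a verified recipe: UCC_CL_BASIC_TLS is the knob from the UCC documentation for restricting the basic CL to specific transport layers, and the "ucc" backend string assumes a PyTorch build that was compiled with the UCC backend.

```python
import os

# Assumption: these knobs are read when UCC initializes, so they are set
# before torch.distributed creates the process group. UCC_CL_BASIC_TLS
# restricts the basic CL to the listed transport layers (here, tl/ucp).
os.environ.setdefault("UCC_CL_BASIC_TLS", "ucp")

import torch.distributed as dist

def init_ucc(rank: int, world_size: int) -> None:
    # Requires a PyTorch build that includes the UCC backend; MASTER_ADDR and
    # MASTER_PORT are expected in the environment for the default env:// init.
    dist.init_process_group(backend="ucc", rank=rank, world_size=world_size)

# For the plain MPI path, the rough equivalent would be forcing Open MPI's
# coll/ucc component at launch time, e.g.:
#   mpirun --mca coll_ucc_enable 1 --mca coll_ucc_priority 100 ...
# (flag names from Open MPI's UCC integration).
```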
Regarding the ucp_tag_send_nbx failure: I am not entirely sure, since the simple send-recv test worked. I would recommend trying to run something like the osu_latency or osu_bw benchmark across the two nodes; that would probably stress the system/UCX a bit more in this scenario than just a single message in both directions. If the osu_latency benchmark using device memory works, there is a reasonable chance that the UCX side of the software stack is working.
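While the OSU benchmarks are being set up, a rough PyTorch-side analogue of an osu_bw-style sweep (many repeated sends of growing device buffers) might look like the sketch below. It assumes the process group is already initialized with the ucc backend and that torch.cuda maps to the ROCm device on the AMD rank, which is how PyTorch's ROCm builds behave; it is not a substitute for the actual benchmarks.

```python
import torch
import torch.distributed as dist

def p2p_sweep(rank: int, device: torch.device, iters: int = 100) -> None:
    # Sweep message sizes from 4 KiB up to 64 MiB, osu_bw-style: repeated
    # one-directional sends stress UCX more than a single message exchange.
    for nbytes in [2 ** p for p in range(12, 27, 2)]:
        buf = torch.ones(nbytes // 4, dtype=torch.float32, device=device)
        dist.barrier()
        for _ in range(iters):
            if rank == 0:
                dist.send(buf, dst=1)
            else:
                dist.recv(buf, src=0)
        torch.cuda.synchronize(device)
        if rank == 0:
            print(f"completed {iters} sends of {nbytes} bytes")
```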
Hi UCC Team,
Following the resolution of my initial issue #1034, I am now extending the proof-of-concept (PoC) to test distributed ML workflows using PyTorch with a heterogeneous setup: a g4ad.xlarge (AMD ROCm) instance and a g4dn.xlarge (NVIDIA CUDA) instance. (All relevant code, log outputs, and observations providing additional context can be found here: https://github.com/RafalSiwek/troubleshoot-heterogenous-distributed-operations.)
Summary of Setup

Software and Hardware Configuration
- g4ad.xlarge instance with an AMD Radeon Pro V520 (RDNA1 architecture, gfx1011 shader) and ROCm 6.2.2.
- g4dn.xlarge instance with an NVIDIA T4 GPU (Turing architecture) and CUDA 12.4.

Observed Behavior
- Bi-Directional Communication (test code here): Running a basic bidirectional send_recv test in PyTorch was successful, confirming basic communication across the GPUs (logs available here).
- Allreduce Operation (test code here):
  - The allreduce operation fails consistently in the ucp_tag_send_nbx function for both ranks in PyTorch (logs and backtrace available here).
  - allreduce completes successfully on the CUDA rank but fails on the ROCm rank (logs and backtrace available here).
  (A minimal sketch of this kind of allreduce test follows this list.)
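The sketch below is a simplified stand-in for the kind of allreduce test described above, not the exact code from the linked repo; it assumes the ucc backend and that "cuda:0" resolves to the ROCm device on the AMD rank.

```python
import torch
import torch.distributed as dist

def run_allreduce(rank: int, world_size: int) -> None:
    # MASTER_ADDR / MASTER_PORT are expected in the environment.
    dist.init_process_group(backend="ucc", rank=rank, world_size=world_size)
    device = torch.device("cuda:0")  # maps to the ROCm device on the AMD rank
    tensor = torch.full((1024,), float(rank + 1), device=device)
    # The reported failure occurs inside this collective, in ucp_tag_send_nbx
    # according to the backtraces.
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
    torch.cuda.synchronize()
    print(f"rank {rank}: allreduce result first element = {tensor[0].item()}")
    dist.destroy_process_group()
```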
Request for Technical Insights

To better understand and resolve the failure in ucp_tag_send_nbx during the allreduce operation, I would appreciate guidance on the following:

- Potential Causes of ucp_tag_send_nbx Failures: Could you provide technical insights into why the ucp_tag_send_nbx operation might fail within a mixed GPU environment (CUDA and ROCm) under UCC? I have examined logs and stack traces, but a deeper understanding of specific communication or memory operations that might impact ucp_tag_send_nbx in heterogeneous setups would be helpful.
- Additional Diagnostic Tests: If there are specific configurations, environment variables, or diagnostic flags that could help reveal more details about the UCX and UCC behaviors in this setup, I would be glad to run further tests.
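As an example of what I can enable on both ranks, a run with verbose UCX/UCC logging might be configured as below; the variable names are taken from the UCX and UCC documentation and may depend on the installed versions, so please point me to different knobs if they are more useful here.

```python
import os

# Verbose logging knobs set on both ranks before initializing the process
# group (names from UCX/UCC documentation; availability may vary by version):
os.environ["UCX_LOG_LEVEL"] = "debug"         # UCX transport-level tracing
os.environ["UCX_LOG_FILE"] = "ucx_%h_%p.log"  # one log file per host/process
os.environ["UCC_LOG_LEVEL"] = "debug"         # UCC core logging
os.environ["UCC_COLL_TRACE"] = "debug"        # per-collective tracing, if supported
```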
Thank you for your continued support and assistance with this project!