openucx / ucc

Unified Collective Communication Library
https://openucx.github.io/ucc/
BSD 3-Clause "New" or "Revised" License

Failing PyTorch Collective Operations on Non-Homogeneous Ring (NVIDIA and AMD GPUs) #1045

Open RafalSiwek opened 3 hours ago

RafalSiwek commented 3 hours ago

Hi UCC Team,

Following the resolution of my initial issue #1034, I am now extending the proof-of-concept (PoC) to test distributed ML workflows using PyTorch with a heterogeneous setup: g4ad.xlarge (AMD ROCm) and g4dn.xlarge (NVIDIA CUDA) instances.

(All relevant code, log outputs, and observations providing additional context are available here: https://github.com/RafalSiwek/troubleshoot-heterogenous-distributed-operations.)

Summary of Setup

Software and Hardware Configuration

Observed Behavior

  1. Bi-Directional Communication (test code here): Running a simple bidirectional send_recv test in PyTorch succeeded, confirming basic point-to-point communication between the GPUs (logs available here). A minimal sketch of the shape of both tests follows this list.

  2. Allreduce Operation (test code here): Running the allreduce test failed, with the error surfacing in ucp_tag_send_nbx (logs and stack traces are in the linked repository).
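
For context, the rough shape of these two tests is sketched below (a minimal sketch only; the real test code is in the linked repository). It assumes a UCC-enabled PyTorch build and a launcher that sets MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE:

```python
import torch
import torch.distributed as dist


def main():
    # RANK / WORLD_SIZE / MASTER_ADDR / MASTER_PORT are assumed to come from the launcher.
    dist.init_process_group(backend="ucc")
    rank = dist.get_rank()
    peer = 1 - rank  # two ranks: one CUDA node, one ROCm node
    # ROCm builds of PyTorch also expose the GPU through the "cuda" device namespace.
    device = torch.device("cuda", 0)

    # 1. Bi-directional send_recv: each rank sends its own buffer and receives the peer's.
    send_buf = torch.full((4,), float(rank), device=device)
    recv_buf = torch.empty(4, device=device)
    reqs = [dist.isend(send_buf, dst=peer), dist.irecv(recv_buf, src=peer)]
    for req in reqs:
        req.wait()
    print(f"rank {rank} received {recv_buf.tolist()} from rank {peer}")

    # 2. Allreduce: this is the step where the failure in ucp_tag_send_nbx shows up.
    x = torch.ones(4, device=device) * (rank + 1)
    dist.all_reduce(x, op=dist.ReduceOp.SUM)
    print(f"rank {rank} allreduce result: {x.tolist()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```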

Request for Technical Insights

To better understand and resolve the failure in ucp_tag_send_nbx during the allreduce operation, I would appreciate guidance on the following:

  1. Potential Causes of ucp_tag_send_nbx Failures: Could you provide technical insights into why the ucp_tag_send_nbx operation might fail within a mixed GPU environment (CUDA and ROCm) under UCC? I have examined logs and stack traces, but a deeper understanding of specific communication or memory operations that might impact ucp_tag_send_nbx in heterogeneous setups would be helpful.

  2. Additional Diagnostic Tests: If there are specific configurations, environment variables, or diagnostic flags that could help reveal more details about the UCX and UCC behaviors in this setup, I would be glad to run further tests.
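
For example, something along these lines (a sketch only; I am assuming UCX_LOG_LEVEL, UCC_LOG_LEVEL, and UCX_TLS are the relevant knobs and that they are picked up as long as they are set before the process group is created):

```python
# Sketch of the diagnostic settings I have in mind (assumptions noted above).
import os

# Increase logging verbosity for UCX and UCC; "debug"/"trace" may require a
# UCX build with logging enabled, so "info" is used here as a safer default.
os.environ.setdefault("UCX_LOG_LEVEL", "info")
os.environ.setdefault("UCC_LOG_LEVEL", "info")

# Optionally restrict UCX transports to narrow down which path triggers the
# failure (the transport list here is illustrative, not a recommendation):
# os.environ.setdefault("UCX_TLS", "tcp,cuda_copy,rocm_copy")

import torch.distributed as dist

dist.init_process_group(backend="ucc")  # settings above must be in place before this call
```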

Thank you for your continued support and assistance with this project!

edgargabriel commented 2 hours ago

@RafalSiwek let me comment on the part that I am confident about: I don't think the MPI collective without UCC can work (at least not for reductions); you might see different components being selected for the process running on the cuda-ip node vs. the process running on the rocm-ip node. In my opinion, UCC using tl/ucp is your best (only?) choice at the moment for this configuration.

Regarding the ucp_tag_send_nbx failure: I am not entirely sure, since the simple send-recv test worked. I would recommend trying to run something like the osu_latency or osu_bw benchmark across the two nodes; that would probably stress the system/UCX a bit more than just a single message in both directions. If the osu_latency benchmark using device memory works, there is a reasonable chance that the UCX side of the software stack is working.