snarayan21 opened 10 months ago
Can you please also post the error itself?
Knew I was forgetting something :) Updated the description above!
Do you think this may be due to using RoCE?
@snarayan21 Can you post the output of ucx_info -v?
Is it the case that you're passing cudaMallocAsync memory or CUDA VMM memory to the bcast operation? The following symptom is generally seen for cudaMallocAsync/VMM memory:
[1700266166.002539] [e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999:0] cuda_copy_md.c:341 UCX ERROR cuMemGetAddressRange(0x7f5b05e00000) error: invalid device context
cudaMallocAsync memory is supported in v1.15.x, but VMM memory isn't.
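If it helps to check: recent PyTorch releases can report which caching-allocator backend is active. A minimal probe (a sketch, assuming a PyTorch new enough to have torch.cuda.get_allocator_backend()):

import torch

# Reports "native" (the cudaMalloc-based caching allocator) or
# "cudaMallocAsync", depending on PYTORCH_CUDA_ALLOC_CONF.
print(torch.cuda.get_allocator_backend())

Also note that enabling expandable_segments in PYTORCH_CUDA_ALLOC_CONF makes the native allocator use the CUDA VMM APIs (cuMemCreate/cuMemMap), which would fall into the unsupported VMM case described above.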
Here's the output of ucx_info -v:
# Library version: 1.15.0
# Library path: /opt/hpcx/ucx/lib/libucs.so.0
# API headers version: 1.15.0
# Git branch '', revision bf8f1b6
# Configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check --enable-mt --without-knem --with-xpmem=/hpc/local/oss/xpmem/v2.7.1 --without-java --enable-devel-headers --with-fuse3-static --with-cuda=/hpc/local/oss/cuda12.1.1 --with-gdrcopy --prefix=/build-result/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ucx/mt --with-bfd=/hpc/local/oss/binutils/2.37
I'm not entirely sure -- I'm just using the UCC backend with PyTorch, via the NVIDIA PyTorch images here: https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/index.html
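Concretely, the setup is along these lines (a minimal sketch, not the exact training script; it assumes OpenMPI's OMPI_COMM_WORLD_* environment variables and PyTorch's experimental "ucc" process-group backend):

import os
import torch
import torch.distributed as dist

# Rank and world size come from the OpenMPI launcher's environment.
rank = int(os.environ["OMPI_COMM_WORLD_RANK"])
world_size = int(os.environ["OMPI_COMM_WORLD_SIZE"])

# Placeholder rendezvous settings for the process group.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

torch.cuda.set_device(rank)  # one GPU per rank
dist.init_process_group("ucc", rank=rank, world_size=world_size)

# A collective on a CUDA buffer, of the kind that hits the UCX error.
t = torch.ones(4, device="cuda")
dist.broadcast(t, src=0)
dist.destroy_process_group()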
Describe the bug
I'm using CUDA-aware OpenMPI built on UCX (from one of NVIDIA's PyTorch images, which install UCX as part of HPC-X) to perform collectives between GPUs. I'm consistently running into the error below and have been unable to resolve it. Solutions I have tried:
torch.cuda.set_device(rank)
I'm not sure what's going wrong and would greatly appreciate assistance here!
Error message and stack trace:
Steps to Reproduce
mpirun --allow-run-as-root -np 8 python myscript.py
UCX was configured with:
--disable-logging --disable-debug --disable-assertions --disable-params-check --enable-mt --without-knem --without-java --enable-devel-headers --with-cuda=/usr/local/cuda --with-gdrcopy=/workspace --prefix=/opt/hpcx/ucx
UCX_TLS = cma,cuda,cuda_copy,cuda_ipc,mm,posix,self,shm,sm,sysv,tcp
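For what it's worth, UCX_TLS can also be narrowed to a subset of transports to isolate which one triggers the error. A sketch (the subset shown is only an example, and it must take effect before MPI/UCX initialize -- equivalently it can be exported through mpirun):

import os

# Example only: drop cuda_ipc to test whether the CUDA IPC path is the
# one failing in cuMemGetAddressRange. Must run before MPI/UCX init.
os.environ["UCX_TLS"] = "tcp,self,sm,cuda_copy"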
Setup and versions
lsmod | grep gdrdrv gives me:
gdrdrv 24576 0
nvidia 56512512 523 nvidia_uvm,nvidia_peermem,gdrdrv,nvidia_modeset
Additional information (depending on the issue)
Output of ucx_info -d to show transports and devices recognized by UCX: