openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org

Invalid Device Context and Seg Fault with UCX+MPI+PyTorch #9498

Open snarayan21 opened 10 months ago

snarayan21 commented 10 months ago

Describe the bug

I'm using CUDA-aware OpenMPI that uses UCX (from one of NVIDIA's PyTorch images, which has UCX installed as part of HPC-X) to perform collectives between GPUs. I'm consistently running into the error below and have been unable to solve it. Solutions I have tried:

I'm not sure what would be going wrong and would greatly appreciate assistance here!
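For reference, a minimal sketch of the kind of call that exercises this path (illustrative only, not my exact training code; it assumes the torch.distributed MPI backend that shows up in the backtrace, and the tensor size and rank-to-device mapping are placeholders):

# Minimal sketch: broadcast a CUDA tensor through torch.distributed's MPI backend,
# the path visible in the backtrace
# (ProcessGroupMPI -> MPI_Bcast -> mca_pml_ucx -> ucp_tag_send_nbx).
import torch
import torch.distributed as dist

dist.init_process_group(backend="mpi")                    # CUDA-aware OpenMPI/UCX from HPC-X underneath
local_rank = dist.get_rank() % torch.cuda.device_count()  # illustrative rank-to-GPU mapping
torch.cuda.set_device(local_rank)

t = torch.ones(1024, device="cuda")                       # GPU buffer handed to the collective
dist.broadcast(t, src=0)                                  # ends up as MPI_Bcast on a CUDA pointer
dist.destroy_process_group()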

Error message and stack trace:

[1700266166.002539] [e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999:0]    cuda_copy_md.c:341  UCX  ERROR cuMemGetAddressRange(0x7f5b05e00000) error: invalid device context
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999:0:768657] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7f5b05e00000)
==== backtrace (tid: 768657) ====
 0  /opt/hpcx/ucx/lib/libucs.so.0(ucs_handle_error+0x2e4) [0x7f5c1eae82b4]
 1  /opt/hpcx/ucx/lib/libucs.so.0(+0x304af) [0x7f5c1eae84af]
 2  /opt/hpcx/ucx/lib/libucs.so.0(+0x30796) [0x7f5c1eae8796]
 3  /lib/x86_64-linux-gnu/libc.so.6(+0x1a6a72) [0x7f605c51ca72]
 4  /opt/hpcx/ucx/lib/libuct.so.0(uct_mm_ep_am_short+0x93) [0x7f5c1ea959e3]
 5  /opt/hpcx/ucx/lib/libucp.so.0(+0x8ee9d) [0x7f5c1ebade9d]
 6  /opt/hpcx/ucx/lib/libucp.so.0(ucp_tag_send_nbx+0x735) [0x7f5c1ebb9365]
 7  /opt/hpcx/ompi/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_isend+0xab) [0x7f5c2403627b]
 8  /opt/hpcx/ompi/lib/libmpi.so.40(ompi_coll_base_bcast_intra_generic+0x14b) [0x7f605bb4582b]
 9  /opt/hpcx/ompi/lib/libmpi.so.40(ompi_coll_base_bcast_intra_bintree+0xc2) [0x7f605bb45ed2]
10  /opt/hpcx/ompi/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_bcast_intra_dec_fixed+0x40) [0x7f5c1deb0840]
11  /opt/hpcx/ompi/lib/libmpi.so.40(MPI_Bcast+0x41) [0x7f605bb20841]
12  /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so(+0x4eed8a4) [0x7f600c8168a4]
13  /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so(_ZN4c10d15ProcessGroupMPI7runLoopEv+0xb2) [0x7f600c81d3d2]
14  /lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f5fb34b0253]
15  /lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f605c40aac3]
16  /lib/x86_64-linux-gnu/libc.so.6(+0x126a40) [0x7f605c49ca40]
=================================
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] *** Process received signal ***
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] Signal: Segmentation fault (11)
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] Signal code:  (-6)
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] Failing at address: 0xbb417
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f605c3b8520]
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x1a6a72)[0x7f605c51ca72]
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] [ 2] /opt/hpcx/ucx/lib/libuct.so.0(uct_mm_ep_am_short+0x93)[0x7f5c1ea959e3]
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] [ 3] /opt/hpcx/ucx/lib/libucp.so.0(+0x8ee9d)[0x7f5c1ebade9d]
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] [ 4] /opt/hpcx/ucx/lib/libucp.so.0(ucp_tag_send_nbx+0x735)[0x7f5c1ebb9365]
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] [ 5] /opt/hpcx/ompi/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_isend+0xab)[0x7f5c2403627b]
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] [ 6] /opt/hpcx/ompi/lib/libmpi.so.40(ompi_coll_base_bcast_intra_generic+0x14b)[0x7f605bb4582b]
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] [ 7] /opt/hpcx/ompi/lib/libmpi.so.40(ompi_coll_base_bcast_intra_bintree+0xc2)[0x7f605bb45ed2]
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] [ 8] /opt/hpcx/ompi/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_bcast_intra_dec_fixed+0x40)[0x7f5c1deb0840]
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] [ 9] /opt/hpcx/ompi/lib/libmpi.so.40(MPI_Bcast+0x41)[0x7f605bb20841]
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] [10] /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so(+0x4eed8a4)[0x7f600c8168a4]
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] [11] /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so(_ZN4c10d15ProcessGroupMPI7runLoopEv+0xb2)[0x7f600c81d3d2]
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] [12] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253)[0x7f5fb34b0253]
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] [13] /lib/x86_64-linux-gnu/libc.so.6(+0x94ac3)[0x7f605c40aac3]
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] [14] /lib/x86_64-linux-gnu/libc.so.6(+0x126a40)[0x7f605c49ca40]
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] *** End of error message ***

Steps to Reproduce

Setup and versions

Additional information (depending on the issue)

brminich commented 10 months ago

Can you please also post the error itself?

snarayan21 commented 10 months ago

Knew I was forgetting something :) I've updated the description above!

snarayan21 commented 10 months ago

Do you think this may be due to using RoCE?

Akshay-Venkatesh commented 10 months ago

@snarayan21 Can you post the output of ucx_info -v?

Is it the case that you're passing cudaMallocAsync memory or CUDA VMM memory to the bcast operation? The following symptom is generally seen for cudaMallocAsync/VMM memory:

[1700266166.002539] [e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999:0]    cuda_copy_md.c:341  UCX  ERROR cuMemGetAddressRange(0x7f5b05e00000) error: invalid device context

Using cudaMallocAsync memory is supported in v1.15.x, but VMM memory isn't supported.
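One way to check this from the PyTorch side (a sketch, assuming a PyTorch build recent enough to expose torch.cuda.get_allocator_backend(), which recent NGC images should be) is to query which CUDA caching-allocator backend the tensors come from:

# Sketch: check whether PyTorch hands out cudaMallocAsync (stream-ordered) memory
# or plain cudaMalloc ("native") memory to the collectives.
import torch

print(torch.cuda.get_allocator_backend())   # prints "native" or "cudaMallocAsync"

If it reports cudaMallocAsync, setting PYTORCH_CUDA_ALLOC_CONF=backend:native before launching is one way to see whether the allocator is the variable. Note also that enabling expandable_segments via PYTORCH_CUDA_ALLOC_CONF (available in newer PyTorch releases) makes the caching allocator use CUDA VMM mappings, which would fall into the unsupported case above.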

snarayan21 commented 9 months ago

Here's the output of ucx_info -v:

# Library version: 1.15.0
# Library path: /opt/hpcx/ucx/lib/libucs.so.0
# API headers version: 1.15.0
# Git branch '', revision bf8f1b6
# Configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check --enable-mt --without-knem --with-xpmem=/hpc/local/oss/xpmem/v2.7.1 --without-java --enable-devel-headers --with-fuse3-static --with-cuda=/hpc/local/oss/cuda12.1.1 --with-gdrcopy --prefix=/build-result/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ucx/mt --with-bfd=/hpc/local/oss/binutils/2.37

I'm not entirely sure -- I'm just using the UCC backend with PyTorch from the NVIDIA PyTorch images here: https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/index.html