openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org

Invalid Device Context and Seg Fault with UCX+MPI+PyTorch #9498

Open snarayan21 opened 10 months ago

snarayan21 commented 10 months ago

Describe the bug

I'm using CUDA-aware OpenMPI that uses UCX (from one of NVIDIA's PyTorch images, which has UCX installed as part of HPC-X) to perform collectives between GPUs. I'm consistently running into the error below and have been unable to solve it. Solutions I have tried:

I'm not sure what would be going wrong and would greatly appreciate assistance here!
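For reference, a minimal sketch of the kind of call that exercises this path (illustrative only, not my exact training code; it assumes the torch.distributed MPI backend that shows up in the backtrace, and the tensor size and rank-to-device mapping are placeholders):

# Minimal sketch: broadcast a CUDA tensor through torch.distributed's MPI backend,
# the path visible in the backtrace
# (ProcessGroupMPI -> MPI_Bcast -> mca_pml_ucx -> ucp_tag_send_nbx).
import torch
import torch.distributed as dist

dist.init_process_group(backend="mpi")                    # CUDA-aware OpenMPI/UCX from HPC-X underneath
local_rank = dist.get_rank() % torch.cuda.device_count()  # illustrative rank-to-GPU mapping
torch.cuda.set_device(local_rank)

t = torch.ones(1024, device="cuda")                       # GPU buffer handed to the collective
dist.broadcast(t, src=0)                                  # ends up as MPI_Bcast on a CUDA pointer
dist.destroy_process_group()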

Error message and stack trace:

[1700266166.002539] [e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999:0]    cuda_copy_md.c:341  UCX  ERROR cuMemGetAddressRange(0x7f5b05e00000) error: invalid device context
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999:0:768657] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7f5b05e00000)
==== backtrace (tid: 768657) ====
 0  /opt/hpcx/ucx/lib/libucs.so.0(ucs_handle_error+0x2e4) [0x7f5c1eae82b4]
 1  /opt/hpcx/ucx/lib/libucs.so.0(+0x304af) [0x7f5c1eae84af]
 2  /opt/hpcx/ucx/lib/libucs.so.0(+0x30796) [0x7f5c1eae8796]
 3  /lib/x86_64-linux-gnu/libc.so.6(+0x1a6a72) [0x7f605c51ca72]
 4  /opt/hpcx/ucx/lib/libuct.so.0(uct_mm_ep_am_short+0x93) [0x7f5c1ea959e3]
 5  /opt/hpcx/ucx/lib/libucp.so.0(+0x8ee9d) [0x7f5c1ebade9d]
 6  /opt/hpcx/ucx/lib/libucp.so.0(ucp_tag_send_nbx+0x735) [0x7f5c1ebb9365]
 7  /opt/hpcx/ompi/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_isend+0xab) [0x7f5c2403627b]
 8  /opt/hpcx/ompi/lib/libmpi.so.40(ompi_coll_base_bcast_intra_generic+0x14b) [0x7f605bb4582b]
 9  /opt/hpcx/ompi/lib/libmpi.so.40(ompi_coll_base_bcast_intra_bintree+0xc2) [0x7f605bb45ed2]
10  /opt/hpcx/ompi/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_bcast_intra_dec_fixed+0x40) [0x7f5c1deb0840]
11  /opt/hpcx/ompi/lib/libmpi.so.40(MPI_Bcast+0x41) [0x7f605bb20841]
12  /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so(+0x4eed8a4) [0x7f600c8168a4]
13  /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so(_ZN4c10d15ProcessGroupMPI7runLoopEv+0xb2) [0x7f600c81d3d2]
14  /lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f5fb34b0253]
15  /lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f605c40aac3]
16  /lib/x86_64-linux-gnu/libc.so.6(+0x126a40) [0x7f605c49ca40]
=================================
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] *** Process received signal ***
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] Signal: Segmentation fault (11)
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] Signal code:  (-6)
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] Failing at address: 0xbb417
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f605c3b8520]
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x1a6a72)[0x7f605c51ca72]
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] [ 2] /opt/hpcx/ucx/lib/libuct.so.0(uct_mm_ep_am_short+0x93)[0x7f5c1ea959e3]
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] [ 3] /opt/hpcx/ucx/lib/libucp.so.0(+0x8ee9d)[0x7f5c1ebade9d]
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] [ 4] /opt/hpcx/ucx/lib/libucp.so.0(ucp_tag_send_nbx+0x735)[0x7f5c1ebb9365]
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] [ 5] /opt/hpcx/ompi/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_isend+0xab)[0x7f5c2403627b]
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] [ 6] /opt/hpcx/ompi/lib/libmpi.so.40(ompi_coll_base_bcast_intra_generic+0x14b)[0x7f605bb4582b]
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] [ 7] /opt/hpcx/ompi/lib/libmpi.so.40(ompi_coll_base_bcast_intra_bintree+0xc2)[0x7f605bb45ed2]
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] [ 8] /opt/hpcx/ompi/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_bcast_intra_dec_fixed+0x40)[0x7f5c1deb0840]
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] [ 9] /opt/hpcx/ompi/lib/libmpi.so.40(MPI_Bcast+0x41)[0x7f605bb20841]
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] [10] /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so(+0x4eed8a4)[0x7f600c8168a4]
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] [11] /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so(_ZN4c10d15ProcessGroupMPI7runLoopEv+0xb2)[0x7f600c81d3d2]
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] [12] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253)[0x7f5fb34b0253]
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] [13] /lib/x86_64-linux-gnu/libc.so.6(+0x94ac3)[0x7f605c40aac3]
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] [14] /lib/x86_64-linux-gnu/libc.so.6(+0x126a40)[0x7f605c49ca40]
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] *** End of error message ***

Steps to Reproduce

Setup and versions

Additional information (depending on the issue)

brminich commented 10 months ago

Can you please also post the error itself?

snarayan21 commented 10 months ago

Knew I was forgetting something :) I've updated the description above!

snarayan21 commented 10 months ago

Do you think this may be due to using RoCE?

Akshay-Venkatesh commented 10 months ago

@snarayan21 Can you post the output of ucx_info -v?

Is it the case that you're passing cudaMallocAsync memory or CUDA VMM memory to the bcast operation? The following symptom is generally seen for cudaMallocAsync/VMM memory:

[1700266166.002539] [e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999:0]    cuda_copy_md.c:341  UCX  ERROR cuMemGetAddressRange(0x7f5b05e00000) error: invalid device context

Using cudaMallocAsync memory is supported in v1.15.x, but VMM memory isn't supported.
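One way to check this from the PyTorch side (a sketch, assuming a PyTorch build recent enough to expose torch.cuda.get_allocator_backend(), which recent NGC images should be) is to query which CUDA caching-allocator backend the tensors come from:

# Sketch: check whether PyTorch hands out cudaMallocAsync (stream-ordered) memory
# or plain cudaMalloc ("native") memory to the collectives.
import torch

print(torch.cuda.get_allocator_backend())   # prints "native" or "cudaMallocAsync"

If it reports cudaMallocAsync, setting PYTORCH_CUDA_ALLOC_CONF=backend:native before launching is one way to see whether the allocator is the variable. Note also that enabling expandable_segments via PYTORCH_CUDA_ALLOC_CONF (available in newer PyTorch releases) makes the caching allocator use CUDA VMM mappings, which would fall into the unsupported case above.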

snarayan21 commented 9 months ago

Here's the output of ucx_info -v:

# Library version: 1.15.0
# Library path: /opt/hpcx/ucx/lib/libucs.so.0
# API headers version: 1.15.0
# Git branch '', revision bf8f1b6
# Configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check --enable-mt --without-knem --with-xpmem=/hpc/local/oss/xpmem/v2.7.1 --without-java --enable-devel-headers --with-fuse3-static --with-cuda=/hpc/local/oss/cuda12.1.1 --with-gdrcopy --prefix=/build-result/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ucx/mt --with-bfd=/hpc/local/oss/binutils/2.37

I'm not entirely sure -- I'm just using the UCC backend with PyTorch from the NVIDIA PyTorch images here: https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/index.html