rapidsai / cugraph

cuGraph - RAPIDS Graph Analytics Library
https://docs.rapids.ai/api/cugraph/stable/
Apache License 2.0

Issues running C++ multi-GPU test #4601

Closed: sg0 closed this issue 2 months ago

sg0 commented 3 months ago

What is your question?

This is the second part of: https://github.com/rapidsai/cugraph/issues/4596

I am trying to run the multi-GPU test (https://github.com/rapidsai/cugraph/blob/branch-24.10/cpp/examples/users/multi_gpu_application/mg_graph_algorithms.cpp) on a single node; this is my job script:

#!/bin/bash

#SBATCH -t 01:00:00
#SBATCH -N 1
#SBATCH -n 8
#SBATCH --gres=gpu:8
#SBATCH --constraint=nvlink
#SBATCH -p a100
#SBATCH -J CUGXX
#SBATCH -o CUGXX_%A_%a.out
#SBATCH -e CUGXX_%A_%a.err

source /etc/profile.d/modules.sh
module load gcc/12.2.0
module load openmpi/4.1.4
module load cuda/12.1
module load cmake/3.28.1
module load python/miniconda24.4.0

source /share/apps/python/miniconda24.4.0/etc/profile.d/conda.sh
conda activate cugraph-ldgpu2

export LD_LIBRARY_PATH="/people/ghos167/builds/openmpi-4.1.4-cuda12/lib:/people/ghos167/.conda/envs/cugraph-ldgpu2/lib:$LD_LIBRARY_PATH"
ulimit -a
export BIN_PATH="$HOME/proj/cugraph-mg-test/mg-graph"

echo "Multi-GPU cuGraph C++ with MPI"
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

/people/ghos167/builds/openmpi-4.1.4-cuda12/bin/mpirun -np 8 $BIN_PATH/./mg_test

I'm encountering a segfault (from every process):

 4 --------------------------------------------------------------------------
  5 WARNING: There was an error initializing an OpenFabrics device.
  6
  7   Local host:   a100-04
  8   Local device: mlx5_2
  9 --------------------------------------------------------------------------
 10 [a100-04:203942] 7 more processes have sent help message help-mpi-btl-openib.txt / error in device init
 11 [a100-04:203942] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
 12 [a100-04:203950:0:203950] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
 13 terminate called after throwing an instance of 'raft::cuda_error'
 14   what():  CUDA error encountered at: file=/people/ghos167/.conda/envs/cugraph-ldgpu2/include/raft/util/cudart_utils.hpp line=148:
 15 [a100-04:203949] *** Process received signal ***
 16 [a100-04:203949] Signal: Aborted (6)
 17 [a100-04:203949] Signal code:  (-6)
 18 [a100-04:203949] [ 0] /usr/lib64/libpthread.so.0(+0xf630)[0x2b2104b62630]
 19 [a100-04:203949] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x2b2104da5387]
 20 [a100-04:203949] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2b2104da6a78]
 21 [a100-04:203949] [ 3] /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0xc0)[0x2b2104725f9e]
 22 [a100-04:203949] [ 4] /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libstdc++.so.6(+0xb64e2)[0x2b21047244e2]
 23 [a100-04:203949] [ 5] /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libstdc++.so.6(_ZSt10unexpectedv+0x0)[0x2b210471e2e3]
 24 [a100-04:203949] [ 6] /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libstdc++.so.6(__cxa_rethrow+0x0)[0x2b2104724702]
 25 [a100-04:203949] [ 7] /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libcugraph.so(_ZN4raft4copyImEEvPT_PKS1_mN3rmm16cuda_stream_viewE+0x1cc)[0x2b20bae73d5c]
 26 [a100-04:203949] [ 8] /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libcugraph.so(_ZN7cugraph21host_scalar_allgatherImEENSt9enable_ifIXsrSt13is_arithmeticIT_E5valueESt6vectorIS3_SaIS3_EEE4typeERKN4raft5comms7comms_tES3_P11CUstream_st+0x36e)[0x2b20baee183e]
 27 [a100-04:203949] [ 9] /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libcugraph.so(_ZN7cugraph24update_edge_dst_propertyINS_12graph_view_tIiiLb0ELb1EvEEPiN6thrust17constant_iteratorIbNS4_11use_defaultES6_EEEEvRKN4raft8handle_tERKT_T0_SF_T1_RNS_19edge_dst_property_tISC_NSt15iterator_traitsISG_E10value_typeEEEb+0x151)[0x2b20bd705c01]
 28 [a100-04:203949] [10] /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libcugraph.so(_ZN7cugraph6detail3bfsINS_12graph_view_tIiiLb0ELb1EvEEN6thrust16discard_iteratorINS4_11use_defaultEEEEEvRKN4raft8handle_tERKT_PNSC_11vertex_typeET0_PKSF_mbSF_b+0x545)[0x2b20bd709585]
 29 [a100-04:203949] [11] /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libcugraph.so(_ZN7cugraph3bfsIiiLb1EEEvRKN4raft8handle_tERKNS_12graph_view_tIT_T0_Lb0EXT1_EvEEPS6_SB_PKS6_mbS6_b+0x49)[0x2b20bd70d149]
 30 [a100-04:203949] [12] /people/ghos167/proj/cugraph-maximal-matching/mg-graph/./mg_test[0x443bca]
 31 [a100-04:203949] [13] /people/ghos167/proj/cugraph-maximal-matching/mg-graph/./mg_test[0x41b7e6]
 32 [a100-04:203949] [14] /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b2104d91555]
 33 [a100-04:203949] [15] /people/ghos167/proj/cugraph-maximal-matching/mg-graph/./mg_test[0x418be9]
 34 [a100-04:203949] *** End of error message ***
 35 ==== backtrace (tid: 203950) ====
 36  0  /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libucs.so.0(ucs_handle_error+0x2fd) [0x2ac7cf19cfed]
 37  1  /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libucs.so.0(+0x2a1e1) [0x2ac7cf19d1e1]
 38  2  /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libucs.so.0(+0x2a3aa) [0x2ac7cf19d3aa]
 39  3  /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libnccl.so.2(+0x3c245) [0x2ac73b215245]
 40  4  /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libnccl.so.2(+0x3c996) [0x2ac73b215996]
 41  5  /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libnccl.so.2(+0x3db73) [0x2ac73b216b73]
 42  6  /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libnccl.so.2(+0x37cd9) [0x2ac73b210cd9]
 43  7  /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libnccl.so.2(pncclAllGather+0x1cc) [0x2ac73b206a1c]
 44  8  /people/ghos167/proj/cugraph-maximal-matching/mg-graph/./mg_test() [0x42f6e0]
45  9  /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libcugraph.so(_ZN7cugraph21host_scalar_allgatherImEENSt9enable_ifIXsrSt13is_arithmeticIT_E5valueESt6vectorIS3_SaIS3_EEE4typeERKN4raft5comms7comms_tES3_P11CUstream_st+0x2c2) [0x2ac6ff012792]
 46 10  /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libcugraph.so(_ZN7cugraph24update_edge_dst_propertyINS_12graph_view_tIiiLb0ELb1EvEEPiN6thrust17constant_iteratorIbNS4_11use_defaultES6_EEEEvRKN4raft8handle_tERKT_T0_SF_T1_RNS_19edge_dst_property_tISC_NSt15iterator_traitsISG_E10value_typeEEEb+0x151) [0x2ac701836c01]
 47 11  /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libcugraph.so(_ZN7cugraph6detail3bfsINS_12graph_view_tIiiLb0ELb1EvEEN6thrust16discard_iteratorINS4_11use_defaultEEEEEvRKN4raft8handle_tERKT_PNSC_11vertex_typeET0_PKSF_mbSF_b+0x545) [0x2ac70183a585]
 48 12  /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libcugraph.so(_ZN7cugraph3bfsIiiLb1EEEvRKN4raft8handle_tERKNS_12graph_view_tIT_T0_Lb0EXT1_EvEEPS6_SB_PKS6_mbS6_b+0x49) [0x2ac70183e149]
 49 13  /people/ghos167/proj/cugraph-maximal-matching/mg-graph/./mg_test() [0x443bca]
 50 14  /people/ghos167/proj/cugraph-maximal-matching/mg-graph/./mg_test() [0x41b7e6]
 51 15  /usr/lib64/libc.so.6(__libc_start_main+0xf5) [0x2ac748ec2555]
 52 16  /people/ghos167/proj/cugraph-maximal-matching/mg-graph/./mg_test() [0x418be9]
 53 =================================
 54 [a100-04:203950] *** Process received signal ***
 55 [a100-04:203950] Signal: Segmentation fault (11)
 56 [a100-04:203950] Signal code:  (-6)
 57 [a100-04:203950] Failing at address: 0x32bec00031cae
 58 [a100-04:203950] [ 0] /usr/lib64/libpthread.so.0(+0xf630)[0x2ac748c93630]
 59 [a100-04:203950] [ 1] /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libnccl.so.2(+0x3c245)[0x2ac73b215245]
 60 [a100-04:203950] [ 2] /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libnccl.so.2(+0x3c996)[0x2ac73b215996]
 61 [a100-04:203950] [ 3] /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libnccl.so.2(+0x3db73)[0x2ac73b216b73]
 62 [a100-04:203950] [ 4] /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libnccl.so.2(+0x37cd9)[0x2ac73b210cd9]
 63 [a100-04:203950] [ 5] /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libnccl.so.2(pncclAllGather+0x1cc)[0x2ac73b206a1c]
 64 [a100-04:203950] [ 6] /people/ghos167/proj/cugraph-maximal-matching/mg-graph/./mg_test[0x42f6e0]
 65 [a100-04:203950] [ 7] /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libcugraph.so(_ZN7cugraph21host_scalar_allgatherImEENSt9enable_ifIXsrSt13is_arithmeticIT_E5valueESt6vectorIS3_SaIS3_EEE4typeERKN4raft5comms7comms_tES3_P11CUstream_st+0x2c2)[0x2ac6ff012792]
 66 [a100-04:203950] [ 8] /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libcugraph.so(_ZN7cugraph24update_edge_dst_propertyINS_12graph_view_tIiiLb0ELb1EvEEPiN6thrust17constant_iteratorIbNS4_11use_defaultES6_EEEEvRKN4raft8handle_tERKT_T0_SF_T1_RNS_19edge_dst_property_tISC_NSt15iterator_traitsISG_E10value_typeEEEb+0x151)[0x2ac701836c01]
 67 [a100-04:203950] [ 9] /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libcugraph.so(_ZN7cugraph6detail3bfsINS_12graph_view_tIiiLb0ELb1EvEEN6thrust16discard_iteratorINS4_11use_defaultEEEEEvRKN4raft8handle_tERKT_PNSC_11vertex_typeET0_PKSF_mbSF_b+0x545)[0x2ac70183a585]
 68 [a100-04:203950] [10] /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libcugraph.so(_ZN7cugraph3bfsIiiLb1EEEvRKN4raft8handle_tERKNS_12graph_view_tIT_T0_Lb0EXT1_EvEEPS6_SB_PKS6_mbS6_b+0x49)[0x2ac70183e149]
 69 [a100-04:203950] [11] /people/ghos167/proj/cugraph-maximal-matching/mg-graph/./mg_test[0x443bca]
 70 [a100-04:203950] [12] /people/ghos167/proj/cugraph-maximal-matching/mg-graph/./mg_test[0x41b7e6]
 71 [a100-04:203950] [13] /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x2ac748ec2555]
 72 [a100-04:203950] [14] /people/ghos167/proj/cugraph-maximal-matching/mg-graph/./mg_test[0x418be9]
 73 [a100-04:203950] *** End of error message ***
 74 [a100-04:203951:0:203951] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
 75 ==== backtrace (tid: 203951) ====
 76  0  /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libucs.so.0(ucs_handle_error+0x2fd) [0x2b63aa105fed]
 77  1  /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libucs.so.0(+0x2a1e1) [0x2b63aa1061e1]
 78  2  /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libucs.so.0(+0x2a3aa) [0x2b63aa1063aa]
 79  3  /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libnccl.so.2(+0x3c245) [0x2b631617e245]
 80  4  /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libnccl.so.2(+0x3c996) [0x2b631617e996]
 81  5  /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libnccl.so.2(+0x3db73) [0x2b631617fb73]
 82  6  /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libnccl.so.2(+0x37cd9) [0x2b6316179cd9]
 83  7  /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libnccl.so.2(pncclAllGather+0x1cc) [0x2b631616fa1c]
 84  8  /people/ghos167/proj/cugraph-maximal-matching/mg-graph/./mg_test() [0x42f6e0]
 85  9  /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libcugraph.so(_ZN7cugraph21host_scalar_allgatherImEENSt9enable_ifIXsrSt13is_arithmeticIT_E5valueESt6vectorIS3_SaIS3_EEE4typeERKN4raft5comms7comms_tES3_P11CUstream_st+0x2c2) [0x2b62d9f7b792]
 86 10  /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libcugraph.so(_ZN7cugraph24update_edge_dst_propertyINS_12graph_view_tIiiLb0ELb1EvEEPiN6thrust17constant_iteratorIbNS4_11use_defaultES6_EEEEvRKN4raft8handle_tERKT_T0_SF_T1_RNS_19edge_dst_property_tISC_NSt15iterator_traitsISG_E10value_typeEEEb+0x151) [0x2b62dc79fc01]
 87 11  /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libcugraph.so(_ZN7cugraph6detail3bfsINS_12graph_view_tIiiLb0ELb1EvEEN6thrust16discard_iteratorINS4_11use_defaultEEEEEvRKN4raft8handle_tERKT_PNSC_11vertex_typeET0_PKSF_mbSF_b+0x545) [0x2b62dc7a3585]
 88 12  /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libcugraph.so(_ZN7cugraph3bfsIiiLb1EEEvRKN4raft8handle_tERKNS_12graph_view_tIT_T0_Lb0EXT1_EvEEPS6_SB_PKS6_mbS6_b+0x49) [0x2b62dc7a7149]
 89 13  /people/ghos167/proj/cugraph-maximal-matching/mg-graph/./mg_test() [0x443bca]
 90 14  /people/ghos167/proj/cugraph-maximal-matching/mg-graph/./mg_test() [0x41b7e6]
 91 15  /usr/lib64/libc.so.6(__libc_start_main+0xf5) [0x2b6323e2b555]
 92 16  /people/ghos167/proj/cugraph-maximal-matching/mg-graph/./mg_test() [0x418be9]
 93 =================================
 94 [a100-04:203951] *** Process received signal ***
 95 [a100-04:203951] Signal: Segmentation fault (11)
 96 [a100-04:203951] Signal code:  (-6)
 97 [a100-04:203951] Failing at address: 0x32bec00031caf
 98 [a100-04:203951] [ 0] /usr/lib64/libpthread.so.0(+0xf630)[0x2b6323bfc630]
 99 [a100-04:203951] [ 1] /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libnccl.so.2(+0x3c245)[0x2b631617e245]
100 [a100-04:203951] [ 2] /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libnccl.so.2(+0x3c996)[0x2b631617e996]
101 [a100-04:203951] [ 3] /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libnccl.so.2(+0x3db73)[0x2b631617fb73]
102 [a100-04:203951] [ 4] /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libnccl.so.2(+0x37cd9)[0x2b6316179cd9]
103 [a100-04:203951] [ 5] /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libnccl.so.2(pncclAllGather+0x1cc)[0x2b631616fa1c]
104 [a100-04:203951] [ 6] /people/ghos167/proj/cugraph-maximal-matching/mg-graph/./mg_test[0x42f6e0]
105 [a100-04:203951] [ 7] /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libcugraph.so(_ZN7cugraph21host_scalar_allgatherImEENSt9enable_ifIXsrSt13is_arithmeticIT_E5valueESt6vectorIS3_SaIS3_EEE4typeERKN4raft5comms7comms_tES3_P11CUstream_st+0x2c2)[0x2b62d9f7b792]
106 [a100-04:203951] [ 8] /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libcugraph.so(_ZN7cugraph24update_edge_dst_propertyINS_12graph_view_tIiiLb0ELb1EvEEPiN6thrust17constant_iteratorIbNS4_11use_defaultES6_EEEEvRKN4raft8handle_tERKT_T0_SF_T1_RNS_19edge_dst_property_tISC_NSt15iterator_traitsISG_E10value_typeEEEb+0x151)[0x2b62dc79fc01]
107 [a100-04:203951] [ 9] /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libcugraph.so(_ZN7cugraph6detail3bfsINS_12graph_view_tIiiLb0ELb1EvEEN6thrust16discard_iteratorINS4_11use_defaultEEEEEvRKN4raft8handle_tERKT_PNSC_11vertex_typeET0_PKSF_mbSF_b+0x545)[0x2b62dc7a3585]
108 [a100-04:203951] [10] /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libcugraph.so(_ZN7cugraph3bfsIiiLb1EEEvRKN4raft8handle_tERKNS_12graph_view_tIT_T0_Lb0EXT1_EvEEPS6_SB_PKS6_mbS6_b+0x49)[0x2b62dc7a7149]
109 [a100-04:203951] [11] /people/ghos167/proj/cugraph-maximal-matching/mg-graph/./mg_test[0x443bca]
110 [a100-04:203951] [12] /people/ghos167/proj/cugraph-maximal-matching/mg-graph/./mg_test[0x41b7e6]
111 [a100-04:203951] [13] /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b6323e2b555]
112 [a100-04:203951] [14] /people/ghos167/proj/cugraph-maximal-matching/mg-graph/./mg_test[0x418be9]
113 [a100-04:203951] *** End of error message ***
114 terminate called after throwing an instance of 'raft::cuda_error'
115   what():  CUDA error encountered at: file=/people/ghos167/.conda/envs/cugraph-ldgpu2/include/raft/util/cudart_utils.hpp line=148:
116 [a100-04:203946] *** Process received signal ***
117 [a100-04:203946] Signal: Aborted (6)
118 [a100-04:203946] Signal code:  (-6)
119 [a100-04:203946] [ 0] /usr/lib64/libpthread.so.0(+0xf630)[0x2b166a717630]
120 [a100-04:203946] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x2b166a95a387]
121 [a100-04:203946] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2b166a95ba78]
122 [a100-04:203946] [ 3] /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0xc0)[0x2b166a2daf9e]
123 [a100-04:203946] [ 4] /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libstdc++.so.6(+0xb64e2)[0x2b166a2d94e2]
124 [a100-04:203946] [ 5] /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libstdc++.so.6(_ZSt10unexpectedv+0x0)[0x2b166a2d32e3]
125 [a100-04:203946] [ 6] /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libstdc++.so.6(__cxa_rethrow+0x0)[0x2b166a2d9702]
126 [a100-04:203946] [ 7] /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libcugraph.so(_ZN4raft4copyImEEvPT_PKS1_mN3rmm16cuda_stream_viewE+0x1cc)[0x2b1620a28d5c]
127 [a100-04:203946] [ 8] /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libcugraph.so(_ZN7cugraph21host_scalar_allgatherImEENSt9enable_ifIXsrSt13is_arithmeticIT_E5valueESt6vectorIS3_SaIS3_EEE4typeERKN4raft5comms7comms_tES3_P11CUstream_st+0x36e)[0x2b1620a9683e]
128 [a100-04:203946] [ 9] /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libcugraph.so(_ZN7cugraph24update_edge_dst_propertyINS_12graph_view_tIiiLb0ELb1EvEEPiN6thrust17constant_iteratorIbNS4_11use_defaultES6_EEEEvRKN4raft8handle_tERKT_T0_SF_T1_RNS_19edge_dst_property_tISC_NSt15iterator_traitsISG_E10value_typeEEEb+0x151)[0x2b16232bac01]
129 [a100-04:203946] [10] /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libcugraph.so(_ZN7cugraph6detail3bfsINS_12graph_view_tIiiLb0ELb1EvEEN6thrust16discard_iteratorINS4_11use_defaultEEEEEvRKN4raft8handle_tERKT_PNSC_11vertex_typeET0_PKSF_mbSF_b+0x545)[0x2b16232be585]
130 [a100-04:203946] [11] /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libcugraph.so(_ZN7cugraph3bfsIiiLb1EEEvRKN4raft8handle_tERKNS_12graph_view_tIT_T0_Lb0EXT1_EvEEPS6_SB_PKS6_mbS6_b+0x49)[0x2b16232c2149]
131 [a100-04:203946] [12] /people/ghos167/proj/cugraph-maximal-matching/mg-graph/./mg_test[0x443bca]
132 [a100-04:203946] [13] /people/ghos167/proj/cugraph-maximal-matching/mg-graph/./mg_test[0x41b7e6]
133 [a100-04:203946] [14] /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b166a946555]
134 [a100-04:203946] [15] /people/ghos167/proj/cugraph-maximal-matching/mg-graph/./mg_test[0x418be9]
135 [a100-04:203946] *** End of error message ***
136 --------------------------------------------------------------------------
137 Primary job  terminated normally, but 1 process returned
138 a non-zero exit code. Per user-direction, the job has been aborted.
139 --------------------------------------------------------------------------
140 --------------------------------------------------------------------------
141 mpirun noticed that process rank 0 with PID 203946 on node a100-04 exited on signal 6 (Aborted).
142 --------------------------------------------------------------------------

Please advise; the platform's OpenMPI is CUDA-aware:

(cugraph-ldgpu2) [ghos167@deception02 openmpi-4.1.4]$ $HOME/builds/openmpi-4.1.4-cuda12/bin/ompi_info --parsable --all | grep mpi_built_with_cuda_support:value
  mca:mpi:base:param:mpi_built_with_cuda_support:value:true

ChuckHastings commented 3 months ago

The first thing I would check is whether you're actually getting what you asked for.

Try adding an nvidia-smi call before calling mpirun in order to show the allocation you're getting.

You might also set the following environment variable:

export NCCL_DEBUG=TRACE

This will litter your output file with debugging messages, but it looks like there might be a comms issue (it's failing in an allgather).

sg0 commented 2 months ago

Attaching the error and output files with the debug info.

CUGXX_6681940_4294967294.err.txt CUGXX_6681940_4294967294.out.txt

ChuckHastings commented 2 months ago

Thanks, I will review.

sg0 commented 2 months ago

As you may have found, this issue happens in the graph distribution phase, and it might be due to a GPU not getting a partition since the test graph is relatively small (this is easy to fix: just throw an exception if a GPU ends up with empty buffers). Running this on 2 GPUs works at my end, but with this error:

*** The MPI_Comm_free() function was called after MPI_FINALIZE was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[a100-04:29501] Local abort after MPI_FINALIZE started completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** The MPI_Comm_free() function was called after MPI_FINALIZE was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[a100-04:29500] Local abort after MPI_FINALIZE started completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!

This is probably because MPI_Comm_free is invoked in the handle destructor, which runs after MPI_Finalize, leading to a memory leak.
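
A minimal sketch of the teardown ordering that avoids this, assuming the handle is held in a std::unique_ptr and that main owns MPI_Init/MPI_Finalize (the comms and graph setup are elided):

#include <raft/core/handle.hpp>

#include <mpi.h>

#include <memory>

int main(int argc, char** argv)
{
  MPI_Init(&argc, &argv);
  {
    auto handle = std::make_unique<raft::handle_t>();

    // ... attach the NCCL/MPI sub-communicators to the handle and run the
    // multi-GPU graph algorithms here, as mg_graph_algorithms.cpp does ...

    // Release the handle (and the communicators it owns) while MPI is still
    // initialized; their destructors may call MPI_Comm_free, which is
    // disallowed after MPI_FINALIZE.
    handle.reset();
  } // nothing that frees MPI resources in a destructor should outlive this scope
  MPI_Finalize();
  return 0;
}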

Another related question: what is the easiest way to read and distribute Matrix Market files (from the SuiteSparse collection) for use in C++ MG graph codes? Is there a function in the cugraph utilities that can be used?

ChuckHastings commented 2 months ago

Sorry, I had not had a chance to look through your logs last week.

These tests were written to get folks started in calling our C++ code directly. I think you have identified a couple of edge conditions that aren't being handled properly in these tests.

About these specific issues:

Regarding reading Matrix Market files... we have functions within our test suite that can be used for this:

There's no fundamental difference (you can look in the code). If you want to tweak the edge list in some way before creating the graph, you should probably use the first; otherwise the second is less code to manage.

These are less than optimal (we only use them for testing). The biggest issue is that each GPU reads the entire MTX file and then filters it down to the subset it cares about, which means you need sufficient GPU memory on each node to hold the entire edge list. I created a function somewhere (never merged into the code base) that reads a different block of data on each GPU, parses it, and then shuffles the parsed data to the proper GPU. You could adapt the code I linked to have each GPU read the file in blocks and keep only the edges that are relevant to it. That would let you manage the memory size, but it would still result in duplicate computation.
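
As an illustration of that block-and-filter approach, here is a host-side sketch (not one of the cugraph test utilities; the hash on the source vertex stands in for the real edge partitioning, and edge weights and the symmetric flag are ignored):

#include <mpi.h>

#include <cstdint>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// Stand-in for the real edge partitioning: assign each edge to a rank by
// hashing its (0-based) source vertex.
bool owned_by(int64_t src, int rank, int nranks) { return src % nranks == rank; }

int main(int argc, char** argv)
{
  MPI_Init(&argc, &argv);
  int rank{}, nranks{};
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nranks);

  std::ifstream in(argv[1]);        // path to the .mtx file
  std::string line;
  std::vector<int64_t> srcs, dsts;  // this rank's host edge list
  bool size_line_seen = false;

  while (std::getline(in, line)) {
    if (line.empty() || line[0] == '%') continue;              // banner / comments
    if (!size_line_seen) { size_line_seen = true; continue; }  // "rows cols nnz"
    std::istringstream ls(line);
    int64_t u{}, v{};
    if (!(ls >> u >> v)) continue;
    if (owned_by(u - 1, rank, nranks)) {  // Matrix Market indices are 1-based
      srcs.push_back(u - 1);
      dsts.push_back(v - 1);
    }
  }

  // srcs/dsts can now be copied to the device and fed into the same graph
  // construction path the example program uses for its generated edge list.

  MPI_Finalize();
  return 0;
}

Every rank still scans the whole file (the duplicate work mentioned above), but only its own subset of edges is ever held in memory.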

sg0 commented 2 months ago

Thanks. Even after invoking handle.reset before MPI_Finalize, I am getting some errors from MPI, but they are not on the critical path. I am closing the issue.