and inter-node collectives fail with the following message:
select.c:450 UCX ERROR no active messages transport to <no debug data>: posix/memory - Destination is unreachable, sysv/memory - Destination is unreachable, rocm_copy/rocm_cpy - no am bcopy, rocm_ipc/rocm_ipc - no am bcopy, rocm_gdr/rocm_gdr - no am bcopy, cma/memory - no am bcopy
pml_ucx.c:385 Error: ucp_ep_create(proc=8) failed: Destination is unreachable
Interestingly, the OSU pt2pt benchmarks work with the D D arguments.
Furthermore, I have confirmed that the check_large_bar sanity check from the given tutorial works on this system.
We would be really interested in getting this working in order to add OMPI+UCX to our MPIs on this system.
Log file (UCX configured with "--enable-logging" and run with "UCX_LOG_LEVEL=data"). The following is the output of osu_allreduce with 4 ranks on one node:
corona152
corona152
corona152
corona152
Warning: OMB could not identify the local rank of the process.
This can lead to multiple processes using the same GPU.
Please use the get_local_rank script in the OMB repo for this.
Warning: OMB could not identify the local rank of the process.
This can lead to multiple processes using the same GPU.
Please use the get_local_rank script in the OMB repo for this.
Warning: OMB could not identify the local rank of the process.
This can lead to multiple processes using the same GPU.
Please use the get_local_rank script in the OMB repo for this.
Warning: OMB could not identify the local rank of the process.
This can lead to multiple processes using the same GPU.
Please use the get_local_rank script in the OMB repo for this.
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.
Local host: corona152
Local device: mlx5_0
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.
Local host: corona152
Local device: mlx5_0
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.
Local host: corona152
Local device: mlx5_0
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.
Local host: corona152
Local device: mlx5_0
--------------------------------------------------------------------------
...
[1611610759.881867] [corona152:22952:0] ucp_request.inl:165 UCX REQ completing send request 0x7fffffffa920 (0x7fffffffaa20) ------- Success
[1611610759.881870] [corona152:22952:0] tag_send.c:253 UCX REQ send_nbx buffer (nil) count 0 tag fffff00000000000 to <no debug data>
[1611610759.881872] [corona152:22952:0] tag_send.c:84 UCX REQ select tag request(0x7fffffffa920) progress algorithm datatype=0x8 buffer=(nil) length=0 mem_type:host max_short=92 rndv_thresh=262144 zcopy_thresh=262144 zcopy_enabled=0
[1611610759.881875] [corona152:22952:0] mm_ep.c:280 UCX DATA TX: AM_SHORT am_id 2 len 8 EGR_O tag fffff00000000000
[1611610759.881881] [corona152:22952:0] mm_ep.c:117 UCX TRACE sent wakeup from socket 29 to 0xa90198
[1611610759.881882] [corona152:22952:0] ucp_request.inl:165 UCX REQ completing send request 0x7fffffffa920 (0x7fffffffaa20) ------- Success
[1611610759.881884] [corona152:22952:0] tag_send.c:253 UCX REQ send_nbx buffer (nil) count 0 tag fffff00000000000 to <no debug data>
[1611610759.881887] [corona152:22952:0] tag_send.c:84 UCX REQ select tag request(0x7fffffffa920) progress algorithm datatype=0x8 buffer=(nil) length=0 mem_type:host max_short=92 rndv_thresh=262144 zcopy_thresh=262144 zcopy_enabled=0
[1611610759.881889] [corona152:22952:0] mm_ep.c:280 UCX DATA TX: AM_SHORT am_id 2 len 8 EGR_O tag fffff00000000000
[1611610759.881895] [corona152:22952:0] mm_ep.c:117 UCX TRACE sent wakeup from socket 29 to 0xa902c8
[1611610759.881897] [corona152:22952:0] ucp_request.inl:165 UCX REQ completing send request 0x7fffffffa920 (0x7fffffffaa20) ------- Success
[1611610759.881949] [corona152:22952:0] tag_recv.c:218 UCX REQ allocated request 0xaedbc0
[1611610759.881951] [corona152:22952:0] tag_recv.c:40 UCX REQ req 0xaedbc0: recv_nbx buffer 0x2aabcb200000 dt 0x8 count 4 tag fffff40000100000/ffffffffffffffff
[1611610759.881953] [corona152:22952:0] tag_recv.c:128 UCX REQ recv_nbx returning expected request 0xaedbc0 (0xaedcc0)
[1611610759.881955] [corona152:22952:0] tag_send.c:253 UCX REQ send_nbx buffer 0xadfcd0 count 4 tag fffff40000000000 to <no debug data>
[1611610759.881957] [corona152:22952:0] tag_send.c:84 UCX REQ select tag request(0x7fffffffa6a0) progress algorithm datatype=0x8 buffer=0xadfcd0 length=4 mem_type:host max_short=92 rndv_thresh=262144 zcopy_thresh=262144 zcopy_enabled=0
[1611610759.881960] [corona152:22952:0] mm_ep.c:280 UCX DATA TX: AM_SHORT am_id 2 len 12 EGR_O tag fffff40000000000
[1611610759.881961] [corona152:22952:0] ucp_request.inl:165 UCX REQ completing send request 0x7fffffffa6a0 (0x7fffffffa7a0) ------- Success
[1611610759.881963] [corona152:22952:0] mm_iface.c:232 UCX DATA RX: AM_SHORT am_id 2 len 12 EGR_O tag fffff40000100000
[1611610759.881965] [corona152:22952:0] tag_match.inl:119 UCX DATA checking req 0xaedbc0 tag fffff40000100000/ffffffffffffffff with tag fffff40000100000
[1611610759.881967] [corona152:22952:0] tag_match.inl:121 UCX REQ matched received tag fffff40000100000 to req 0xaedbc0
[1611610759.881969] [corona152:22952:0] eager_rcv.c:25 UCX REQ found req 0xaedbc0
[1611610759.881971] [corona152:22952:0] ucp_request.inl:547 UCX REQ req 0xaedbc0: unpack recv_data req_len 4 data_len 4 offset 0 last: yes
${HOME}/opt/osumb/bin/corona/openmpi/collective/osu_allreduce: symbol lookup error: ${HOME}/corona/opt/ucx-1.10.0/lib/ucx/libuct_rocm_gdr.so.0: undefined symbol: gdr_copy_to_bar
[1611610759.881861] [corona152:22953:0] tag_match.inl:119 UCX DATA checking req 0x7fffffffa920 tag fffff00000000000/ffffffffffffffff with tag fffff00000000000
${HOME}/opt/osumb/bin/corona/openmpi/collective/osu_allreduce: symbol lookup error: ${HOME}/corona/opt/ucx-1.10.0/lib/ucx/libuct_rocm_gdr.so.0: undefined symbol: gdr_copy_to_bar
[1611610759.881881] [corona152:22954:0] tag_match.inl:119 UCX DATA checking req 0x7fffffffa920 tag fffff00000000000/ffffffffffffffff with tag fffff00000000000
${HOME}/opt/osumb/bin/corona/openmpi/collective/osu_allreduce: symbol lookup error: ${HOME}/corona/opt/ucx-1.10.0/lib/ucx/libuct_rocm_gdr.so.0: undefined symbol: gdr_copy_to_bar
[1611610759.881892] [corona152:22955:0] mm_iface.c:232 UCX DATA RX: AM_SHORT am_id 2 len 8 EGR_O tag fffff00000000000
${HOME}/opt/osumb/bin/corona/openmpi/collective/osu_allreduce: symbol lookup error: ${HOME}/corona/opt/ucx-1.10.0/lib/ucx/libuct_rocm_gdr.so.0: undefined symbol: gdr_copy_to_bar
[1611610759.881985] [corona152:22952:0] ucp_mm.c:122 UCX TRACE registered address 0x2aabcb200000 length 4 on md[4] memh[0]=0xadf
[1611610759.881896] [corona152:22955:0] tag_match.
srun: error: corona152: tasks 0-3: Exited with exit code 127
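Since the four ranks write to the same stream, the actual failure is easy to miss among the trace lines. A minimal sketch of a filter that pulls the symbol lookup errors out of such a log is shown here. The function name list_symbol_errors and the file name run.log are made up for illustration, and the filter only looks at the log text, not at the library itself.

```shell
# Print "<library> <symbol>" once for every distinct symbol lookup failure
# read from stdin. This is a plain text filter over the captured job output.
list_symbol_errors() {
    sed -n 's/.*symbol lookup error: \(.*\): undefined symbol: \([A-Za-z0-9_]*\).*/\1 \2/p' | sort -u
}

# Usage (run.log is a hypothetical file holding the captured output):
#   list_symbol_errors < run.log
#
# To confirm that the module really references the symbol, GNU nm can list
# its undefined dynamic symbols (path taken from the error line):
#   nm -D --undefined-only /path/to/lib/ucx/libuct_rocm_gdr.so.0 | grep gdr_copy_to_bar
```

For the log above, this should reduce the output to a single line naming libuct_rocm_gdr.so.0 and gdr_copy_to_bar.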
This appears to be related to Issue #4489.
A temporary workaround for ROCm is to insert --without-gdrcopy in your config line so that the rocm_gdr transport doesn't get built in the first place.
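As a rough sketch of what that could look like, the snippet below takes the configure flags from the original report and appends --without-gdrcopy. The install prefix and the helper name ucx_configure_cmd are assumptions for illustration only. The helper just prints the command so it can be reviewed before running it in the UCX source tree.

```shell
# Configure flags as given in the report.
UCX_FLAGS="--disable-dependency-tracking --disable-logging --disable-debug"
UCX_FLAGS="$UCX_FLAGS --disable-assertions --disable-params-check --enable-mt --enable-cma"
UCX_FLAGS="$UCX_FLAGS --with-cm --with-rdmacm --with-rocm --with-verbs --without-cuda"
UCX_FLAGS="$UCX_FLAGS --without-knem --without-xpmem --without-ugni --without-java"
# The suggested workaround: do not build the rocm_gdr transport.
UCX_FLAGS="$UCX_FLAGS --without-gdrcopy"

# Hypothetical install prefix; adjust to the local layout.
UCX_PREFIX="${UCX_PREFIX:-$HOME/opt/ucx-no-gdrcopy}"

# Print the configure command instead of executing it.
ucx_configure_cmd() {
    echo "./configure --prefix=$UCX_PREFIX $UCX_FLAGS"
}

ucx_configure_cmd
# After building and installing, the rocm_gdr module should presumably be
# absent from the install tree:
#   ls "$UCX_PREFIX"/lib/ucx/ | grep rocm_gdr
```

If the workaround applies, the symbol lookup error for gdr_copy_to_bar should no longer appear, since the library that references it is not built.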
OpenMPI and UCX Issue for UCX Github
Describe the bug
When following the UCX+ROCm tutorial here, intra-node collectives, such as osu_allreduce -d rocm, fail with the following message:
and inter-node collectives fail with the following message:
Interestingly, the OSU pt2pt benchmarks work with the D D arguments. Furthermore, I have confirmed that the check_large_bar sanity check from the given tutorial works on this system.
We would be really interested in getting this working in order to add OMPI+UCX to our MPIs on this system.
Thanks!
Steps to Reproduce
Command line: osu_allreduce -d rocm -m 2:4194304 -x 20 -f
UCX configure flags: --disable-dependency-tracking --disable-logging --disable-debug --disable-assertions --disable-params-check --enable-mt --enable-cma --with-cm --with-rdmacm --with-rocm --with-verbs --without-cuda --without-knem --without-xpmem --without-ugni --without-java
UCX environment variables: UCX_TLS=cma,mm,rocm,rocm_copy,rocm_gdr,rocm_ipc
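Put together, the reproduction might look roughly like the sketch below. The srun launcher and the 1-node, 4-rank layout are inferred from the captured log, the helper name repro_cmd is made up for illustration, and osu_allreduce is assumed to be on PATH. The helper only prints the command line.

```shell
# Transport list from the report.
export UCX_TLS=cma,mm,rocm,rocm_copy,rocm_gdr,rocm_ipc
# Verbose logging as requested by the issue template; this needs a UCX build
# configured with --enable-logging rather than --disable-logging.
export UCX_LOG_LEVEL=data

# Print the launch command for a given node count ($1) and rank count ($2).
repro_cmd() {
    echo "srun -N $1 -n $2 osu_allreduce -d rocm -m 2:4194304 -x 20 -f"
}

repro_cmd 1 4   # intra-node case: 4 ranks on one node
repro_cmd 2 4   # an inter-node variant (node count chosen arbitrarily)
```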
Setup and versions
cat /etc/issue or cat /etc/redhat-release + uname -a
3.10.20465
113-D1631200-111
5.6.19
Additional information (depending on the issue)
ucx_info -d to show transports and devices recognized by UCX...