openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org

Open UCX 1.15.0 with Open MPI 5.0.2 and CUDA 12.3 - cudaHostUnregister call error about pointer does not correspond to a registered memory region #9716

Closed wsdal closed 5 months ago

wsdal commented 6 months ago

Describe the bug

Background: Following the documentation for Open UCX and Open MPI (OpenSHMEM + CUDA support), I configured and installed both, and the configuration and info tools confirmed what I assumed was a correct setup. I was also able to run the benchmarks from https://mvapich.cse.ohio-state.edu/benchmarks/ without errors.

The issue: in this environment, I get the following errors when calling shmem_finalize() at the end of my application.

[1709076623.728064] [GPU-LNX-CLUSTER-01:1374466:0]            event.c:244  UCX  TRACE ucm_vm_munmap(addr=0x7f0082001000 length=33550336)
[1709076623.728069] [GPU-LNX-CLUSTER-01:1374466:0]            event.h:73   UCX  TRACE vm_unmap addr=0x7f0082001000 length=33550336
[1709076623.728514] [GPU-LNX-CLUSTER-01:1374466:0]    cuda_copy_md.c:182  UCX  ERROR cudaHostUnregister(address)() failed: pointer does not correspond to a registered memory region
[1709076623.728522] [GPU-LNX-CLUSTER-01:1374466:0]          ucp_mm.c:332  UCX  WARN  failed to dereg from md[4]=cuda_cpy: Input/output error
[1709076623.728531] [GPU-LNX-CLUSTER-01:1374466:0]    cuda_copy_md.c:182  UCX  ERROR cudaHostUnregister(address)() failed: pointer does not correspond to a registered memory region
[1709076623.728534] [GPU-LNX-CLUSTER-01:1374466:0]          ucp_mm.c:332  UCX  WARN  failed to dereg from md[4]=cuda_cpy: Input/output error
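
For anyone triaging output like the lines above, here is a minimal sketch that splits a UCX log line into fields and keeps only the ERROR and WARN entries. The regular expression is inferred from the lines quoted in this report, not taken from a UCX format specification, and the names `parse_ucx_line` and `problems` are hypothetical helpers. The meaning of the third bracketed number (the `0` after the PID) is an assumption.

```python
import re

# Pattern inferred from the log lines quoted in this issue (an assumption,
# not an official UCX log grammar).
UCX_LINE = re.compile(
    r"\[(?P<ts>\d+\.\d+)\]\s+"
    r"\[(?P<host>[^:\]]+):(?P<pid>\d+):(?P<thread>\d+)\]\s+"
    r"(?P<src>\S+):(?P<line>\d+)\s+UCX\s+"
    r"(?P<level>[A-Z]+)\s+(?P<msg>.*)"
)


def parse_ucx_line(text):
    """Return a dict of fields for one log line, or None if it does not match."""
    m = UCX_LINE.match(text.strip())
    if m is None:
        return None
    rec = m.groupdict()
    rec["ts"] = float(rec["ts"])
    rec["pid"] = int(rec["pid"])
    rec["thread"] = int(rec["thread"])  # assumed to be a thread index
    rec["line"] = int(rec["line"])
    return rec


def problems(lines):
    """Keep only the parsed entries logged at ERROR or WARN level."""
    parsed = (parse_ucx_line(l) for l in lines)
    return [r for r in parsed if r and r["level"] in ("ERROR", "WARN")]
```

Feeding the six lines above through `problems` should leave the two `cudaHostUnregister` errors and the two `failed to dereg` warnings, which makes it easier to see which source locations (`cuda_copy_md.c:182`, `ucp_mm.c:332`) are involved.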
shmem_finalize
```
before shmem_finalize
[1709071564.073188] [GPU-LNX-CLUSTER-01:1372671:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x7f0525e00000 length=270532608)
[1709071564.073190] [GPU-LNX-CLUSTER-01:1372671:0] event.h:73 UCX TRACE vm_unmap addr=0x7f0525e00000 length=270532608
[1709071564.073519] [GPU-LNX-CLUSTER-01:1372671:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x7f0534001000 length=33550336)
[1709071564.073523] [GPU-LNX-CLUSTER-01:1372671:0] event.h:73 UCX TRACE vm_unmap addr=0x7f0534001000 length=33550336
[1709071564.073736] [GPU-LNX-CLUSTER-01:1372671:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x200200000 length=2097152)
[1709071564.073739] [GPU-LNX-CLUSTER-01:1372671:0] event.h:73 UCX TRACE vm_unmap addr=0x200200000 length=2097152
[1709071564.081537] [GPU-LNX-CLUSTER-01:1372671:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x200400000 length=58720256)
[1709071564.081540] [GPU-LNX-CLUSTER-01:1372671:0] event.h:73 UCX TRACE vm_unmap addr=0x200400000 length=58720256
[1709071564.082780] [GPU-LNX-CLUSTER-01:1372671:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x7f0550fdf000 length=4096)
[1709071564.082783] [GPU-LNX-CLUSTER-01:1372671:0] event.h:73 UCX TRACE vm_unmap addr=0x7f0550fdf000 length=4096
[1709071564.083196] [GPU-LNX-CLUSTER-01:1372671:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x204a00000 length=2097152)
[1709071564.083200] [GPU-LNX-CLUSTER-01:1372671:0] event.h:73 UCX TRACE vm_unmap addr=0x204a00000 length=2097152
[1709071564.083474] [GPU-LNX-CLUSTER-01:1372671:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x204c00000 length=2097152)
[1709071564.083477] [GPU-LNX-CLUSTER-01:1372671:0] event.h:73 UCX TRACE vm_unmap addr=0x204c00000 length=2097152
[1709071564.083643] [GPU-LNX-CLUSTER-01:1372671:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x7f0550fe0000 length=4096)
[1709071564.083645] [GPU-LNX-CLUSTER-01:1372671:0] event.h:73 UCX TRACE vm_unmap addr=0x7f0550fe0000 length=4096
[1709071564.083893] [GPU-LNX-CLUSTER-01:1372671:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x7f0550fe1000 length=4096)
[1709071564.083896] [GPU-LNX-CLUSTER-01:1372671:0] event.h:73 UCX TRACE vm_unmap addr=0x7f0550fe1000 length=4096
[1709071564.084103] [GPU-LNX-CLUSTER-01:1372671:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x7f0550fe2000 length=4096)
[1709071564.084105] [GPU-LNX-CLUSTER-01:1372671:0] event.h:73 UCX TRACE vm_unmap addr=0x7f0550fe2000 length=4096
[1709071564.084293] [GPU-LNX-CLUSTER-01:1372671:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x7f0550fe3000 length=4096)
[1709071564.084295] [GPU-LNX-CLUSTER-01:1372671:0] event.h:73 UCX TRACE vm_unmap addr=0x7f0550fe3000 length=4096
[1709071564.084484] [GPU-LNX-CLUSTER-01:1372671:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x7f0550fe4000 length=4096)
[1709071564.084486] [GPU-LNX-CLUSTER-01:1372671:0] event.h:73 UCX TRACE vm_unmap addr=0x7f0550fe4000 length=4096
[1709071564.084683] [GPU-LNX-CLUSTER-01:1372671:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x7f0550fe5000 length=4096)
[1709071564.084685] [GPU-LNX-CLUSTER-01:1372671:0] event.h:73 UCX TRACE vm_unmap addr=0x7f0550fe5000 length=4096
[1709071564.084881] [GPU-LNX-CLUSTER-01:1372671:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x7f0550fe6000 length=4096)
[1709071564.084883] [GPU-LNX-CLUSTER-01:1372671:0] event.h:73 UCX TRACE vm_unmap addr=0x7f0550fe6000 length=4096
[1709071564.086356] [GPU-LNX-CLUSTER-01:1372671:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x7f0550fe7000 length=4096)
[1709071564.086359] [GPU-LNX-CLUSTER-01:1372671:0] event.h:73 UCX TRACE vm_unmap addr=0x7f0550fe7000 length=4096
[1709071564.086541] [GPU-LNX-CLUSTER-01:1372671:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x7f0550fe8000 length=4096)
[1709071564.086544] [GPU-LNX-CLUSTER-01:1372671:0] event.h:73 UCX TRACE vm_unmap addr=0x7f0550fe8000 length=4096
[1709071564.086717] [GPU-LNX-CLUSTER-01:1372671:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x7f0550fe9000 length=4096)
[1709071564.086720] [GPU-LNX-CLUSTER-01:1372671:0] event.h:73 UCX TRACE vm_unmap addr=0x7f0550fe9000 length=4096
[1709071564.086878] [GPU-LNX-CLUSTER-01:1372671:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x7f0550fea000 length=4096)
[1709071564.086880] [GPU-LNX-CLUSTER-01:1372671:0] event.h:73 UCX TRACE vm_unmap addr=0x7f0550fea000 length=4096
[1709071564.087282] [GPU-LNX-CLUSTER-01:1372671:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x7f0550feb000 length=4096)
[1709071564.087286] [GPU-LNX-CLUSTER-01:1372671:0] event.h:73 UCX TRACE vm_unmap addr=0x7f0550feb000 length=4096
[1709071564.087505] [GPU-LNX-CLUSTER-01:1372671:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x7f0550fec000 length=4096)
[1709071564.087508] [GPU-LNX-CLUSTER-01:1372671:0] event.h:73 UCX TRACE vm_unmap addr=0x7f0550fec000 length=4096
[1709071564.087703] [GPU-LNX-CLUSTER-01:1372671:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x7f0550fed000 length=4096)
[1709071564.087705] [GPU-LNX-CLUSTER-01:1372671:0] event.h:73 UCX TRACE vm_unmap addr=0x7f0550fed000 length=4096
[1709071564.087899] [GPU-LNX-CLUSTER-01:1372671:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x7f0550fee000 length=4096)
[1709071564.087901] [GPU-LNX-CLUSTER-01:1372671:0] event.h:73 UCX TRACE vm_unmap addr=0x7f0550fee000 length=4096
[1709071564.088317] [GPU-LNX-CLUSTER-01:1372671:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x7f0550fef000 length=4096)
[1709071564.088321] [GPU-LNX-CLUSTER-01:1372671:0] event.h:73 UCX TRACE vm_unmap addr=0x7f0550fef000 length=4096
[1709071564.088514] [GPU-LNX-CLUSTER-01:1372671:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x7f0550ff0000 length=4096)
[1709071564.088516] [GPU-LNX-CLUSTER-01:1372671:0] event.h:73 UCX TRACE vm_unmap addr=0x7f0550ff0000 length=4096
[1709071564.088691] [GPU-LNX-CLUSTER-01:1372671:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x7f0550ff1000 length=4096)
[1709071564.088693] [GPU-LNX-CLUSTER-01:1372671:0] event.h:73 UCX TRACE vm_unmap addr=0x7f0550ff1000 length=4096
[1709071564.088858] [GPU-LNX-CLUSTER-01:1372671:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x7f0550ff2000 length=4096)
[1709071564.088861] [GPU-LNX-CLUSTER-01:1372671:0] event.h:73 UCX TRACE vm_unmap addr=0x7f0550ff2000 length=4096
[1709071564.089376] [GPU-LNX-CLUSTER-01:1372671:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x204e00000 length=2097152)
[1709071564.089380] [GPU-LNX-CLUSTER-01:1372671:0] event.h:73 UCX TRACE vm_unmap addr=0x204e00000 length=2097152
[1709071564.094125] [GPU-LNX-CLUSTER-01:1372671:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x205000000 length=2097152)
[1709071564.094128] [GPU-LNX-CLUSTER-01:1372671:0] event.h:73 UCX TRACE vm_unmap addr=0x205000000 length=2097152
[1709071564.094382] [GPU-LNX-CLUSTER-01:1372671:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x205200000 length=2097152)
[1709071564.094386] [GPU-LNX-CLUSTER-01:1372671:0] event.h:73 UCX TRACE vm_unmap addr=0x205200000 length=2097152
[1709071564.094989] [GPU-LNX-CLUSTER-01:1372671:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x7f0536400000 length=2097152)
[1709071564.094993] [GPU-LNX-CLUSTER-01:1372671:0] event.h:73 UCX TRACE vm_unmap addr=0x7f0536400000 length=2097152
[1709071564.095084] [GPU-LNX-CLUSTER-01:1372671:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x7f0536600000 length=2097152)
[1709071564.095087] [GPU-LNX-CLUSTER-01:1372671:0] event.h:73 UCX TRACE vm_unmap addr=0x7f0536600000 length=2097152
[1709071564.095795] [GPU-LNX-CLUSTER-01:1372671:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x205600000 length=2097152)
[1709071564.095798] [GPU-LNX-CLUSTER-01:1372671:0] event.h:73 UCX TRACE vm_unmap addr=0x205600000 length=2097152
[1709071564.095992] [GPU-LNX-CLUSTER-01:1372671:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x7f0536800000 length=2097152)
[1709071564.095994] [GPU-LNX-CLUSTER-01:1372671:0] event.h:73 UCX TRACE vm_unmap addr=0x7f0536800000 length=2097152
[1709071564.097030] [GPU-LNX-CLUSTER-01:1372671:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x205a00000 length=2097152)
[1709071564.097033] [GPU-LNX-CLUSTER-01:1372671:0] event.h:73 UCX TRACE vm_unmap addr=0x205a00000 length=2097152
[1709071564.097923] [GPU-LNX-CLUSTER-01:1372671:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x7f0537000000 length=2097152)
[1709071564.097927] [GPU-LNX-CLUSTER-01:1372671:0] event.h:73 UCX TRACE vm_unmap addr=0x7f0537000000 length=2097152
[1709071564.101628] [GPU-LNX-CLUSTER-01:1372671:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x7f0537200000 length=2097152)
[1709071564.101633] [GPU-LNX-CLUSTER-01:1372671:0] event.h:73 UCX TRACE vm_unmap addr=0x7f0537200000 length=2097152
[1709071564.102362] [GPU-LNX-CLUSTER-01:1372671:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x7f0537400000 length=1536000)
[1709071564.102366] [GPU-LNX-CLUSTER-01:1372671:0] event.h:73 UCX TRACE vm_unmap addr=0x7f0537400000 length=1536000
[1709071564.103255] [GPU-LNX-CLUSTER-01:1372671:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x7f0532001000 length=33550336)
[1709071564.103260] [GPU-LNX-CLUSTER-01:1372671:0] event.h:73 UCX TRACE vm_unmap addr=0x7f0532001000 length=33550336
[1709071564.103742] [GPU-LNX-CLUSTER-01:1372671:0] cuda_copy_md.c:182 UCX ERROR cudaHostUnregister(address)() failed: pointer does not correspond to a registered memory region
[1709071564.103752] [GPU-LNX-CLUSTER-01:1372671:0] ucp_mm.c:332 UCX WARN failed to dereg from md[2]=cuda_cpy: Input/output error
[1709071564.103761] [GPU-LNX-CLUSTER-01:1372671:0] cuda_copy_md.c:182 UCX ERROR cudaHostUnregister(address)() failed: pointer does not correspond to a registered memory region
[1709071564.103763] [GPU-LNX-CLUSTER-01:1372671:0] ucp_mm.c:332 UCX WARN failed to dereg from md[2]=cuda_cpy: Input/output error
[1709071564.103785] [GPU-LNX-CLUSTER-01:1372671:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0xff000000 length=270532608)
[1709071564.103788] [GPU-LNX-CLUSTER-01:1372671:0] event.h:73 UCX TRACE vm_unmap addr=0xff000000 length=270532608
[1709071564.124599] [GPU-LNX-CLUSTER-01:1372671:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x7f0568c2e000 length=12288)
[1709071564.124604] [GPU-LNX-CLUSTER-01:1372671:0] event.h:73 UCX TRACE vm_unmap addr=0x7f0568c2e000 length=12288
[1709071564.124617] [GPU-LNX-CLUSTER-01:1372671:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x7f0568c2b000 length=8447)
[1709071564.124618] [GPU-LNX-CLUSTER-01:1372671:0] event.h:73 UCX TRACE vm_unmap addr=0x7f0568c2b000 length=8447
[1709071564.124949] [GPU-LNX-CLUSTER-01:1372671:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x7f055cdac000 length=4296704)
[1709071564.124951] [GPU-LNX-CLUSTER-01:1372671:0] event.h:73 UCX TRACE vm_unmap addr=0x7f055cdac000 length=4296704
[1709071564.125168] [GPU-LNX-CLUSTER-01:1372671:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x7f0568c34000 length=12288)
[1709071564.125170] [GPU-LNX-CLUSTER-01:1372671:0] event.h:73 UCX TRACE vm_unmap addr=0x7f0568c34000 length=12288
[1709071564.125227] [GPU-LNX-CLUSTER-01:1372671:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x7f055c993000 length=4296704)
[1709071564.125229] [GPU-LNX-CLUSTER-01:1372671:0] event.h:73 UCX TRACE vm_unmap addr=0x7f055c993000 length=4296704
[1709071564.125485] [GPU-LNX-CLUSTER-01:1372671:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x7f0568c31000 length=12288)
[1709071564.125486] [GPU-LNX-CLUSTER-01:1372671:0] event.h:73 UCX TRACE vm_unmap addr=0x7f0568c31000 length=12288
[1709071564.125528] [GPU-LNX-CLUSTER-01:1372671:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x7f0567084000 length=131072)
[1709071564.125529] [GPU-LNX-CLUSTER-01:1372671:0] event.h:73 UCX TRACE vm_unmap addr=0x7f0567084000 length=131072
[1709071564.125548] [GPU-LNX-CLUSTER-01:1372671:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x7f05670a4000 length=176128)
[1709071564.125550] [GPU-LNX-CLUSTER-01:1372671:0] event.h:73 UCX TRACE vm_unmap addr=0x7f05670a4000 length=176128
[1709071564.126012] [GPU-LNX-CLUSTER-01:1372671:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x7f056902d000 length=12288)
[1709071564.126015] [GPU-LNX-CLUSTER-01:1372671:0] event.h:73 UCX TRACE vm_unmap addr=0x7f056902d000 length=12288
[1709071564.167521] [GPU-LNX-CLUSTER-01:1372671:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x7f05670cf000 length=139264)
[1709071564.167524] [GPU-LNX-CLUSTER-01:1372671:0] event.h:73 UCX TRACE vm_unmap addr=0x7f05670cf000 length=139264
[1709071564.167661] [GPU-LNX-CLUSTER-01:1372671:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x7f056418c000 length=4296704)
[1709071564.167663] [GPU-LNX-CLUSTER-01:1372671:0] event.h:73 UCX TRACE vm_unmap addr=0x7f056418c000 length=4296704
[1709071564.168053] [GPU-LNX-CLUSTER-01:1372671:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x7f056bafb000 length=12288)
[1709071564.168055] [GPU-LNX-CLUSTER-01:1372671:0] event.h:73 UCX TRACE vm_unmap addr=0x7f056bafb000 length=12288
[1709071564.168111] [GPU-LNX-CLUSTER-01:1372671:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x7f055d1c5000 length=4296704)
[1709071564.168113] [GPU-LNX-CLUSTER-01:1372671:0] event.h:73 UCX TRACE vm_unmap addr=0x7f055d1c5000 length=4296704
[1709071564.168386] [GPU-LNX-CLUSTER-01:1372671:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x7f05694ae000 length=12288)
[1709071564.168388] [GPU-LNX-CLUSTER-01:1372671:0] event.h:73 UCX TRACE vm_unmap addr=0x7f05694ae000 length=12288
[1709071564.168408] [GPU-LNX-CLUSTER-01:1372671:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x7f05670f1000 length=131072)
[1709071564.168410] [GPU-LNX-CLUSTER-01:1372671:0] event.h:73 UCX TRACE vm_unmap addr=0x7f05670f1000 length=131072
[1709071564.168420] [GPU-LNX-CLUSTER-01:1372671:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x7f0567111000 length=176128)
[1709071564.168422] [GPU-LNX-CLUSTER-01:1372671:0] event.h:73 UCX TRACE vm_unmap addr=0x7f0567111000 length=176128
after shmem_finalize
```
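
In both quoted runs, the entry immediately before the first `cudaHostUnregister` error is a `ucm_vm_munmap` of length 33550336. The sketch below shows one way to pull that context out of a trace mechanically: it splits a whitespace-flattened log back into entries and pairs each ERROR with the most recent `ucm_vm_munmap` seen before it. The helper names (`split_entries`, `munmap_before_errors`) are hypothetical, the timestamp shape is assumed from the dump above, and adjacency in the log does not by itself show that the unmapped range is the one that failed to deregister.

```python
import re

# Assumption: every entry starts with a "[<10 digits>.<6 digits>]" timestamp,
# as in the dump above.
ENTRY_START = re.compile(r"(?=\[\d{10}\.\d{6}\])")
MUNMAP = re.compile(r"ucm_vm_munmap\(addr=(0x[0-9a-f]+) length=(\d+)\)")


def split_entries(blob):
    """Split a whitespace-flattened UCX log into one string per entry."""
    flat = " ".join(blob.split())          # undo arbitrary line wrapping
    lines = ENTRY_START.sub("\n", flat).splitlines()
    # Drop anything that is not a log entry (e.g. a leading marker line).
    return [l.strip() for l in lines if l.strip().startswith("[")]


def munmap_before_errors(entries):
    """Pair each ERROR entry with the last ucm_vm_munmap (addr, length) before it."""
    last = None
    pairs = []
    for entry in entries:
        m = MUNMAP.search(entry)
        if m:
            last = (int(m.group(1), 16), int(m.group(2)))
        elif " ERROR " in entry:
            pairs.append((last, entry))
    return pairs
```

A typical use would be `for (addr_len, err) in munmap_before_errors(split_entries(text)): print(addr_len, err)`. Any trailing text after the final entry (such as a closing marker) stays attached to that entry, which is harmless for this purpose.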

Steps to Reproduce

Setup and versions

Additional information (depending on the issue)

#
# Memory domain: self
#     Component: self
#             register: unlimited, cost: 0 nsec
#           remote key: 0 bytes
#           rkey_ptr is supported
#         memory types: host (access,reg_nonblock,reg,cache)
#
#      Transport: self
#         Device: memory
#           Type: loopback
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 0.00/ppn + 19360.00 MB/sec
#              latency: 0 nsec
#             overhead: 10 nsec
#            put_short: <= 4294967295
#            put_bcopy: unlimited
#            get_bcopy: unlimited
#             am_short: <= 8K
#             am_bcopy: <= 8K
#               domain: cpu
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 0 bytes
#        iface address: 8 bytes
#       error handling: ep_check
#
#
# Memory domain: tcp
#     Component: tcp
#             register: unlimited, cost: 0 nsec
#           remote key: 0 bytes
#         memory types: host (access,reg_nonblock,reg,cache)
#
#      Transport: tcp
#         Device: lo
#           Type: network
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 11.91/ppn + 0.00 MB/sec
#              latency: 10960 nsec
#             overhead: 50000 nsec
#            put_zcopy: <= 18446744073709551590, up to 6 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 0
#             am_short: <= 8K
#             am_bcopy: <= 8K
#             am_zcopy: <= 64K, up to 6 iov
#   am_opt_zcopy_align: <= 1
#         am_align_mtu: <= 0
#            am header: <= 8037
#           connection: to ep, to iface
#      device priority: 1
#     device num paths: 1
#              max eps: 256
#       device address: 18 bytes
#        iface address: 2 bytes
#           ep address: 10 bytes
#       error handling: peer failure, ep_check, keepalive
#
#      Transport: tcp
#         Device: enp90s0
#           Type: network
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 113.16/ppn + 0.00 MB/sec
#              latency: 5776 nsec
#             overhead: 50000 nsec
#            put_zcopy: <= 18446744073709551590, up to 6 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 0
#             am_short: <= 8K
#             am_bcopy: <= 8K
#             am_zcopy: <= 64K, up to 6 iov
#   am_opt_zcopy_align: <= 1
#         am_align_mtu: <= 0
#            am header: <= 8037
#           connection: to ep, to iface
#      device priority: 0
#     device num paths: 1
#              max eps: 256
#       device address: 6 bytes
#        iface address: 2 bytes
#           ep address: 10 bytes
#       error handling: peer failure, ep_check, keepalive
#
#
# Connection manager: tcp
#      max_conn_priv: 2064 bytes
#
# Memory domain: sysv
#     Component: sysv
#             allocate: unlimited
#           remote key: 12 bytes
#           rkey_ptr is supported
#         memory types: host (access,alloc,cache)
#
#      Transport: sysv
#         Device: memory
#           Type: intra-node
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 0.00/ppn + 15360.00 MB/sec
#              latency: 80 nsec
#             overhead: 10 nsec
#            put_short: <= 4294967295
#            put_bcopy: unlimited
#            get_bcopy: unlimited
#             am_short: <= 100
#             am_bcopy: <= 8256
#               domain: cpu
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 8 bytes
#        iface address: 8 bytes
#       error handling: ep_check
#
#
# Memory domain: posix
#     Component: posix
#             allocate: <= 32727588K
#           remote key: 24 bytes
#           rkey_ptr is supported
#         memory types: host (access,alloc,cache)
#
#      Transport: posix
#         Device: memory
#           Type: intra-node
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 0.00/ppn + 15360.00 MB/sec
#              latency: 80 nsec
#             overhead: 10 nsec
#            put_short: <= 4294967295
#            put_bcopy: unlimited
#            get_bcopy: unlimited
#             am_short: <= 100
#             am_bcopy: <= 8256
#               domain: cpu
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 8 bytes
#        iface address: 8 bytes
#       error handling: ep_check
#
#
# Memory domain: cuda_cpy
#     Component: cuda_cpy
#             allocate: unlimited
#             register: unlimited, cost: 0 nsec
#         memory types: host (reg), cuda (access,alloc,reg,cache,detect), cuda-managed (access,alloc,reg,cache,detect)
#
#      Transport: cuda_copy
#         Device: cuda
#           Type: accelerator
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 10000.00/ppn + 0.00 MB/sec
#              latency: 8000 nsec
#             overhead: 0 nsec
#            put_short: <= 4294967295
#            put_zcopy: unlimited, up to 1 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 1
#            get_short: <= 4294967295
#            get_zcopy: unlimited, up to 1 iov
#  get_opt_zcopy_align: <= 1
#        get_align_mtu: <= 1
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 0 bytes
#        iface address: 8 bytes
#       error handling: none
#
#
# Memory domain: cuda_ipc
#     Component: cuda_ipc
#             register: unlimited, cost: 0 nsec
#           remote key: 112 bytes
#           memory invalidation is supported
#         memory types: cuda (access,reg,cache)
#
#      Transport: cuda_ipc
#         Device: cuda
#           Type: intra-node
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 250000.00/ppn + 0.00 MB/sec
#              latency: 1000 nsec
#             overhead: 7000 nsec
#            put_zcopy: unlimited, up to 1 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 1
#            get_zcopy: <= 0, up to 1 iov
#  get_opt_zcopy_align: <= 1
#        get_align_mtu: <= 1
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 8 bytes
#        iface address: 4 bytes
#       error handling: peer failure, ep_check
#
#
# Memory domain: cma
#     Component: cma
#             register: unlimited, cost: 9 nsec
#         memory types: host (access,reg_nonblock,reg,cache)
#
#      Transport: cma
#         Device: memory
#           Type: intra-node
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 0.00/ppn + 11145.00 MB/sec
#              latency: 80 nsec
#             overhead: 2000 nsec
#            put_zcopy: unlimited, up to 16 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 1
#            get_zcopy: unlimited, up to 16 iov
#  get_opt_zcopy_align: <= 1
#        get_align_mtu: <= 1
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 8 bytes
#        iface address: 4 bytes
#       error handling: peer failure, ep_check
#
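
The listing above has the shape of `ucx_info -d` output, although the report does not name the command that produced it. As a rough aid for reading it, the following sketch extracts the `memory types` line of each `Memory domain` block so that the domains able to register host memory can be listed at a glance. The function names are hypothetical and the parsing rules are inferred only from the text shown here.

```python
import re


def memory_types_by_domain(text):
    """Map each '# Memory domain: X' block to {memory type: [capabilities]}."""
    result = {}
    domain = None
    for raw in text.splitlines():
        line = raw.lstrip("#").strip()      # drop the leading '#' and padding
        if line.startswith("Memory domain:"):
            domain = line.split(":", 1)[1].strip()
            result[domain] = {}
        elif line.startswith("memory types:") and domain is not None:
            spec = line.split(":", 1)[1]
            # Entries look like "host (reg)" or "cuda-managed (access,alloc,...)".
            for name, caps in re.findall(r"([\w-]+)\s*\(([^)]*)\)", spec):
                result[domain][name] = [c.strip() for c in caps.split(",")]
    return result


def host_registering_domains(parsed):
    """Return the domains whose 'host' memory type lists the 'reg' capability."""
    return [d for d, types in parsed.items() if "reg" in types.get("host", [])]
```

Applied to the listing in this report, the `cuda_cpy` entry reads `host (reg)`, so that domain is among those reported as registering host memory, which is consistent with the `failed to dereg from md[...]=cuda_cpy` warnings in the log.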

config.log ``` checking for gcc... gcc checking whether the C compiler works... yes checking for C compiler default output file name... a.out checking for suffix of executables... checking whether we are cross compiling... no checking for suffix of object files... o checking whether we are using the GNU C compiler... yes checking whether gcc accepts -g... yes checking for gcc option to accept ISO C89... none needed checking how to run the C preprocessor... gcc -E checking for grep that handles long lines and -e... /usr/bin/grep checking for egrep... /usr/bin/grep -E checking for ANSI C header files... yes checking for sys/types.h... yes checking for sys/stat.h... yes checking for stdlib.h... yes checking for string.h... yes checking for memory.h... yes checking for strings.h... yes checking for inttypes.h... yes checking for stdint.h... yes checking for unistd.h... yes checking minix/config.h usability... no checking minix/config.h presence... no checking for minix/config.h... no checking whether it is safe to define __EXTENSIONS__... yes checking for git... yes checking for a BSD-compatible install... /usr/bin/install -c checking whether build environment is sane... yes checking for a thread-safe mkdir -p... /usr/bin/mkdir -p checking for gawk... no checking for mawk... mawk checking whether make sets $(MAKE)... yes checking for style of include used by make... GNU checking whether make supports nested variables... yes checking whether UID '1001' is supported by ustar format... yes checking whether GID '1001' is supported by ustar format... yes checking how to create a ustar tar archive... gnutar checking dependency style of gcc... gcc3 checking whether make supports nested variables... (cached) yes checking whether to enable maintainer-specific portions of Makefiles... no checking for gcc... (cached) gcc checking whether we are using the GNU C compiler... (cached) yes checking whether gcc accepts -g... (cached) yes checking for gcc option to accept ISO C89... 
(cached) none needed checking for g++... g++ checking whether we are using the GNU C++ compiler... yes checking whether g++ accepts -g... yes checking dependency style of g++... gcc3 checking for gcc option to support OpenMP... -fopenmp checking dependency style of gcc... gcc3 checking whether ln -s works... yes checking for a sed that does not truncate output... /usr/bin/sed checking build system type... x86_64-unknown-linux-gnu checking host system type... x86_64-unknown-linux-gnu checking how to print strings... printf checking for a sed that does not truncate output... (cached) /usr/bin/sed checking for fgrep... /usr/bin/grep -F checking for ld used by gcc... /usr/bin/ld checking if the linker (/usr/bin/ld) is GNU ld... yes checking for BSD- or MS-compatible name lister (nm)... /usr/bin/nm -B checking the name lister (/usr/bin/nm -B) interface... BSD nm checking the maximum length of command line arguments... 1572864 checking whether the shell understands some XSI constructs... yes checking whether the shell understands "+="... yes checking how to convert x86_64-unknown-linux-gnu file names to x86_64-unknown-linux-gnu format... func_convert_file_noop checking how to convert x86_64-unknown-linux-gnu file names to toolchain format... func_convert_file_noop checking for /usr/bin/ld option to reload object files... -r checking for objdump... objdump checking how to recognize dependent libraries... pass_all checking for dlltool... no checking how to associate runtime and link libraries... printf %s\n checking for ar... ar checking for archiver @FILE support... @ checking for strip... strip checking for ranlib... ranlib checking command to parse /usr/bin/nm -B output from gcc object... ok checking for sysroot... no checking for mt... mt checking if mt is a manifest tool... no checking for dlfcn.h... yes checking for objdir... .libs checking if gcc supports -fno-rtti -fno-exceptions... no checking for gcc option to produce PIC... 
-fPIC -DPIC checking if gcc PIC flag -fPIC -DPIC works... yes checking if gcc static flag -static works... yes checking if gcc supports -c -o file.o... yes checking if gcc supports -c -o file.o... (cached) yes checking whether the gcc linker (/usr/bin/ld -m elf_x86_64) supports shared libraries... yes checking whether -lc should be explicitly linked in... no checking dynamic linker characteristics... GNU/Linux ld.so checking how to hardcode library paths into programs... immediate checking whether stripping libraries is possible... yes checking if libtool supports shared libraries... yes checking whether to build shared libraries... yes checking whether to build static libraries... yes checking how to run the C++ preprocessor... g++ -E checking for ld used by g++... /usr/bin/ld -m elf_x86_64 checking if the linker (/usr/bin/ld -m elf_x86_64) is GNU ld... yes checking whether the g++ linker (/usr/bin/ld -m elf_x86_64) supports shared libraries... yes checking for g++ option to produce PIC... -fPIC -DPIC checking if g++ PIC flag -fPIC -DPIC works... yes checking if g++ static flag -static works... yes checking if g++ supports -c -o file.o... yes checking if g++ supports -c -o file.o... (cached) yes checking whether the g++ linker (/usr/bin/ld -m elf_x86_64) supports shared libraries... yes checking dynamic linker characteristics... (cached) GNU/Linux ld.so checking how to hardcode library paths into programs... immediate checking for cos in -lm... yes checking for C/C++ restrict keyword... __restrict checking whether strerror_r is declared... yes checking for strerror_r... yes checking whether strerror_r returns char *... yes checking for pkg-config... /usr/bin/pkg-config checking if ln -s supports --relative... yes checking for dot... yes checking for doxygen... no configure: WARNING: doxygen not found - will not generate any doxygen documentation ../configure: line 17916: doxygen: command not found configure: WARNING: doxygen version is bad. 
Required version: 1.8.6 and above checking for perl... /usr/bin/perl checking for size_t... yes checking compiler flag -diag-error 10006... no checking compiler flag -diag-error 10148... no checking whether -diag-disable 1478 overrides deprecated declarations... no checking whether -Wno-deprecated-declarations overrides deprecated declarations... yes checking compiler flag -diag-disable 269... no checking compiler flag -fmax-type-align=16... no configure: Detected CPU implementation: configure: Detected CPU architecture: configure: Detected CPU variant: configure: Detected CPU part: checking for __attribute__(optimize)... 1 checking compiler flag -funwind-tables... yes configure: compiling with unwind tables checking if g++ works... yes checking c++11 support... yes checking gnu++11 support... yes checking whether _GLIBCXX_NOTHROW is declared... yes checking compiler flag --display_error_number... no checking compiler flag --diag_suppress 1... no checking compiler flag --diag_suppress 68... no checking compiler flag --diag_suppress 111... no checking compiler flag --diag_suppress 167... no checking compiler flag --diag_suppress 181... no checking compiler flag --diag_suppress 188... no checking compiler flag --diag_suppress 381... no checking compiler flag --diag_suppress 1215... no checking compiler flag --diag_suppress 1901... no checking compiler flag --diag_suppress 1902... no checking compiler flag -pedantic... yes checking compiler flag -Wl,-dynamic-list-data... yes checking compiler flag -Wno-missing-field-initializers... yes checking compiler flag -Wno-unused-parameter... yes checking compiler flag -Wno-unused-label... yes checking compiler flag -Wno-long-long... yes checking compiler flag -Wno-endif-labels... yes checking compiler flag -Wno-sign-compare... yes checking compiler flag -Wno-multichar... yes checking compiler flag -Wno-deprecated-declarations... yes checking compiler flag -Winvalid-pch... yes checking compiler flag -Wno-pointer-sign... 
yes checking compiler flag -Werror-implicit-function-declaration... yes checking compiler flag -Wno-format-zero-length... yes checking compiler flag -Wnested-externs... yes checking compiler flag -Wshadow... yes checking compiler flag -Werror=declaration-after-statement... yes checking for working alloca.h... yes checking for alloca... yes checking for shm_open in -lrt... yes checking for timer_create in -lrt... yes checking libgen.h usability... yes checking libgen.h presence... yes checking for libgen.h... yes checking whether asprintf is declared... yes checking whether basename is declared... yes checking whether fmemopen is declared... yes checking sys/cpuset.h usability... no checking sys/cpuset.h presence... no checking for sys/cpuset.h... no checking whether CPU_ZERO is declared... yes checking whether CPU_ISSET is declared... yes checking for cpu_set_t... yes checking for cpuset_t... no checking for sighandler_t... yes checking for __sighandler_t... yes checking pthread_np.h usability... no checking pthread_np.h presence... no checking for pthread_np.h... no checking for library containing pthread_create... none required checking for library containing pthread_atfork... none required checking for clearenv... yes checking for malloc_trim... yes checking for memalign... yes checking for posix_memalign... yes checking for mremap... yes checking for sched_setaffinity... yes checking for sched_getaffinity... yes checking for cpuset_setaffinity... no checking for cpuset_getaffinity... no checking whether F_SETOWN_EX is declared... yes checking whether ethtool_cmd_speed is declared... yes checking whether SPEED_UNKNOWN is declared... yes checking sys/platform/ppc.h usability... no checking sys/platform/ppc.h presence... no checking for sys/platform/ppc.h... no checking whether __ppc_get_timebase_freq is declared... no checking whether __ppc_get_timebase is declared... no checking for using Google C++ Testing Framework... no checking malloc hooks... 
no configure: WARNING: malloc hooks are not supported checking sys/capability.h usability... no checking sys/capability.h presence... no checking for sys/capability.h... no checking whether PR_SET_PTRACER is declared... yes checking for struct in6_addr.s6_addr32... yes checking for struct in6_addr.__u6_addr.__u6_addr32... no checking for struct iphdr.daddr.s_addr... no checking for struct ip.ip_dst.s_addr... yes checking for struct sigevent._sigev_un._tid... yes checking for struct sigevent.sigev_notify_thread_id... no checking for struct sigaction.sa_restorer... yes checking sys/epoll.h usability... yes checking sys/epoll.h presence... yes checking for sys/epoll.h... yes checking sys/eventfd.h usability... yes checking sys/eventfd.h presence... yes checking for sys/eventfd.h... yes checking sys/event.h usability... no checking sys/event.h presence... no checking for sys/event.h... no checking sys/thr.h usability... no checking sys/thr.h presence... no checking for sys/thr.h... no checking malloc.h usability... yes checking malloc.h presence... yes checking for malloc.h... yes checking malloc_np.h usability... no checking malloc_np.h presence... no checking for malloc_np.h... no checking endian.h, usability... no checking endian.h, presence... no checking for endian.h,... no checking sys/endian.h usability... no checking sys/endian.h presence... no checking for sys/endian.h... no checking linux/mman.h usability... yes checking linux/mman.h presence... yes checking for linux/mman.h... yes checking linux/ip.h usability... yes checking linux/ip.h presence... yes checking for linux/ip.h... yes checking linux/futex.h usability... yes checking linux/futex.h presence... yes checking for linux/futex.h... yes checking for net/ethernet.h... yes checking for netinet/ip.h... yes configure: Memory allocator is ptmalloc-2.8.6 version checking for malloc_get_state... no checking for malloc_set_state... no checking whether MADV_FREE is declared... 
yes checking whether MADV_REMOVE is declared... yes checking whether POSIX_MADV_DONTNEED is declared... yes checking whether getauxval is declared... yes checking whether SYS_mmap is declared... yes checking whether SYS_munmap is declared... yes checking whether SYS_mremap is declared... yes checking whether SYS_brk is declared... yes checking whether SYS_madvise is declared... yes checking whether SYS_shmat is declared... yes checking whether SYS_shmdt is declared... yes checking whether SYS_ipc is declared... no checking for __curbrk... yes checking for tc_malloc in -ltcmalloc... no Package fuse3 was not found in the pkg-config search path. Perhaps you should add the directory containing `fuse3.pc' to the PKG_CONFIG_PATH environment variable No package 'fuse3' found Package fuse3 was not found in the pkg-config search path. Perhaps you should add the directory containing `fuse3.pc' to the PKG_CONFIG_PATH environment variable No package 'fuse3' found Package fuse3 was not found in the pkg-config search path. Perhaps you should add the directory containing `fuse3.pc' to the PKG_CONFIG_PATH environment variable No package 'fuse3' found checking whether fuse_open_channel is declared... no checking whether fuse_mount is declared... no checking whether fuse_unmount is declared... no checking for go... no configure: WARNING: Disabling GO support - GO compiler version 1.16 or newer not found. checking for mvn... no checking for java... yes configure: WARNING: Disabling Java support - java or mvn not in path. checking cuda.h usability... yes checking cuda.h presence... yes checking for cuda.h... yes checking cuda_runtime.h usability... yes checking cuda_runtime.h presence... yes checking for cuda_runtime.h... yes checking for cuDeviceGetUuid in -lcuda... yes checking for cudaGetDeviceCount in -lcudart... yes checking nvml.h usability... yes checking nvml.h presence... yes checking for nvml.h... yes checking for nvmlInit in -lnvidia-ml... 
yes checking for cudaGetDeviceCount in -lcudart_static... no configure: ROCm path was not specified. Guessing ... checking hsa.h usability... no checking hsa.h presence... no checking for hsa.h... no configure: WARNING: ROCm not found checking for hsa_amd_portable_export_dmabuf... no checking for hipFree in -lhip_hcc... no checking hip_runtime.h usability... no checking hip_runtime.h presence... no checking for hip_runtime.h... no configure: WARNING: HIP Runtime not found checking whether inotify_init is declared... yes checking whether inotify_add_watch is declared... yes checking whether IN_ATTRIB is declared... yes checking for bfd_openr in -lbfd... no checking for bfd_openr in -lbfd... no checking for bfd_openr in -lbfd... no checking bfd.h usability... no checking bfd.h presence... no checking for bfd.h... no checking for struct dl_phdr_info... yes checking __attribute__((constructor))... yes configure: enabling builtin memcpy checking for __clear_cache... yes checking for __aarch64_sync_cache_range... no checking gdrapi.h usability... no checking gdrapi.h presence... no checking for gdrapi.h... no configure: WARNING: GDR_COPY not found configure: Compiling with verbs support from /usr checking infiniband/verbs.h usability... yes checking infiniband/verbs.h presence... yes checking for infiniband/verbs.h... yes checking for ibv_get_device_list in -libverbs... yes checking whether ibv_wc_status_str is declared... yes checking whether ibv_event_type_str is declared... yes checking whether ibv_query_gid is declared... yes checking whether ibv_get_device_name is declared... yes checking whether ibv_create_srq is declared... yes checking whether ibv_get_async_event is declared... yes checking whether IBV_CREATE_CQ_ATTR_IGNORE_OVERRUN is declared... yes configure: Checking for DV bare-metal support checking for mlx5dv_query_device in -lmlx5-rdmav2... no checking for mlx5dv_query_device in -lmlx5... yes checking for infiniband/mlx5dv.h... 
yes checking whether mlx5dv_init_obj is declared... yes checking whether mlx5dv_create_qp is declared... yes checking whether mlx5dv_is_supported is declared... yes checking whether mlx5dv_devx_subscribe_devx_event is declared... yes checking whether MLX5DV_CQ_INIT_ATTR_MASK_COMPRESSED_CQE is declared... yes checking whether MLX5DV_CQ_INIT_ATTR_MASK_CQE_SIZE is declared... yes checking whether MLX5DV_QP_CREATE_ALLOW_SCATTER_TO_CQE is declared... yes checking whether MLX5DV_UAR_ALLOC_TYPE_BF is declared... yes checking whether MLX5DV_UAR_ALLOC_TYPE_NC is declared... yes checking whether mlx5dv_devx_umem_reg_ex is declared... yes checking for struct mlx5dv_cq.cq_uar... yes checking whether MLX5DV_OBJ_AH is declared... yes checking whether MLX5DV_DCTYPE_DCT is declared... yes checking whether ibv_alloc_td is declared... yes checking whether MLX5DV_CONTEXT_FLAGS_DEVX is declared... yes checking whether IBV_LINK_LAYER_INFINIBAND is declared... yes checking whether IBV_LINK_LAYER_ETHERNET is declared... yes checking whether IBV_EVENT_GID_CHANGE is declared... yes checking whether IBV_TRANSPORT_USNIC is declared... yes checking whether IBV_TRANSPORT_USNIC_UDP is declared... yes checking whether IBV_TRANSPORT_UNSPECIFIED is declared... yes checking whether ibv_create_qp_ex is declared... yes checking whether ibv_create_cq_ex is declared... yes checking whether ibv_create_srq_ex is declared... yes checking whether ibv_reg_dmabuf_mr is declared... yes checking whether ibv_set_ece is declared... yes checking whether ibv_query_device_ex is declared... yes checking for struct ibv_device_attr_ex.pci_atomic_caps... yes checking whether IBV_ACCESS_RELAXED_ORDERING is declared... yes checking whether IBV_ACCESS_ON_DEMAND is declared... yes checking whether IBV_QPF_GRH_REQUIRED is declared... yes checking whether ibv_advise_mr is declared... yes checking for struct mlx5_wqe_av.base... no checking for struct mlx5_grh_av.rmac... no checking for struct mlx5_cqe64.ib_stride_index... 
no checking for struct ibv_tmh.tag... yes checking for struct ibv_tm_caps.flags... yes checking whether ibv_alloc_dm is declared... yes configure: Checking OFED valgrind libs /usr/lib64/mlnx_ofed/valgrind checking /usr/include/rdma/rdma_cma.h usability... no checking /usr/include/rdma/rdma_cma.h presence... no checking for /usr/include/rdma/rdma_cma.h... no configure: WARNING: RDMACM requested but required file (rdma/rdma_cma.h) could not be found in /usr checking sys/uio.h usability... yes checking sys/uio.h presence... yes checking for sys/uio.h... yes checking for process_vm_readv... yes configure: KNEM path was not found, guessing ... Package knem was not found in the pkg-config search path. Perhaps you should add the directory containing `knem.pc' to the PKG_CONFIG_PATH environment variable No package 'knem' found checking whether KNEM_CMD_GET_INFO is declared... no configure: WARNING: KNEM requested but required file (knem_io.h) could not be found configure: XPMEM - failed to open the requested location (guess), guessing ... checking cray-ugni... no checking whether IPPROTO_TCP is declared... yes checking whether SOL_SOCKET is declared... yes checking whether SO_KEEPALIVE is declared... yes checking whether TCP_KEEPCNT is declared... yes checking whether TCP_KEEPIDLE is declared... yes checking whether TCP_KEEPINTVL is declared... yes checking compiler flag -fno-exceptions... yes checking compiler flag -fno-rtti... yes checking compiler flag --no_exceptions... no checking compiler flag -fno-tree-vectorize... yes checking compiler flag --diag_suppress 186... no checking compiler flag --diag_suppress 236... no checking that generated files are newer than configure... 
done configure: creating ./config.status config.status: creating src/ucm/cuda/Makefile config.status: creating src/ucm/rocm/Makefile config.status: creating src/ucm/Makefile config.status: creating src/ucs/vfs/sock/Makefile config.status: creating src/ucs/vfs/fuse/Makefile config.status: creating src/ucs/vfs/fuse/ucx-fuse.pc config.status: creating src/ucs/Makefile config.status: creating src/ucs/signal/Makefile config.status: creating src/ucs/ucx-ucs.pc config.status: creating src/uct/cuda/gdr_copy/Makefile config.status: creating src/uct/cuda/gdr_copy/ucx-gdrcopy.pc config.status: creating src/uct/cuda/Makefile config.status: creating src/uct/cuda/ucx-cuda.pc config.status: creating src/uct/ib/rdmacm/Makefile config.status: creating src/uct/ib/rdmacm/ucx-rdmacm.pc config.status: creating src/uct/ib/Makefile config.status: creating src/uct/ib/ucx-ib.pc config.status: creating src/uct/rocm/Makefile config.status: creating src/uct/rocm/ucx-rocm.pc config.status: creating src/uct/sm/scopy/cma/Makefile config.status: creating src/uct/sm/scopy/cma/ucx-cma.pc config.status: creating src/uct/sm/scopy/knem/Makefile config.status: creating src/uct/sm/scopy/knem/ucx-knem.pc config.status: creating src/uct/sm/scopy/Makefile config.status: creating src/uct/sm/mm/xpmem/Makefile config.status: creating src/uct/sm/mm/xpmem/ucx-xpmem.pc config.status: creating src/uct/sm/mm/Makefile config.status: creating src/uct/sm/Makefile config.status: creating src/uct/ugni/Makefile config.status: creating src/uct/ugni/ucx-ugni.pc config.status: creating src/uct/Makefile config.status: creating src/uct/ucx-uct.pc config.status: creating src/tools/perf/lib/Makefile config.status: creating src/tools/perf/cuda/Makefile config.status: creating src/tools/perf/rocm/Makefile config.status: creating src/tools/perf/Makefile config.status: creating test/gtest/common/googletest/Makefile config.status: creating test/gtest/ucm/test_dlopen/Makefile config.status: creating 
test/gtest/ucm/test_dlopen/rpath-subdir/Makefile config.status: creating test/gtest/ucs/test_module/Makefile config.status: creating test/gtest/Makefile config.status: creating test/apps/uct_info/Makefile config.status: creating Makefile config.status: creating docs/doxygen/header.tex config.status: creating src/uct/api/version.h config.status: creating ucx.spec config.status: creating ucx.pc config.status: creating contrib/rpmdef.sh config.status: creating debian/rules config.status: creating debian/control config.status: creating debian/changelog config.status: creating src/ucp/Makefile config.status: creating src/ucp/api/ucp_version.h config.status: creating src/ucp/core/ucp_version.c config.status: creating src/tools/vfs/Makefile config.status: creating src/tools/info/Makefile config.status: creating src/tools/profile/Makefile config.status: creating test/apps/Makefile config.status: creating test/apps/iodemo/Makefile config.status: creating test/apps/sockaddr/Makefile config.status: creating test/apps/profiling/Makefile config.status: creating test/mpi/Makefile config.status: creating bindings/go/Makefile config.status: creating bindings/java/Makefile config.status: creating bindings/java/pom.xml config.status: creating bindings/java/src/main/native/Makefile config.status: creating examples/Makefile config.status: creating cmake/Makefile config.status: creating cmake/ucx-config-version.cmake config.status: creating cmake/ucx-config.cmake config.status: creating cmake/ucx-targets.cmake config.status: creating test/mpi/run_mpi.sh config.status: creating config.h config.status: executing depfiles commands config.status: executing libtool commands configure: ========================================================= configure: UCX build configuration: configure: Build prefix: /home/localadmin/dal/local/ucx-1.15.0 configure: Configuration dir: ${prefix}/etc/ucx configure: Preprocessor flags: -DCPU_FLAGS="" -I${abs_top_srcdir}/src -I${abs_top_builddir} 
-I${abs_top_builddir}/src configure: C compiler: gcc -O3 -g -Wall -Werror -funwind-tables -Wno-missing-field-initializers -Wno-unused-parameter -Wno-unused-label -Wno-long-long -Wno-endif-labels -Wno-sign-compare -Wno-multichar -Wno-deprecated-declarations -Winvalid-pch -Wno-pointer-sign -Werror-implicit-function-declaration -Wno-format-zero-length -Wnested-externs -Wshadow -Werror=declaration-after-statement configure: C++ compiler: g++ -O3 -g -Wall -Werror -funwind-tables -Wno-missing-field-initializers -Wno-unused-parameter -Wno-unused-label -Wno-long-long -Wno-endif-labels -Wno-sign-compare -Wno-multichar -Wno-deprecated-declarations -Winvalid-pch configure: Multi-thread: enabled configure: MPI tests: disabled configure: VFS support: no configure: Devel headers: no configure: io_demo CUDA support: no configure: Bindings: < > configure: UCS modules: < > configure: UCT modules: < cuda ib cma > configure: CUDA modules: < > configure: ROCM modules: < > configure: IB modules: < > configure: UCM modules: < cuda > configure: Perf modules: < cuda > configure: ========================================================= ```
roiedanino commented 6 months ago

Hi @wsdal,

  1. Can you please provide a minimal main function that reproduces the issue?
  2. Does the issue also occur when UCX_MEMTYPE_CACHE=n is not set?

Thanks

wsdal commented 6 months ago

Hey @roiedanino

  1. I'll try to put one together, but the problem I keep running into is that minimal implementations work fine. I was handed a large application that was originally written against Open MPI 1.10, and I'm reworking it for Open UCX + Open MPI. I'm not sure whether this information provides any insight.
  2. The issue occurs regardless of the UCX_MEMTYPE_CACHE setting.

Some additional information: the application runs without a segmentation fault and only prints those errors. If they can be ignored or suppressed, that would be acceptable for now.

Q: If I use any optimization such as -O1 or -O2, I end up with a segmentation fault instead. Could this be related to the UCX configuration?

wsdal commented 6 months ago

Still trying to put together a minimal main function to reproduce the issue. A deeper dive into my application shows the error occurs when the MCA framework for the OSHMEM heap memory is closed.

Framework info:
        framework_project = oshmem
        framework_name = memheap
        framework_description = OSHMEM MEMHEAP
[1709138618.390228] [GPU-LNX-CLUSTER-01:1582149:0]            event.c:244  UCX  TRACE ucm_vm_munmap(addr=0x7f86d5e00000 length=270532608)
[1709138618.390230] [GPU-LNX-CLUSTER-01:1582149:0]            event.h:73   UCX  TRACE vm_unmap addr=0x7f86d5e00000 length=270532608
[1709138618.390256] [GPU-LNX-CLUSTER-01:1582149:0]    cuda_copy_md.c:182  UCX  ERROR cudaHostUnregister(address)() failed: pointer does not correspond to a registered memory region
[1709138618.390262] [GPU-LNX-CLUSTER-01:1582149:0]          ucp_mm.c:332  UCX  WARN  failed to dereg from md[4]=cuda_cpy: Input/output error
[1709138618.390268] [GPU-LNX-CLUSTER-01:1582149:0]    cuda_copy_md.c:182  UCX  ERROR cudaHostUnregister(address)() failed: pointer does not correspond to a registered memory region
[1709138618.390270] [GPU-LNX-CLUSTER-01:1582149:0]          ucp_mm.c:332  UCX  WARN  failed to dereg from md[4]=cuda_cpy: Input/output error

During the shmem_init call, I see src/ucm/cuda/cudamem.c dispatching initial memory type allocations, but the pointer passed to cudaHostUnregister doesn't appear to fall within any of these memory regions.
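
The range check described above can be scripted rather than done by eye. The following is a minimal sketch, not part of UCX: the helper names `parse_ranges` and `find_range` are hypothetical, the ranges are assumed to be half-open (`start..end` with `end` excluded), and the candidate address is assumed to be the one from the `ucm_vm_munmap` trace printed just before the error, since the log does not print the pointer actually passed to cudaHostUnregister.

```python
import re

# Matches the "dispatching initial memtype allocation for START..END" debug lines.
RANGE_RE = re.compile(
    r"initial memtype allocation for (0x[0-9a-f]+)\.\.(0x[0-9a-f]+)"
)


def parse_ranges(log_text):
    """Return a list of (start, end) integer pairs found in the log text."""
    return [(int(lo, 16), int(hi, 16)) for lo, hi in RANGE_RE.findall(log_text)]


def find_range(addr, ranges):
    """Return the first range containing addr, or None.

    Assumption: ranges are treated as half-open, i.e. start <= addr < end.
    """
    for lo, hi in ranges:
        if lo <= addr < hi:
            return (lo, hi)
    return None


# Three lines copied from the shmem_init debug output (prefix shortened).
SAMPLE_LOG = """
cudamem.c:376 UCX DEBUG dispatching initial memtype allocation for 0x200000000..0x300200000
cudamem.c:376 UCX DEBUG dispatching initial memtype allocation for 0x7f86f0000000..0x7f8700000000
cudamem.c:376 UCX DEBUG dispatching initial memtype allocation for 0x7f8702000000..0x7f8708000000
"""

ranges = parse_ranges(SAMPLE_LOG)

# Assumed candidate: the address in the ucm_vm_munmap trace preceding the error.
candidate = 0x7F86D5E00000
hit = find_range(candidate, ranges)

if hit is None:
    print(f"{candidate:#x} is not inside any dispatched range")
else:
    print(f"{candidate:#x} falls inside {hit[0]:#x}..{hit[1]:#x}")
```

To run it against a real trace, replace `SAMPLE_LOG` with the contents of the captured log file. With the three sample ranges above, the candidate address lies below the lowest `0x7f86...` range, which matches the observation that it is not among the dispatched regions.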

shmem_init debug output ``` shmem_init [1709138616.869569] [GPU-LNX-CLUSTER-01:1582149:0] ucp_context.c:2119 UCX INFO Version 1.15.0 (loaded from /home/localadmin/dal/local/ucx-1.15.0/lib/libucp.so.0) [1709138616.869948] [GPU-LNX-CLUSTER-01:1582149:0] replace.c:58 UCX TRACE ucm_override_sbrk() [1709138616.869963] [GPU-LNX-CLUSTER-01:1582149:0] event.c:390 UCX TRACE ucm_sbrk(increment=+163840) [1709138616.869989] [GPU-LNX-CLUSTER-01:1582149:0] event.h:61 UCX TRACE vm_map addr=0x5591c98ff000 length=163840 [1709138616.871725] [GPU-LNX-CLUSTER-01:1582149:0] install.c:292 UCX DEBUG testing mmap existing events 0x0 [1709138616.871736] [GPU-LNX-CLUSTER-01:1582149:0] install.c:298 UCX DEBUG mmap existing events test: got 0x0 out of 0x0 [1709138616.871741] [GPU-LNX-CLUSTER-01:1582149:0] event.c:538 UCX DEBUG mmap hooks are ready [1709138616.871745] [GPU-LNX-CLUSTER-01:1582149:0] malloc_hook.c:585 UCX DEBUG ucs_malloc_is_ready(before test): have 0x0/0x0 events; mmap_mode=2 hook_called=0 [1709138616.871750] [GPU-LNX-CLUSTER-01:1582149:0] event.c:548 UCX DEBUG malloc hooks are ready [1709138616.871760] [GPU-LNX-CLUSTER-01:1582149:0] cudamem.c:279 UCX DEBUG cuda memory hooks mode reloc is disabled for driver API [1709138616.871764] [GPU-LNX-CLUSTER-01:1582149:0] cudamem.c:279 UCX DEBUG cuda memory hooks mode reloc is disabled for runtime API [1709138616.871768] [GPU-LNX-CLUSTER-01:1582149:0] event.c:620 UCX DEBUG added user handler (func=0x7f871afe7cd0 arg=0x5591c98c02b0) for events=0x220000 prio=500 [1709138616.871850] [GPU-LNX-CLUSTER-01:1582149:0] replace.c:58 UCX TRACE ucm_override_sbrk() [1709138616.871856] [GPU-LNX-CLUSTER-01:1582149:0] event.c:390 UCX TRACE ucm_sbrk(increment=+192512) [1709138616.871869] [GPU-LNX-CLUSTER-01:1582149:0] event.h:61 UCX TRACE vm_map addr=0x5591c9927000 length=192512 [1709138616.876555] [GPU-LNX-CLUSTER-01:1582149:0] replace.c:58 UCX TRACE ucm_override_sbrk() [1709138616.876558] [GPU-LNX-CLUSTER-01:1582149:0] event.c:390 UCX TRACE 
ucm_sbrk(increment=+151552) [1709138616.876560] [GPU-LNX-CLUSTER-01:1582149:0] event.h:61 UCX TRACE vm_map addr=0x5591c9956000 length=151552 [1709138616.876652] [GPU-LNX-CLUSTER-01:1582149:0] ucp_worker.c:1855 UCX INFO 0x5591c98a20c0 self cfg#1 rma(self/memory) amo(self/memory) [1709138616.876678] [GPU-LNX-CLUSTER-01:1582149:0] install.c:292 UCX DEBUG testing mmap external events 0x20000 [1709138616.876682] [GPU-LNX-CLUSTER-01:1582149:0] install.c:163 UCX TRACE after p = mmap(((void *)0), ucm_get_page_size(), 0x1 | 0x2, 0x02 | 0x20, -1, 0): got 0x0/0x0 [1709138616.876686] [GPU-LNX-CLUSTER-01:1582149:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x7f87183c3000 length=4096) [1709138616.876687] [GPU-LNX-CLUSTER-01:1582149:0] event.h:73 UCX TRACE vm_unmap addr=0x7f87183c3000 length=4096 [1709138616.876698] [GPU-LNX-CLUSTER-01:1582149:0] install.c:168 UCX TRACE after p = mremap(p, ucm_get_page_size(), ucm_get_page_size() * 2, 1): got 0x20000/0x20000 [1709138616.876700] [GPU-LNX-CLUSTER-01:1582149:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x7f8717202000 length=8192) [1709138616.876701] [GPU-LNX-CLUSTER-01:1582149:0] event.h:73 UCX TRACE vm_unmap addr=0x7f8717202000 length=8192 [1709138616.876708] [GPU-LNX-CLUSTER-01:1582149:0] install.c:172 UCX TRACE after p = mremap(p, ucm_get_page_size() * 2, ucm_get_page_size(), 0): got 0x20000/0x20000 [1709138616.876709] [GPU-LNX-CLUSTER-01:1582149:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x7f8717202000 length=4096) [1709138616.876711] [GPU-LNX-CLUSTER-01:1582149:0] event.h:73 UCX TRACE vm_unmap addr=0x7f8717202000 length=4096 [1709138616.876715] [GPU-LNX-CLUSTER-01:1582149:0] install.c:176 UCX TRACE after p = mmap(p, ucm_get_page_size(), 0x1 | 0x2, 0x10 | 0x02 | 0x20, -1, 0): got 0x20000/0x0 [1709138616.876717] [GPU-LNX-CLUSTER-01:1582149:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x7f8717202000 length=4096) [1709138616.876718] [GPU-LNX-CLUSTER-01:1582149:0] event.h:73 UCX TRACE vm_unmap addr=0x7f8717202000 length=4096 
[1709138616.876720] [GPU-LNX-CLUSTER-01:1582149:0] install.c:179 UCX TRACE after munmap(p, ucm_get_page_size()): got 0x20000/0x20000 [1709138616.876726] [GPU-LNX-CLUSTER-01:1582149:0] install.c:190 UCX TRACE after p = shmat(shmid, ((void *)0), 0): got 0x0/0x0 [1709138616.876728] [GPU-LNX-CLUSTER-01:1582149:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x7f87183c3000 length=4096) [1709138616.876729] [GPU-LNX-CLUSTER-01:1582149:0] event.h:73 UCX TRACE vm_unmap addr=0x7f87183c3000 length=4096 [1709138616.876733] [GPU-LNX-CLUSTER-01:1582149:0] install.c:193 UCX TRACE after p = shmat(shmid, p, 040000): got 0x20000/0x20000 [1709138616.876853] [GPU-LNX-CLUSTER-01:1582149:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x7f87183c3000 length=4096) [1709138616.876855] [GPU-LNX-CLUSTER-01:1582149:0] event.h:73 UCX TRACE vm_unmap addr=0x7f87183c3000 length=4096 [1709138616.876860] [GPU-LNX-CLUSTER-01:1582149:0] install.c:197 UCX TRACE after shmdt(p): got 0x20000/0x20000 [1709138616.876864] [GPU-LNX-CLUSTER-01:1582149:0] install.c:227 UCX TRACE after p = mmap(((void *)0), ucm_get_page_size(), 0x1|0x2, 0x02|0x20, -1, 0): got 0x0/0x0 [1709138616.876866] [GPU-LNX-CLUSTER-01:1582149:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x7f87183c3000 length=4096) [1709138616.876867] [GPU-LNX-CLUSTER-01:1582149:0] event.h:73 UCX TRACE vm_unmap addr=0x7f87183c3000 length=4096 [1709138616.876870] [GPU-LNX-CLUSTER-01:1582149:0] install.c:231 UCX TRACE after madvise(p, ucm_get_page_size(), 4): got 0x20000/0x20000 [1709138616.876872] [GPU-LNX-CLUSTER-01:1582149:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x7f87183c3000 length=4096) [1709138616.876873] [GPU-LNX-CLUSTER-01:1582149:0] event.h:73 UCX TRACE vm_unmap addr=0x7f87183c3000 length=4096 [1709138616.876875] [GPU-LNX-CLUSTER-01:1582149:0] install.c:233 UCX TRACE after munmap(p, ucm_get_page_size()): got 0x20000/0x20000 [1709138616.876877] [GPU-LNX-CLUSTER-01:1582149:0] install.c:298 UCX DEBUG mmap external events test: got 0x20000 out of 0x20000 
[1709138616.877020] [GPU-LNX-CLUSTER-01:1582149:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x7f87183c3000 length=4096) [1709138616.877021] [GPU-LNX-CLUSTER-01:1582149:0] event.h:73 UCX TRACE vm_unmap addr=0x7f87183c3000 length=4096 [1709138616.877057] [GPU-LNX-CLUSTER-01:1582149:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0xff000000 length=270532608) [1709138616.877059] [GPU-LNX-CLUSTER-01:1582149:0] event.h:73 UCX TRACE vm_unmap addr=0xff000000 length=270532608 [1709138616.877066] [GPU-LNX-CLUSTER-01:1582149:0] install.c:292 UCX DEBUG testing mmap existing events 0x0 [1709138616.877067] [GPU-LNX-CLUSTER-01:1582149:0] install.c:298 UCX DEBUG mmap existing events test: got 0x0 out of 0x0 [1709138616.877069] [GPU-LNX-CLUSTER-01:1582149:0] event.c:538 UCX DEBUG mmap hooks are ready [1709138616.877070] [GPU-LNX-CLUSTER-01:1582149:0] malloc_hook.c:585 UCX DEBUG ucs_malloc_is_ready(before test): have 0x0/0x0 events; mmap_mode=2 hook_called=0 [1709138616.877072] [GPU-LNX-CLUSTER-01:1582149:0] event.c:548 UCX DEBUG malloc hooks are ready [1709138616.877073] [GPU-LNX-CLUSTER-01:1582149:0] cudamem.c:279 UCX DEBUG cuda memory hooks mode reloc is disabled for driver API [1709138616.877074] [GPU-LNX-CLUSTER-01:1582149:0] cudamem.c:279 UCX DEBUG cuda memory hooks mode reloc is disabled for runtime API [1709138616.877304] [GPU-LNX-CLUSTER-01:1582149:0] cudamem.c:376 UCX DEBUG dispatching initial memtype allocation for 0x200000000..0x300200000 [1709138616.877315] [GPU-LNX-CLUSTER-01:1582149:0] cudamem.c:376 UCX DEBUG dispatching initial memtype allocation for 0x7f86f0000000..0x7f8700000000 [1709138616.877318] [GPU-LNX-CLUSTER-01:1582149:0] cudamem.c:376 UCX DEBUG dispatching initial memtype allocation for 0x7f8702000000..0x7f8708000000 [1709138616.877320] [GPU-LNX-CLUSTER-01:1582149:0] cudamem.c:376 UCX DEBUG dispatching initial memtype allocation for 0x7f8708021000..0x7f870c000000 [1709138616.877324] [GPU-LNX-CLUSTER-01:1582149:0] cudamem.c:376 UCX DEBUG dispatching initial 
memtype allocation for 0x7f870cddd000..0x7f870cdde000 [1709138616.877326] [GPU-LNX-CLUSTER-01:1582149:0] cudamem.c:376 UCX DEBUG dispatching initial memtype allocation for 0x7f870d5de000..0x7f870e600000 [1709138616.877329] [GPU-LNX-CLUSTER-01:1582149:0] cudamem.c:376 UCX DEBUG dispatching initial memtype allocation for 0x7f870efb4000..0x7f870f1b4000 /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.545.23.08 [1709138616.877333] [GPU-LNX-CLUSTER-01:1582149:0] cudamem.c:376 UCX DEBUG dispatching initial memtype allocation for 0x7f8710021000..0x7f8714000000 [1709138616.877337] [GPU-LNX-CLUSTER-01:1582149:0] cudamem.c:376 UCX DEBUG dispatching initial memtype allocation for 0x7f8714606000..0x7f8714805000 /usr/local/cuda-12.3/targets/x86_64-linux/lib/libOpenCL.so.1.0.0 [1709138616.877353] [GPU-LNX-CLUSTER-01:1582149:0] cudamem.c:376 UCX DEBUG dispatching initial memtype allocation for 0x7f8716ca7000..0x7f8716cb7000 /dev/nvidia0 [1709138616.877357] [GPU-LNX-CLUSTER-01:1582149:0] cudamem.c:376 UCX DEBUG dispatching initial memtype allocation for 0x7f8716d7a000..0x7f8716d7b000 /home/localadmin/dal/local/ucx-1.15.0/lib/ucx/libuct_ib.so.0.0.0 [1709138616.877361] [GPU-LNX-CLUSTER-01:1582149:0] cudamem.c:376 UCX DEBUG dispatching initial memtype allocation for 0x7f8716d97000..0x7f8716d98000 /home/localadmin/dal/local/ucx-1.15.0/lib/ucx/libuct_cuda.so.0.0.0 [1709138616.877365] [GPU-LNX-CLUSTER-01:1582149:0] cudamem.c:376 UCX DEBUG dispatching initial memtype allocation for 0x7f8716dbd000..0x7f8716dbe000 /usr/lib/x86_64-linux-gnu/libbsd.so.0.11.5 [1709138616.877374] [GPU-LNX-CLUSTER-01:1582149:0] cudamem.c:376 UCX DEBUG dispatching initial memtype allocation for 0x7f87173ff000..0x7f8717400000 [1709138616.877378] [GPU-LNX-CLUSTER-01:1582149:0] cudamem.c:376 UCX DEBUG dispatching initial memtype allocation for 0x7f8717dea000..0x7f8717deb000 /home/localadmin/dal/local/openmpi-5.0.2/lib/libpmix.so.2.9.4 [1709138616.877385] [GPU-LNX-CLUSTER-01:1582149:0] cudamem.c:376 UCX DEBUG dispatching 
initial memtype allocation for 0x7f8717ff4000..0x7f8717ff5000 /usr/lib/x86_64-linux-gnu/libxml2.so.2.9.13 [1709138616.877391] [GPU-LNX-CLUSTER-01:1582149:0] cudamem.c:376 UCX DEBUG dispatching initial memtype allocation for 0x7f8718215000..0x7f8718216000 /usr/lib/x86_64-linux-gnu/libc.so.6 [1709138616.877398] [GPU-LNX-CLUSTER-01:1582149:0] cudamem.c:376 UCX DEBUG dispatching initial memtype allocation for 0x7f8718375000..0x7f8718376000 /usr/lib/x86_64-linux-gnu/libX11.so.6.4.0 [1709138616.877403] [GPU-LNX-CLUSTER-01:1582149:0] cudamem.c:376 UCX DEBUG dispatching initial memtype allocation for 0x7f87183bf000..0x7f87183c0000 /usr/lib/x86_64-linux-gnu/libz.so.1.2.11 [1709138616.877411] [GPU-LNX-CLUSTER-01:1582149:0] cudamem.c:376 UCX DEBUG dispatching initial memtype allocation for 0x7f871861a000..0x7f871861b000 /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30 [1709138616.877419] [GPU-LNX-CLUSTER-01:1582149:0] cudamem.c:376 UCX DEBUG dispatching initial memtype allocation for 0x7f871866b000..0x7f871866c000 /usr/lib/x86_64-linux-gnu/libudev.so.1.7.2 [1709138616.877423] [GPU-LNX-CLUSTER-01:1582149:0] cudamem.c:376 UCX DEBUG dispatching initial memtype allocation for 0x7f8718689000..0x7f871868a000 /home/localadmin/dal/local/ucx-1.15.0/lib/libucm.so.0.0.0 [1709138616.877426] [GPU-LNX-CLUSTER-01:1582149:0] cudamem.c:376 UCX DEBUG dispatching initial memtype allocation for 0x7f87187f7000..0x7f87187f8000 /home/localadmin/dal/local/ucx-1.15.0/lib/libucp.so.0.0.0 [1709138616.877430] [GPU-LNX-CLUSTER-01:1582149:0] cudamem.c:376 UCX DEBUG dispatching initial memtype allocation for 0x7f87188a6000..0x7f8718aa6000 /usr/local/cuda-12.3/targets/x86_64-linux/lib/libcudart.so.12.3.101 [1709138616.877437] [GPU-LNX-CLUSTER-01:1582149:0] cudamem.c:376 UCX DEBUG dispatching initial memtype allocation for 0x7f8718b08000..0x7f8718b09000 /home/localadmin/dal/local/ucx-1.15.0/lib/libuct.so.0.0.0 [1709138616.877446] [GPU-LNX-CLUSTER-01:1582149:0] cudamem.c:376 UCX DEBUG dispatching initial memtype 
allocation for 0x7f871a6f8000..0x7f871a6f9000 /usr/lib/x86_64-linux-gnu/libcuda.so.545.23.08 [1709138616.877453] [GPU-LNX-CLUSTER-01:1582149:0] cudamem.c:376 UCX DEBUG dispatching initial memtype allocation for 0x7f871a91b000..0x7f871a91c000 /usr/lib/x86_64-linux-gnu/libevent_core-2.1.so.7.0.1 [1709138616.877457] [GPU-LNX-CLUSTER-01:1582149:0] cudamem.c:376 UCX DEBUG dispatching initial memtype allocation for 0x7f871aa20000..0x7f871aa21000 /home/localadmin/dal/local/openmpi-5.0.2/lib/libopen-pal.so.80.0.2 [1709138616.877461] [GPU-LNX-CLUSTER-01:1582149:0] cudamem.c:376 UCX DEBUG dispatching initial memtype allocation for 0x7f871abe9000..0x7f871abea000 /home/localadmin/dal/local/openmpi-5.0.2/lib/liboshmem.so.40.40.1 [1709138616.877465] [GPU-LNX-CLUSTER-01:1582149:0] cudamem.c:376 UCX DEBUG dispatching initial memtype allocation for 0x7f871af1d000..0x7f871af1e000 /home/localadmin/dal/local/openmpi-5.0.2/lib/libmpi.so.40.40.2 [1709138616.877482] [GPU-LNX-CLUSTER-01:1582149:0] event.c:620 UCX DEBUG added user handler (func=0x7f871afe5590 arg=0x5591c994a210) for events=0x300000 prio=1000 [1709138616.877656] [GPU-LNX-CLUSTER-01:1582149:9] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x7f86ec000000 length=67108864) [1709138616.877659] [GPU-LNX-CLUSTER-01:1582149:9] event.h:73 UCX TRACE vm_unmap addr=0x7f86ec000000 length=67108864 [1709138616.878130] [GPU-LNX-CLUSTER-01:1582149:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x7f86ec001000 length=33550336) [1709138616.878133] [GPU-LNX-CLUSTER-01:1582149:0] event.h:73 UCX TRACE vm_unmap addr=0x7f86ec001000 length=33550336 [1709138616.878195] [GPU-LNX-CLUSTER-01:1582149:0] replace.c:58 UCX TRACE ucm_override_sbrk() [1709138616.878197] [GPU-LNX-CLUSTER-01:1582149:0] event.c:390 UCX TRACE ucm_sbrk(increment=+253952) [1709138616.878200] [GPU-LNX-CLUSTER-01:1582149:0] event.h:61 UCX TRACE vm_map addr=0x5591c997b000 length=253952 [1709138616.878441] [GPU-LNX-CLUSTER-01:1582149:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x200200000 
length=2097152) [1709138616.878444] [GPU-LNX-CLUSTER-01:1582149:0] event.h:73 UCX TRACE vm_unmap addr=0x200200000 length=2097152 [1709138616.887454] [GPU-LNX-CLUSTER-01:1582149:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x200400000 length=58720256) [1709138616.887457] [GPU-LNX-CLUSTER-01:1582149:0] event.h:73 UCX TRACE vm_unmap addr=0x200400000 length=58720256 [1709138616.888821] [GPU-LNX-CLUSTER-01:1582149:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x7f87007de000 length=4096) [1709138616.888823] [GPU-LNX-CLUSTER-01:1582149:0] event.h:73 UCX TRACE vm_unmap addr=0x7f87007de000 length=4096 [1709138616.889321] [GPU-LNX-CLUSTER-01:1582149:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x204a00000 length=2097152) [1709138616.889324] [GPU-LNX-CLUSTER-01:1582149:0] event.h:73 UCX TRACE vm_unmap addr=0x204a00000 length=2097152 [1709138616.889604] [GPU-LNX-CLUSTER-01:1582149:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x204c00000 length=2097152) [1709138616.889607] [GPU-LNX-CLUSTER-01:1582149:0] event.h:73 UCX TRACE vm_unmap addr=0x204c00000 length=2097152 [1709138616.889718] [GPU-LNX-CLUSTER-01:1582149:0] replace.c:58 UCX TRACE ucm_override_sbrk() [1709138616.889720] [GPU-LNX-CLUSTER-01:1582149:0] event.c:390 UCX TRACE ucm_sbrk(increment=+176128) [1709138616.889723] [GPU-LNX-CLUSTER-01:1582149:0] event.h:61 UCX TRACE vm_map addr=0x5591c99b9000 length=176128 [1709138616.889796] [GPU-LNX-CLUSTER-01:1582149:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x7f87007df000 length=4096) [1709138616.889798] [GPU-LNX-CLUSTER-01:1582149:0] event.h:73 UCX TRACE vm_unmap addr=0x7f87007df000 length=4096 [1709138616.890086] [GPU-LNX-CLUSTER-01:1582149:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x7f87007e0000 length=4096) [1709138616.890088] [GPU-LNX-CLUSTER-01:1582149:0] event.h:73 UCX TRACE vm_unmap addr=0x7f87007e0000 length=4096 [1709138616.890276] [GPU-LNX-CLUSTER-01:1582149:0] replace.c:58 UCX TRACE ucm_override_sbrk() [1709138616.890278] [GPU-LNX-CLUSTER-01:1582149:0] event.c:390 UCX 
TRACE ucm_sbrk(increment=+184320) [1709138616.890281] [GPU-LNX-CLUSTER-01:1582149:0] event.h:61 UCX TRACE vm_map addr=0x5591c99e4000 length=184320 [1709138616.890316] [GPU-LNX-CLUSTER-01:1582149:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x7f87007e1000 length=4096) [1709138616.890317] [GPU-LNX-CLUSTER-01:1582149:0] event.h:73 UCX TRACE vm_unmap addr=0x7f87007e1000 length=4096 [1709138616.890580] [GPU-LNX-CLUSTER-01:1582149:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x7f87007e2000 length=4096) [1709138616.890582] [GPU-LNX-CLUSTER-01:1582149:0] event.h:73 UCX TRACE vm_unmap addr=0x7f87007e2000 length=4096 [1709138616.890826] [GPU-LNX-CLUSTER-01:1582149:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x7f87007e3000 length=4096) [1709138616.890828] [GPU-LNX-CLUSTER-01:1582149:0] event.h:73 UCX TRACE vm_unmap addr=0x7f87007e3000 length=4096 [1709138616.890841] [GPU-LNX-CLUSTER-01:1582149:0] replace.c:58 UCX TRACE ucm_override_sbrk() [1709138616.890843] [GPU-LNX-CLUSTER-01:1582149:0] event.c:390 UCX TRACE ucm_sbrk(increment=+135168) [1709138616.890845] [GPU-LNX-CLUSTER-01:1582149:0] event.h:61 UCX TRACE vm_map addr=0x5591c9a11000 length=135168 [1709138616.891093] [GPU-LNX-CLUSTER-01:1582149:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x7f87007e4000 length=4096) [1709138616.891095] [GPU-LNX-CLUSTER-01:1582149:0] event.h:73 UCX TRACE vm_unmap addr=0x7f87007e4000 length=4096 [1709138616.891345] [GPU-LNX-CLUSTER-01:1582149:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x7f87007e5000 length=4096) [1709138616.891347] [GPU-LNX-CLUSTER-01:1582149:0] event.h:73 UCX TRACE vm_unmap addr=0x7f87007e5000 length=4096 [1709138616.893709] [GPU-LNX-CLUSTER-01:1582149:0] replace.c:58 UCX TRACE ucm_override_sbrk() [1709138616.893712] [GPU-LNX-CLUSTER-01:1582149:0] event.c:390 UCX TRACE ucm_sbrk(increment=+262144) [1709138616.893716] [GPU-LNX-CLUSTER-01:1582149:0] event.h:61 UCX TRACE vm_map addr=0x5591c9a32000 length=262144 [1709138616.893771] [GPU-LNX-CLUSTER-01:1582149:0] event.c:244 UCX 
TRACE ucm_vm_munmap(addr=0x7f87007e6000 length=4096) [1709138616.893773] [GPU-LNX-CLUSTER-01:1582149:0] event.h:73 UCX TRACE vm_unmap addr=0x7f87007e6000 length=4096 [1709138616.893999] [GPU-LNX-CLUSTER-01:1582149:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x7f87007e7000 length=4096) [1709138616.894002] [GPU-LNX-CLUSTER-01:1582149:0] event.h:73 UCX TRACE vm_unmap addr=0x7f87007e7000 length=4096 [1709138616.894212] [GPU-LNX-CLUSTER-01:1582149:0] replace.c:58 UCX TRACE ucm_override_sbrk() [1709138616.894220] [GPU-LNX-CLUSTER-01:1582149:0] event.c:390 UCX TRACE ucm_sbrk(increment=+184320) [1709138616.894224] [GPU-LNX-CLUSTER-01:1582149:0] event.h:61 UCX TRACE vm_map addr=0x5591c9a72000 length=184320 [1709138616.894277] [GPU-LNX-CLUSTER-01:1582149:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x7f87007e8000 length=4096) [1709138616.894279] [GPU-LNX-CLUSTER-01:1582149:0] event.h:73 UCX TRACE vm_unmap addr=0x7f87007e8000 length=4096 [1709138616.894544] [GPU-LNX-CLUSTER-01:1582149:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x7f87007e9000 length=4096) [1709138616.894546] [GPU-LNX-CLUSTER-01:1582149:0] event.h:73 UCX TRACE vm_unmap addr=0x7f87007e9000 length=4096 [1709138616.895021] [GPU-LNX-CLUSTER-01:1582149:0] replace.c:58 UCX TRACE ucm_override_sbrk() [1709138616.895024] [GPU-LNX-CLUSTER-01:1582149:0] event.c:390 UCX TRACE ucm_sbrk(increment=+208896) [1709138616.895028] [GPU-LNX-CLUSTER-01:1582149:0] event.h:61 UCX TRACE vm_map addr=0x5591c9a9f000 length=208896 [1709138616.895100] [GPU-LNX-CLUSTER-01:1582149:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x7f87007ea000 length=4096) [1709138616.895102] [GPU-LNX-CLUSTER-01:1582149:0] event.h:73 UCX TRACE vm_unmap addr=0x7f87007ea000 length=4096 [1709138616.895338] [GPU-LNX-CLUSTER-01:1582149:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x7f87007eb000 length=4096) [1709138616.895341] [GPU-LNX-CLUSTER-01:1582149:0] event.h:73 UCX TRACE vm_unmap addr=0x7f87007eb000 length=4096 [1709138616.895535] [GPU-LNX-CLUSTER-01:1582149:0] 
replace.c:58 UCX TRACE ucm_override_sbrk() [1709138616.895537] [GPU-LNX-CLUSTER-01:1582149:0] event.c:390 UCX TRACE ucm_sbrk(increment=+184320) [1709138616.895540] [GPU-LNX-CLUSTER-01:1582149:0] event.h:61 UCX TRACE vm_map addr=0x5591c9ad2000 length=184320 [1709138616.895583] [GPU-LNX-CLUSTER-01:1582149:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x7f87007ec000 length=4096) [1709138616.895584] [GPU-LNX-CLUSTER-01:1582149:0] event.h:73 UCX TRACE vm_unmap addr=0x7f87007ec000 length=4096 [1709138616.895815] [GPU-LNX-CLUSTER-01:1582149:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x7f87007ed000 length=4096) [1709138616.895817] [GPU-LNX-CLUSTER-01:1582149:0] event.h:73 UCX TRACE vm_unmap addr=0x7f87007ed000 length=4096 [1709138616.896239] [GPU-LNX-CLUSTER-01:1582149:0] replace.c:58 UCX TRACE ucm_override_sbrk() [1709138616.896242] [GPU-LNX-CLUSTER-01:1582149:0] event.c:390 UCX TRACE ucm_sbrk(increment=+208896) [1709138616.896245] [GPU-LNX-CLUSTER-01:1582149:0] event.h:61 UCX TRACE vm_map addr=0x5591c9aff000 length=208896 [1709138616.896314] [GPU-LNX-CLUSTER-01:1582149:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x7f87007ee000 length=4096) [1709138616.896316] [GPU-LNX-CLUSTER-01:1582149:0] event.h:73 UCX TRACE vm_unmap addr=0x7f87007ee000 length=4096 [1709138616.896544] [GPU-LNX-CLUSTER-01:1582149:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x7f87007ef000 length=4096) [1709138616.896547] [GPU-LNX-CLUSTER-01:1582149:0] event.h:73 UCX TRACE vm_unmap addr=0x7f87007ef000 length=4096 [1709138616.896709] [GPU-LNX-CLUSTER-01:1582149:0] replace.c:58 UCX TRACE ucm_override_sbrk() [1709138616.896711] [GPU-LNX-CLUSTER-01:1582149:0] event.c:390 UCX TRACE ucm_sbrk(increment=+184320) [1709138616.896715] [GPU-LNX-CLUSTER-01:1582149:0] event.h:61 UCX TRACE vm_map addr=0x5591c9b32000 length=184320 [1709138616.896752] [GPU-LNX-CLUSTER-01:1582149:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x7f87007f0000 length=4096) [1709138616.896754] [GPU-LNX-CLUSTER-01:1582149:0] event.h:73 UCX TRACE 
vm_unmap addr=0x7f87007f0000 length=4096 [1709138616.896964] [GPU-LNX-CLUSTER-01:1582149:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x7f87007f1000 length=4096) [1709138616.896966] [GPU-LNX-CLUSTER-01:1582149:0] event.h:73 UCX TRACE vm_unmap addr=0x7f87007f1000 length=4096 [1709138616.897526] [GPU-LNX-CLUSTER-01:1582149:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x204e00000 length=2097152) [1709138616.897529] [GPU-LNX-CLUSTER-01:1582149:0] event.h:73 UCX TRACE vm_unmap addr=0x204e00000 length=2097152 [1709138616.920583] [GPU-LNX-CLUSTER-01:1582149:0] replace.c:58 UCX TRACE ucm_override_sbrk() [1709138616.920587] [GPU-LNX-CLUSTER-01:1582149:0] event.c:390 UCX TRACE ucm_sbrk(increment=+135168) [1709138616.920593] [GPU-LNX-CLUSTER-01:1582149:0] event.h:61 UCX TRACE vm_map addr=0x5591c9b5f000 length=135168 [1709138616.921026] [GPU-LNX-CLUSTER-01:1582149:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x205000000 length=2097152) [1709138616.921029] [GPU-LNX-CLUSTER-01:1582149:0] event.h:73 UCX TRACE vm_unmap addr=0x205000000 length=2097152 [1709138616.921270] [GPU-LNX-CLUSTER-01:1582149:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x205200000 length=2097152) [1709138616.921273] [GPU-LNX-CLUSTER-01:1582149:0] event.h:73 UCX TRACE vm_unmap addr=0x205200000 length=2097152 [1709138616.921358] [GPU-LNX-CLUSTER-01:1582149:0] replace.c:58 UCX TRACE ucm_override_sbrk() [1709138616.921359] [GPU-LNX-CLUSTER-01:1582149:0] event.c:390 UCX TRACE ucm_sbrk(increment=+143360) [1709138616.921362] [GPU-LNX-CLUSTER-01:1582149:0] event.h:61 UCX TRACE vm_map addr=0x5591c9b80000 length=143360 [1709138616.921975] [GPU-LNX-CLUSTER-01:1582149:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x7f86ee400000 length=2097152) [1709138616.921978] [GPU-LNX-CLUSTER-01:1582149:0] event.h:73 UCX TRACE vm_unmap addr=0x7f86ee400000 length=2097152 [1709138616.922064] [GPU-LNX-CLUSTER-01:1582149:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x7f86ee600000 length=2097152) [1709138616.922065] 
[GPU-LNX-CLUSTER-01:1582149:0] event.h:73 UCX TRACE vm_unmap addr=0x7f86ee600000 length=2097152 [1709138616.922767] [GPU-LNX-CLUSTER-01:1582149:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x205600000 length=2097152) [1709138616.922770] [GPU-LNX-CLUSTER-01:1582149:0] event.h:73 UCX TRACE vm_unmap addr=0x205600000 length=2097152 [1709138616.922944] [GPU-LNX-CLUSTER-01:1582149:0] replace.c:58 UCX TRACE ucm_override_sbrk() [1709138616.922946] [GPU-LNX-CLUSTER-01:1582149:0] event.c:390 UCX TRACE ucm_sbrk(increment=+135168) [1709138616.922950] [GPU-LNX-CLUSTER-01:1582149:0] event.h:61 UCX TRACE vm_map addr=0x5591c9ba3000 length=135168 [1709138616.922992] [GPU-LNX-CLUSTER-01:1582149:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x7f86ee800000 length=2097152) [1709138616.922993] [GPU-LNX-CLUSTER-01:1582149:0] event.h:73 UCX TRACE vm_unmap addr=0x7f86ee800000 length=2097152 [1709138616.924022] [GPU-LNX-CLUSTER-01:1582149:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x205a00000 length=2097152) [1709138616.924025] [GPU-LNX-CLUSTER-01:1582149:0] event.h:73 UCX TRACE vm_unmap addr=0x205a00000 length=2097152 [1709138616.924712] [GPU-LNX-CLUSTER-01:1582149:0] replace.c:58 UCX TRACE ucm_override_sbrk() [1709138616.924715] [GPU-LNX-CLUSTER-01:1582149:0] event.c:390 UCX TRACE ucm_sbrk(increment=+135168) [1709138616.924719] [GPU-LNX-CLUSTER-01:1582149:0] event.h:61 UCX TRACE vm_map addr=0x5591c9bc4000 length=135168 [1709138616.924990] [GPU-LNX-CLUSTER-01:1582149:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x7f86ef000000 length=2097152) [1709138616.924992] [GPU-LNX-CLUSTER-01:1582149:0] event.h:73 UCX TRACE vm_unmap addr=0x7f86ef000000 length=2097152 [1709138616.925599] [GPU-LNX-CLUSTER-01:1582149:0] replace.c:58 UCX TRACE ucm_override_sbrk() [1709138616.925603] [GPU-LNX-CLUSTER-01:1582149:0] event.c:390 UCX TRACE ucm_sbrk(increment=+163840) [1709138616.925606] [GPU-LNX-CLUSTER-01:1582149:0] event.h:61 UCX TRACE vm_map addr=0x5591c9be5000 length=163840 [1709138616.925772] 
[GPU-LNX-CLUSTER-01:1582149:0] replace.c:58 UCX TRACE ucm_override_sbrk() [1709138616.925773] [GPU-LNX-CLUSTER-01:1582149:0] event.c:390 UCX TRACE ucm_sbrk(increment=+163840) [1709138616.925776] [GPU-LNX-CLUSTER-01:1582149:0] event.h:61 UCX TRACE vm_map addr=0x5591c9c0d000 length=163840 [1709138616.925899] [GPU-LNX-CLUSTER-01:1582149:0] replace.c:58 UCX TRACE ucm_override_sbrk() [1709138616.925900] [GPU-LNX-CLUSTER-01:1582149:0] event.c:390 UCX TRACE ucm_sbrk(increment=+135168) [1709138616.925902] [GPU-LNX-CLUSTER-01:1582149:0] event.h:61 UCX TRACE vm_map addr=0x5591c9c35000 length=135168 [1709138616.925989] [GPU-LNX-CLUSTER-01:1582149:0] replace.c:58 UCX TRACE ucm_override_sbrk() [1709138616.925991] [GPU-LNX-CLUSTER-01:1582149:0] event.c:390 UCX TRACE ucm_sbrk(increment=+204800) [1709138616.925993] [GPU-LNX-CLUSTER-01:1582149:0] event.h:61 UCX TRACE vm_map addr=0x5591c9c56000 length=204800 [1709138616.926078] [GPU-LNX-CLUSTER-01:1582149:0] replace.c:58 UCX TRACE ucm_override_sbrk() [1709138616.926079] [GPU-LNX-CLUSTER-01:1582149:0] event.c:390 UCX TRACE ucm_sbrk(increment=+143360) [1709138616.926081] [GPU-LNX-CLUSTER-01:1582149:0] event.h:61 UCX TRACE vm_map addr=0x5591c9c88000 length=143360 [1709138616.926165] [GPU-LNX-CLUSTER-01:1582149:0] replace.c:58 UCX TRACE ucm_override_sbrk() [1709138616.926166] [GPU-LNX-CLUSTER-01:1582149:0] event.c:390 UCX TRACE ucm_sbrk(increment=+139264) [1709138616.926168] [GPU-LNX-CLUSTER-01:1582149:0] event.h:61 UCX TRACE vm_map addr=0x5591c9cab000 length=139264 [1709138616.926257] [GPU-LNX-CLUSTER-01:1582149:0] replace.c:58 UCX TRACE ucm_override_sbrk() [1709138616.926259] [GPU-LNX-CLUSTER-01:1582149:0] event.c:390 UCX TRACE ucm_sbrk(increment=+147456) [1709138616.926261] [GPU-LNX-CLUSTER-01:1582149:0] event.h:61 UCX TRACE vm_map addr=0x5591c9ccd000 length=147456 [1709138616.926390] [GPU-LNX-CLUSTER-01:1582149:0] replace.c:58 UCX TRACE ucm_override_sbrk() [1709138616.926391] [GPU-LNX-CLUSTER-01:1582149:0] event.c:390 UCX TRACE 
ucm_sbrk(increment=+135168) [1709138616.926393] [GPU-LNX-CLUSTER-01:1582149:0] event.h:61 UCX TRACE vm_map addr=0x5591c9cf1000 length=135168 [1709138616.926435] [GPU-LNX-CLUSTER-01:1582149:0] replace.c:58 UCX TRACE ucm_override_sbrk() [1709138616.926436] [GPU-LNX-CLUSTER-01:1582149:0] event.c:390 UCX TRACE ucm_sbrk(increment=+151552) [1709138616.926438] [GPU-LNX-CLUSTER-01:1582149:0] event.h:61 UCX TRACE vm_map addr=0x5591c9d12000 length=151552 [1709138616.926541] [GPU-LNX-CLUSTER-01:1582149:0] replace.c:58 UCX TRACE ucm_override_sbrk() [1709138616.926543] [GPU-LNX-CLUSTER-01:1582149:0] event.c:390 UCX TRACE ucm_sbrk(increment=+155648) [1709138616.926549] [GPU-LNX-CLUSTER-01:1582149:0] event.h:61 UCX TRACE vm_map addr=0x5591c9d37000 length=155648 [1709138616.926645] [GPU-LNX-CLUSTER-01:1582149:0] replace.c:58 UCX TRACE ucm_override_sbrk() [1709138616.926646] [GPU-LNX-CLUSTER-01:1582149:0] event.c:390 UCX TRACE ucm_sbrk(increment=+135168) [1709138616.926648] [GPU-LNX-CLUSTER-01:1582149:0] event.h:61 UCX TRACE vm_map addr=0x5591c9d5d000 length=135168 [1709138616.927105] [GPU-LNX-CLUSTER-01:1582149:0] replace.c:58 UCX TRACE ucm_override_sbrk() [1709138616.927108] [GPU-LNX-CLUSTER-01:1582149:0] event.c:390 UCX TRACE ucm_sbrk(increment=+135168) [1709138616.927110] [GPU-LNX-CLUSTER-01:1582149:0] event.h:61 UCX TRACE vm_map addr=0x5591c9d7e000 length=135168 [1709138616.928010] [GPU-LNX-CLUSTER-01:1582149:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x7f870c1aa000 length=2101248) [1709138616.928014] [GPU-LNX-CLUSTER-01:1582149:0] event.h:73 UCX TRACE vm_unmap addr=0x7f870c1aa000 length=2101248 [1709138616.928099] [GPU-LNX-CLUSTER-01:1582149:0] replace.c:58 UCX TRACE ucm_override_sbrk() [1709138616.928101] [GPU-LNX-CLUSTER-01:1582149:0] event.c:390 UCX TRACE ucm_sbrk(increment=+475136) [1709138616.928104] [GPU-LNX-CLUSTER-01:1582149:0] event.h:61 UCX TRACE vm_map addr=0x5591c9d9f000 length=475136 [1709138616.928331] [GPU-LNX-CLUSTER-01:1582149:0] replace.c:58 UCX TRACE 
ucm_override_sbrk() [1709138616.928333] [GPU-LNX-CLUSTER-01:1582149:0] event.c:390 UCX TRACE ucm_sbrk(increment=+135168) [1709138616.928335] [GPU-LNX-CLUSTER-01:1582149:0] event.h:61 UCX TRACE vm_map addr=0x5591c9e13000 length=135168 [1709138616.928406] [GPU-LNX-CLUSTER-01:1582149:0] replace.c:58 UCX TRACE ucm_override_sbrk() [1709138616.928407] [GPU-LNX-CLUSTER-01:1582149:0] event.c:390 UCX TRACE ucm_sbrk(increment=+135168) [1709138616.928410] [GPU-LNX-CLUSTER-01:1582149:0] event.h:61 UCX TRACE vm_map addr=0x5591c9e34000 length=135168 [1709138616.928491] [GPU-LNX-CLUSTER-01:1582149:0] replace.c:58 UCX TRACE ucm_override_sbrk() [1709138616.928493] [GPU-LNX-CLUSTER-01:1582149:0] event.c:390 UCX TRACE ucm_sbrk(increment=+135168) [1709138616.928495] [GPU-LNX-CLUSTER-01:1582149:0] event.h:61 UCX TRACE vm_map addr=0x5591c9e55000 length=135168 [1709138616.928565] [GPU-LNX-CLUSTER-01:1582149:0] replace.c:58 UCX TRACE ucm_override_sbrk() [1709138616.928566] [GPU-LNX-CLUSTER-01:1582149:0] event.c:390 UCX TRACE ucm_sbrk(increment=+135168) [1709138616.928568] [GPU-LNX-CLUSTER-01:1582149:0] event.h:61 UCX TRACE vm_map addr=0x5591c9e76000 length=135168 [1709138616.928645] [GPU-LNX-CLUSTER-01:1582149:0] replace.c:58 UCX TRACE ucm_override_sbrk() [1709138616.928646] [GPU-LNX-CLUSTER-01:1582149:0] event.c:390 UCX TRACE ucm_sbrk(increment=+135168) [1709138616.928648] [GPU-LNX-CLUSTER-01:1582149:0] event.h:61 UCX TRACE vm_map addr=0x5591c9e97000 length=135168 [1709138616.929685] [GPU-LNX-CLUSTER-01:1582149:0] replace.c:58 UCX TRACE ucm_override_sbrk() [1709138616.929688] [GPU-LNX-CLUSTER-01:1582149:0] event.c:390 UCX TRACE ucm_sbrk(increment=+2162688) [1709138616.929689] [GPU-LNX-CLUSTER-01:1582149:0] event.h:61 UCX TRACE vm_map addr=0x5591c9eb8000 length=2162688 [1709138616.929950] [GPU-LNX-CLUSTER-01:1582149:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x7f86ef200000 length=2097152) [1709138616.929952] [GPU-LNX-CLUSTER-01:1582149:0] event.h:73 UCX TRACE vm_unmap addr=0x7f86ef200000 
length=2097152 [1709138616.930663] [GPU-LNX-CLUSTER-01:1582149:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x7f86ef400000 length=1536000) [1709138616.930667] [GPU-LNX-CLUSTER-01:1582149:0] event.h:73 UCX TRACE vm_unmap addr=0x7f86ef400000 length=1536000 [1709138616.931210] [GPU-LNX-CLUSTER-01:1582149:0] replace.c:58 UCX TRACE ucm_override_sbrk() [1709138616.931213] [GPU-LNX-CLUSTER-01:1582149:0] event.c:390 UCX TRACE ucm_sbrk(increment=+479232) [1709138616.931216] [GPU-LNX-CLUSTER-01:1582149:0] event.h:61 UCX TRACE vm_map addr=0x5591ca0c8000 length=479232 [1709138616.931346] [GPU-LNX-CLUSTER-01:1582149:0] replace.c:58 UCX TRACE ucm_override_sbrk() [1709138616.931348] [GPU-LNX-CLUSTER-01:1582149:0] event.c:390 UCX TRACE ucm_sbrk(increment=+172032) [1709138616.931351] [GPU-LNX-CLUSTER-01:1582149:0] event.h:61 UCX TRACE vm_map addr=0x5591ca13d000 length=172032 [1709138616.931599] [GPU-LNX-CLUSTER-01:1582149:0] event.c:244 UCX TRACE ucm_vm_munmap(addr=0x7f86e4001000 length=33550336) [1709138616.931602] [GPU-LNX-CLUSTER-01:1582149:0] event.h:73 UCX TRACE vm_unmap addr=0x7f86e4001000 length=33550336 [1709138616.931886] [GPU-LNX-CLUSTER-01:1582149:0] replace.c:58 UCX TRACE ucm_override_sbrk() [1709138616.931889] [GPU-LNX-CLUSTER-01:1582149:0] event.c:390 UCX TRACE ucm_sbrk(increment=+1646592) [1709138616.931892] [GPU-LNX-CLUSTER-01:1582149:0] event.h:61 UCX TRACE vm_map addr=0x5591ca167000 length=1646592 [1709138616.931999] [GPU-LNX-CLUSTER-01:1582149:0] replace.c:58 UCX TRACE ucm_override_sbrk() [1709138616.932000] [GPU-LNX-CLUSTER-01:1582149:0] event.c:390 UCX TRACE ucm_sbrk(increment=+237568) [1709138616.932003] [GPU-LNX-CLUSTER-01:1582149:0] event.h:61 UCX TRACE vm_map addr=0x5591ca2f9000 length=237568 [1709138616.932085] [GPU-LNX-CLUSTER-01:1582149:0] replace.c:58 UCX TRACE ucm_override_sbrk() [1709138616.932086] [GPU-LNX-CLUSTER-01:1582149:0] event.c:390 UCX TRACE ucm_sbrk(increment=+409600) [1709138616.932088] [GPU-LNX-CLUSTER-01:1582149:0] event.h:61 UCX TRACE 
vm_map addr=0x5591ca333000 length=409600 [1709138616.932182] [GPU-LNX-CLUSTER-01:1582149:0] replace.c:58 UCX TRACE ucm_override_sbrk() [1709138616.932183] [GPU-LNX-CLUSTER-01:1582149:0] event.c:390 UCX TRACE ucm_sbrk(increment=+225280) [1709138616.932186] [GPU-LNX-CLUSTER-01:1582149:0] event.h:61 UCX TRACE vm_map addr=0x5591ca397000 length=225280 END shmem_init ```
wsdal commented 6 months ago

@roiedanino I have made a minimalistic C file and CUDA file to reproduce the issue.

CUDA source file
```
#include <stdio.h>

typedef struct cudaDeviceInfo {
    int devID;
    cudaDeviceProp *deviceProp;
    int smMajor;
    int smMinor;
    bool inited;
    bool reseted;
} cudaDeviceInfo;

__global__ void helloCUDA_kernel()
{
    printf("Hello, World! from CUDA thread %d\n", threadIdx.x);
}

extern "C" void helloCUDA()
{
    helloCUDA_kernel<<<1, 1>>>();
    cudaDeviceSynchronize();
}

extern "C" void queryDevice()
{
    int driverVersion, runtimeVersion;
    cudaDriverGetVersion(&driverVersion);
    cudaRuntimeGetVersion(&runtimeVersion);
    int driverMajor = driverVersion/1000;
    int driverMinor = (driverVersion%100)/10;
    int runtimeMajor = runtimeVersion/1000;
    int runtimeMinor = (runtimeVersion%100)/10;

    int m_numDevicesForUse;
    int m_numTotalDevices;
    cudaDeviceInfo m_deviceInfoArray[4];
    int m_deviceForUseArray[4];

    cudaGetDeviceCount(&m_numTotalDevices);
    m_numDevicesForUse = 0;
    for(int i=0; i<4; i++)
        m_deviceForUseArray[i] = -1;

    for(int i=0; i<4; i++){
        cudaSetDevice(i);
        cudaDeviceProp *devProp = (cudaDeviceProp*) malloc(sizeof(cudaDeviceProp));
        cudaGetDeviceProperties(devProp, i);
        if(devProp->major<2){
            free(devProp);
            cudaDeviceReset(); // Commenting out cudaDeviceReset() appears to get rid of the errors
            continue;
        }
        else{
            m_deviceForUseArray[m_numDevicesForUse] = i;
            m_deviceInfoArray[m_numDevicesForUse].devID = i;
            m_deviceInfoArray[m_numDevicesForUse].smMajor = devProp->major;
            m_deviceInfoArray[m_numDevicesForUse].smMinor = devProp->minor;
            m_deviceInfoArray[m_numDevicesForUse].deviceProp = devProp;
            m_deviceInfoArray[m_numDevicesForUse].inited = false;
            m_deviceInfoArray[m_numDevicesForUse].reseted = false;
            m_numDevicesForUse++;
            cudaDeviceReset(); // Commenting out cudaDeviceReset() appears to get rid of the errors
        }
    }
}
```
C source file
```
#include <stdio.h>
#include <mpi.h>
#include "shmem.h"
#include "cuda.h"
#include "cuda_runtime_api.h"

void helloCUDA();
void queryDevice();

int main(int argc, char *argv[])
{
    int my_rank, num_ranks;
    int provided=0;

    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    shmem_init();
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &num_ranks);
    printf("Hello, World! from MPI rank %d of %d\n", my_rank, num_ranks);

    queryDevice(); // Error occurs when I add this code segment, see comments in CUDA file
    // helloCUDA(); // hello world kernel executed fine

    shmem_barrier_all();
    MPI_Barrier(MPI_COMM_WORLD);
    shmem_finalize();
    return 0;
}
```

From the comments in the files, note that the error appears to stem from the cudaDeviceReset() API call. As previously mentioned, this application code was handed to me by someone else, so there is a chance that this API call corrupts memory by forcing a reset on a device while its memory is still associated with OpenSHMEM.

This is definitely something to look into. I am still interested in finding out why this issue exists. One more question: if I build my application with any optimization level such as -O1 or -O2, I end up with a segmentation fault. Could this be related to the UCX configuration? I'm being forced to use -O0.

wsdal commented 6 months ago

Per NVIDIA documentation on CUDA runtime API interactions with the CUDA driver API:

There exists a one to one relationship between CUDA devices in the CUDA Runtime API and CUcontexts in the CUDA Driver API within a process. The specific context which the CUDA Runtime API uses for a device is called the device's primary context. From the perspective of the CUDA Runtime API, a device and its primary context are synonymous.

Primary contexts will remain active until they are explicitly deinitialized using cudaDeviceReset(). The function cudaDeviceReset() will deinitialize the primary context for the calling thread's current device immediately. The context will remain current to all of the threads that it was current to. The next CUDA Runtime API call on any thread which requires an active context will trigger the reinitialization of that device's primary context.

Note that primary contexts are shared resources. It is recommended that the primary context not be reset except just before exit or to recover from an unspecified launch failure.

I believe that during SHMEM initialization the UCM component intercepts the CUDA allocations and, I think, associates them with its context. The problem then comes from calling cudaDeviceReset() after this initialization and these allocations have been made: since it destroys the primary context, that pointer is no longer a part of the process maps or of the CUDA memory regions intercepted by UCX's UCM and passed to SHMEM. So at the end of the application, when finalizing SHMEM, the pointer cannot be released.

This is what I believe might be happening. I personally didn't see it, but if someone looks further into this, it may be worth noting that this kind of functionality shouldn't be used or allowed in a UCX+MPI+SHMEM+CUDA environment.

roiedanino commented 6 months ago

Thank you @wsdal, will look into this

roiedanino commented 6 months ago

Can you please try using UCX_MEM_CUDA_HOOK_MODE=none and see if it solves the issue?

wsdal commented 6 months ago

@roiedanino I tried that and it did not resolve it.

$ mpirun --map-by node -np 1 -mca pml ucx -x UCX_MEM_CUDA_HOOK_MODE=none ./hello_mpi_cuda
Hello, World! from MPI rank 0 of 1
[1709638759.672065] [GPU-LNX-CLUSTER-01:1723628:0]    cuda_copy_md.c:182  UCX  ERROR cudaHostUnregister(address)() failed: pointer does not correspond to a registered memory region
[1709638759.672078] [GPU-LNX-CLUSTER-01:1723628:0]          ucp_mm.c:332  UCX  WARN  failed to dereg from md[4]=cuda_cpy: Input/output error
[1709638759.672088] [GPU-LNX-CLUSTER-01:1723628:0]    cuda_copy_md.c:182  UCX  ERROR cudaHostUnregister(address)() failed: pointer does not correspond to a registered memory region
[1709638759.672090] [GPU-LNX-CLUSTER-01:1723628:0]          ucp_mm.c:332  UCX  WARN  failed to dereg from md[4]=cuda_cpy: Input/output error
roiedanino commented 6 months ago

So I don't think the issue here is related to UCM interceptions. Rather, shmem allocates a symmetric heap using UCX to be used for communication, and resetting the device invalidates the allocated memory and its pointers, while SHMEM has no way of knowing about the reset.
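The ordering problem described above can be sketched with a toy model in plain C. This is only an illustration under stated assumptions: it does not call CUDA, UCX, or OpenSHMEM, and every name in it (`toy_context`, `toy_register`, `toy_reset`, `toy_unregister`, `toy_run`) is hypothetical. It models a context that tracks registered host pointers, a reset that silently drops them, and a later deregistration that fails because the pointer is no longer known.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: none of these names exist in CUDA, UCX, or OpenSHMEM. */
#define TOY_MAX_REGIONS 8

typedef struct {
    const void *regions[TOY_MAX_REGIONS]; /* host pointers currently registered */
    size_t      count;
} toy_context;

/* Models a registration made during init (e.g. for the symmetric heap). */
bool toy_register(toy_context *ctx, const void *ptr)
{
    if (ctx->count == TOY_MAX_REGIONS) {
        return false;
    }
    ctx->regions[ctx->count++] = ptr;
    return true;
}

/* Models a device reset: the context forgets every registration,
 * and the owner of the pointers is not told about it. */
void toy_reset(toy_context *ctx)
{
    ctx->count = 0;
}

/* Models deregistration at finalize: fails if the pointer is unknown. */
bool toy_unregister(toy_context *ctx, const void *ptr)
{
    for (size_t i = 0; i < ctx->count; i++) {
        if (ctx->regions[i] == ptr) {
            /* Remove by swapping in the last entry. */
            ctx->regions[i] = ctx->regions[--ctx->count];
            return true;
        }
    }
    return false;
}

/* init -> optional reset -> finalize; returns whether finalize succeeds. */
bool toy_run(bool reset_in_between)
{
    static char heap[64];            /* stands in for the symmetric heap */
    toy_context ctx = { .count = 0 };

    if (!toy_register(&ctx, heap)) {
        return false;
    }
    if (reset_in_between) {
        toy_reset(&ctx);
    }
    return toy_unregister(&ctx, heap);
}
```

In this model `toy_run(false)` returns `true` and `toy_run(true)` returns `false`, which mirrors the report: the error disappears when the reset between init and finalize is removed. Whether the real libraries track registrations this way is an assumption, not something confirmed in this thread.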

wsdal commented 6 months ago

That sounds valid based on what I see when calling the device reset. For now I'll gauge how important that call is in the application space and try to figure out a different solution if required.

roiedanino commented 5 months ago

Hi @wsdal, any updates? Can we close the issue?

wsdal commented 5 months ago

If there is nothing that should be resolved for the call of cudaDeviceReset() in a UCX+MPI+SHMEM environment, then the issue can be closed. Maybe add to the documentation that the call shouldn't be used?

roiedanino commented 5 months ago

I'm not sure this should be part of the documentation, because resetting a device inside a shmem program is not common enough to mention. Also, the shmem_init() documentation mentions the allocation of the symmetric heap, and we could expect that corrupting pointers allocated in shmem_init will cause issues in shmem_finalize:

shmem_init, start_pes - Allocates a block of memory from the symmetric heap.

A call to shmem_finalize will release any resources initialized by a corresponding call to shmem_init