I see this issue on 5cc6282, and it may be related to https://github.com/openucx/ucx/issues/564.
In this test, UCX is called from Python. The test gets through listener creation and endpoint setup fine, but it crashes just before the send is attempted, and before control even reaches UCX: py_dl_open calls into ucm_dlopen, where the segfault occurs (possibly because filename=0x0?). I can share the test details if that would help. Are there any initial thoughts on how to work around this issue? Thanks in advance.
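For context on where a NULL filename could come from: in CPython, `ctypes.CDLL(None)` passes a NULL name through py_dl_open to dlopen(), which on Linux returns a handle to the main program. py_dl_open also ORs RTLD_NOW into the mode, which would be consistent with the flag=2 seen in the backtrace. The snippet below is only a minimal sketch of that call path on Linux, not the actual test. The helper name `open_main_program` is hypothetical, and I am assuming (not confirming) that some library in the test makes an equivalent call.

```python
import ctypes


def open_main_program():
    """Open the main program via ctypes (hypothetical helper).

    CDLL(None) makes _ctypes' py_dl_open pass a NULL filename to dlopen(),
    which is a valid request on Linux: it returns a handle to the main program.
    """
    return ctypes.CDLL(None)


if __name__ == "__main__":
    lib = open_main_program()
    # On a plain interpreter this prints a non-zero handle. With UCX loaded,
    # the same dlopen() call would be routed through ucm_dlopen.
    print("dlopen(NULL) handle:", hex(lib._handle))
```

If running this after UCX has been initialized in the same process reproduces the segfault, that would suggest the problem is in how ucm_dlopen handles a NULL filename rather than in the test itself.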
[akvenkatesh@hsw210 ucx-py]$ UCX_MEMTYPE_CACHE=n UCX_TLS=rc,cuda_copy python3 benchmarks/old_tests/send-recv-py-obj.py -o cupy
[1568337683.954607] [hsw210:2577 :0] parser.c:1568 UCX WARN unused env variable: UCX_PATH (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
listening at port 13337
about to send
[hsw210:2577 :0:2577] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace ====
0 /home/akvenkatesh/ucxpy/install/lib/libucs.so.0(+0x291bb) [0x2acd1b7cc1bb]
1 /home/akvenkatesh/ucxpy/install/lib/libucs.so.0(+0x29301) [0x2acd1b7cc301]
2 /usr/lib64/libpthread.so.0(+0xf6d0) [0x2acd100986d0]
3 /home/akvenkatesh/ucxpy/install/lib/libucm.so.0(ucm_dlopen+0xea) [0x2acd1b59161f]
4 /home/akvenkatesh/ucxpy/install/lib/python3.7/lib-dynload/_ctypes.cpython-37m-x86_64-linux-gnu.so(+0x103e9) [0x2acd834a73e9
#0 0x00002acd10091f97 in pthread_join () from /usr/lib64/libpthread.so.0
#1 0x00002acd40313f8a in blas_thread_shutdown_ ()
from /home/akvenkatesh/ucxpy/install/lib/python3.7/site-packages/numpy/core/../.libs/libopenblasp-r0-2ecf47d5.3.7.dev.so
#2 0x00002acd10a735ec in fork () from /usr/lib64/libc.so.6
#3 0x00002acd1b7cb4e8 in ucs_debugger_attach () at ../../../src/ucs/debug/debug.c:633
#4 0x00002acd1b7cbdb1 in ucs_error_freeze (message=0x2acd1b7eb624 "address not mapped to object")
at ../../../src/ucs/debug/debug.c:822
#5 0x00002acd1b7cc400 in ucs_handle_error (message=0x2acd1b7eb624 "address not mapped to object")
at ../../../src/ucs/debug/debug.c:992
#6 0x00002acd1b7cc1bb in ucs_debug_handle_error_signal (signo=11, cause=0x2acd1b7eb624 "address not mapped to object",
fmt=0x2acd1b7eb7ad " at address %p") at ../../../src/ucs/debug/debug.c:941
#7 0x00002acd1b7cc301 in ucs_error_signal_handler (signo=11, info=0x2acd0f85d5f0, context=0x2acd0f85d4c0)
at ../../../src/ucs/debug/debug.c:963
#8 <signal handler called>
#9 0x00002acd1b59161f in ucm_dlopen (filename=0x0, flag=2) at ../../../src/ucm/util/reloc.c:389
#10 0x00002acd834a73e9 in py_dl_open (self=<optimized out>, args=<optimized out>)
at /home/akvenkatesh/ucxpy/Python-3.7.4/Modules/_ctypes/callproc.c:1365
#11 0x00002acd0f986391 in _PyMethodDef_RawFastCallKeywords (method=<optimized out>, self=<optimized out>, args=0x2acd2ffb0e08,
nargs=2, kwnames=<optimized out>) at Objects/call.c:698
Server:
[1568338461.976659] [hsw210:3468 :0] ucp_worker.c:1998 UCX TRACE ucp_worker_arm returning Device is busy
[1568338461.976666] [hsw210:3468 :0] ib_iface.c:1174 UCX TRACE arm_cq: got 0 send and 1 recv events, returning BUSY
[1568338461.976668] [hsw210:3468 :0] ucp_worker.c:1987 UCX TRACE arm iface 0x2686250 returned Device is busy
[1568338461.976671] [hsw210:3468 :0] ucp_worker.c:1998 UCX TRACE ucp_worker_arm returning Device is busy
[1568338461.976674] [hsw210:3468 :0] ucp_worker.c:1987 UCX TRACE arm iface 0x2686250 returned Success
[1568338461.976676] [hsw210:3468 :0] ucp_worker.c:1998 UCX TRACE ucp_worker_arm returning Success
about to send
Client:
[1568338461.976329] [hsw210:3499 :0] wireup.c:489 UCX TRACE ep 0x2b7e4b006070: sending wireup ack
[1568338461.976360] [hsw210:3499 :0] ucp_worker.c:728 UCX TRACE wiface 0x2378ba0 progress returned 1, but no active messages were received
[1568338461.976461] [hsw210:3499 :0] ucp_worker.c:1998 UCX TRACE ucp_worker_arm returning Device is busy
[1568338461.976472] [hsw210:3499 :0] ib_iface.c:1174 UCX TRACE arm_cq: got 0 send and 1 recv events, returning BUSY
[1568338461.976474] [hsw210:3499 :0] ucp_worker.c:723 UCX TRACE arm iface 0x23b6ed0 returned BUSY
[1568338461.976485] [hsw210:3499 :0] ucp_worker.c:1987 UCX TRACE arm iface 0x2392390 returned Success
[1568338461.976488] [hsw210:3499 :0] ucp_worker.c:1998 UCX TRACE ucp_worker_arm returning Success