openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org

py_dl_open calls ucm_dlopen which causes a segfault on master #4190

Closed: Akshay-Venkatesh closed 5 years ago

Akshay-Venkatesh commented 5 years ago

I see this issue on 5cc6282, and it may be related to https://github.com/openucx/ucx/issues/564. In this test, UCX is called from Python. The test gets through listener creation and endpoint setup fine, but it crashes just before a send is attempted, before we even get into UCX: py_dl_open traps into ucm_dlopen, where a segfault occurs (maybe because filename=0x0?). I can send the test details if they are of interest, but are there any initial thoughts on how to get around this issue? Thanks in advance.
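For context on the `filename=0x0` in the backtrace below: on POSIX systems, `ctypes.CDLL(None)` reaches `py_dl_open` without a library name, which leads to a `dlopen()` call with a NULL filename. That is a valid request for a handle to the main program. The snippet is a minimal sketch of that call path, not the failing test itself; `open_main_program` is a hypothetical helper name, and it is only an assumption that a call of this kind is what triggers the crash here.

```python
import ctypes
import os


def open_main_program():
    """Return a ctypes handle for the running process.

    Passing None as the library name makes ctypes call dlopen() with a
    NULL filename (POSIX only), which returns a handle to the main program.
    """
    return ctypes.CDLL(None)


if __name__ == "__main__":
    handle = open_main_program()
    # Symbols already loaded into the process, such as libc's getpid,
    # can be resolved through this handle.
    print(handle.getpid() == os.getpid())
```

If a `dlopen` hook dereferences the filename without first checking for NULL, a call like this one would be enough to reach that code path.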

[akvenkatesh@hsw210 ucx-py]$ UCX_MEMTYPE_CACHE=n UCX_TLS=rc,cuda_copy python3 benchmarks/old_tests/send-recv-py-obj.py -o cupy
[1568337683.954607] [hsw210:2577 :0]         parser.c:1568 UCX  WARN  unused env variable: UCX_PATH (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
listening at port 13337
about to send
[hsw210:2577 :0:2577] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace ====
    0  /home/akvenkatesh/ucxpy/install/lib/libucs.so.0(+0x291bb) [0x2acd1b7cc1bb]
    1  /home/akvenkatesh/ucxpy/install/lib/libucs.so.0(+0x29301) [0x2acd1b7cc301]
    2  /usr/lib64/libpthread.so.0(+0xf6d0) [0x2acd100986d0]
    3  /home/akvenkatesh/ucxpy/install/lib/libucm.so.0(ucm_dlopen+0xea) [0x2acd1b59161f]
    4  /home/akvenkatesh/ucxpy/install/lib/python3.7/lib-dynload/_ctypes.cpython-37m-x86_64-linux-gnu.so(+0x103e9) [0x2acd834a73e9
#0  0x00002acd10091f97 in pthread_join () from /usr/lib64/libpthread.so.0
#1  0x00002acd40313f8a in blas_thread_shutdown_ ()
   from /home/akvenkatesh/ucxpy/install/lib/python3.7/site-packages/numpy/core/../.libs/libopenblasp-r0-2ecf47d5.3.7.dev.so
#2  0x00002acd10a735ec in fork () from /usr/lib64/libc.so.6
#3  0x00002acd1b7cb4e8 in ucs_debugger_attach () at ../../../src/ucs/debug/debug.c:633
#4  0x00002acd1b7cbdb1 in ucs_error_freeze (message=0x2acd1b7eb624 "address not mapped to object")
    at ../../../src/ucs/debug/debug.c:822
#5  0x00002acd1b7cc400 in ucs_handle_error (message=0x2acd1b7eb624 "address not mapped to object")
    at ../../../src/ucs/debug/debug.c:992
#6  0x00002acd1b7cc1bb in ucs_debug_handle_error_signal (signo=11, cause=0x2acd1b7eb624 "address not mapped to object", 
    fmt=0x2acd1b7eb7ad " at address %p") at ../../../src/ucs/debug/debug.c:941
#7  0x00002acd1b7cc301 in ucs_error_signal_handler (signo=11, info=0x2acd0f85d5f0, context=0x2acd0f85d4c0)
    at ../../../src/ucs/debug/debug.c:963
#8  <signal handler called>
#9  0x00002acd1b59161f in ucm_dlopen (filename=0x0, flag=2) at ../../../src/ucm/util/reloc.c:389
#10 0x00002acd834a73e9 in py_dl_open (self=<optimized out>, args=<optimized out>)
    at /home/akvenkatesh/ucxpy/Python-3.7.4/Modules/_ctypes/callproc.c:1365
#11 0x00002acd0f986391 in _PyMethodDef_RawFastCallKeywords (method=<optimized out>, self=<optimized out>, args=0x2acd2ffb0e08, 
    nargs=2, kwnames=<optimized out>) at Objects/call.c:698
Server:
[1568338461.976659] [hsw210:3468 :0]     ucp_worker.c:1998 UCX  TRACE ucp_worker_arm returning Device is busy
[1568338461.976666] [hsw210:3468 :0]       ib_iface.c:1174 UCX  TRACE arm_cq: got 0 send and 1 recv events, returning BUSY
[1568338461.976668] [hsw210:3468 :0]     ucp_worker.c:1987 UCX  TRACE arm iface 0x2686250 returned Device is busy
[1568338461.976671] [hsw210:3468 :0]     ucp_worker.c:1998 UCX  TRACE ucp_worker_arm returning Device is busy
[1568338461.976674] [hsw210:3468 :0]     ucp_worker.c:1987 UCX  TRACE arm iface 0x2686250 returned Success
[1568338461.976676] [hsw210:3468 :0]     ucp_worker.c:1998 UCX  TRACE ucp_worker_arm returning Success
about to send
Client:
[1568338461.976329] [hsw210:3499 :0]         wireup.c:489  UCX  TRACE ep 0x2b7e4b006070: sending wireup ack
[1568338461.976360] [hsw210:3499 :0]     ucp_worker.c:728  UCX  TRACE wiface 0x2378ba0 progress returned 1, but no active messages were received
[1568338461.976461] [hsw210:3499 :0]     ucp_worker.c:1998 UCX  TRACE ucp_worker_arm returning Device is busy
[1568338461.976472] [hsw210:3499 :0]       ib_iface.c:1174 UCX  TRACE arm_cq: got 0 send and 1 recv events, returning BUSY
[1568338461.976474] [hsw210:3499 :0]     ucp_worker.c:723  UCX  TRACE arm iface 0x23b6ed0 returned BUSY
[1568338461.976485] [hsw210:3499 :0]     ucp_worker.c:1987 UCX  TRACE arm iface 0x2392390 returned Success
[1568338461.976488] [hsw210:3499 :0]     ucp_worker.c:1998 UCX  TRACE ucp_worker_arm returning Success
hoopoepg commented 5 years ago

Hi @Akshay-Venkatesh, could you try this fix and check whether it helps? https://github.com/openucx/ucx/pull/4191

Thank you.

Akshay-Venkatesh commented 5 years ago

Thanks a lot, @hoopoepg. That worked!