openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org

ucp active messages take an excessive number of ucp_worker_progress calls #4036

Open tjcw opened 5 years ago

tjcw commented 5 years ago

I have a test case which calls MPI_Ibcast to do a nonblocking broadcast of a 32MB buffer around 4 ranks on 4 nodes, over InfiniBand. I wait for completion of the MPI_Ibcast with a loop which does some floating-point work and then calls MPI_Testall. I am using an IBM proprietary collectives library which works by sending active messages and can run over either IBM PAMI or UCP. Compared with the IBM PAMI implementation, the UCP implementation requires about 10x as many calls to 'advance' (ucp_worker_progress), which results in a 10x longer elapsed time to complete the broadcasts. I am running with a 'release' build of UCX from the current master branch. Is UCP breaking the buffer into short segments and requiring a call to ucp_worker_progress for each segment? The attached run_abbrev.log is a trace of the run; searching for 'advancecount' shows the number of calls to ucp_worker_progress at each stage.
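
For reference, the wait loop looks roughly like this (a minimal sketch; do_flops() and the hard-coded 32MB size are placeholders, not the actual benchmark code):

    /* Sketch of the test pattern: a nonblocking broadcast overlapped with
     * floating-point work, polled with MPI_Testall (which drives progress). */
    #include <mpi.h>

    extern void do_flops(void);   /* placeholder for the floating-point work */

    static void ibcast_and_wait(void *buf, MPI_Comm comm)
    {
        MPI_Request req;
        int done = 0;

        MPI_Ibcast(buf, 32 * 1024 * 1024, MPI_BYTE, 0, comm, &req);
        while (!done) {
            do_flops();                                       /* overlap compute */
            MPI_Testall(1, &req, &done, MPI_STATUSES_IGNORE); /* poll completion */
        }
    }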

tjcw commented 4 years ago

With some further testing, it looks like there is a threshold length between 4096 bytes and 65536 bytes: 'knem' ucp_put_nb transfers shorter than this length are either not done or are misdirected, while 'knem' transfers of the threshold length or longer are done correctly. If this is the problem, I can code around it. @yosefe @snyjm-18 do you know where I can report issues for 'knem' on POWER? I presume the 'gtest' isn't picking up this problem because 'gtest' uses a single process; maybe ucp_put_nb uses 'memcpy' instead of 'knem' in this case, or maybe 'knem' works differently when the source and target are in the same process.
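
For reference, the kind of probe I would use to bracket the threshold is sketched below; it assumes the endpoint, the unpacked rkey, a registered source buffer and the remote address are already set up, and the names are illustrative only:

    /* Probe ucp_put_nb at several lengths to bracket the knem threshold.
     * worker, ep, src, remote_addr and rkey are assumed to be set up already;
     * the destination contents must be verified on the remote side. */
    static void put_cb(void *request, ucs_status_t status)
    {
        /* completion of one ucp_put_nb */
    }

    static void probe_put_lengths(ucp_worker_h worker, ucp_ep_h ep, void *src,
                                  uint64_t remote_addr, ucp_rkey_h rkey)
    {
        size_t lengths[] = {4096, 8192, 16384, 32768, 65536};

        for (size_t i = 0; i < sizeof(lengths) / sizeof(lengths[0]); ++i) {
            ucs_status_ptr_t sptr = ucp_put_nb(ep, src, lengths[i], remote_addr,
                                               rkey, put_cb);
            if (UCS_PTR_IS_PTR(sptr)) {
                while (ucp_request_check_status(sptr) == UCS_INPROGRESS) {
                    ucp_worker_progress(worker);   /* drive the transfer */
                }
                ucp_request_free(sptr);
            }
            /* UCS_OK means the put completed immediately; an error pointer
             * means it could not be posted at all. */
        }
    }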

tjcw commented 4 years ago

When I scale up my test case to 120 processes on 6 nodes, I get a segmentation violation in one rank. The message is

[1,0]<stderr>:[c712f6n01:157754:0:157754] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x451d287067bdf51c)

and the backtrace is

(gdb) where
#0  0x000010000062f568 in pause () from /lib64/libc.so.6
#1  0x000010000310aae4 in ucs_debug_freeze () at debug/debug.c:710
#2  0x000010000310e5f0 in ucs_error_freeze (
    message=0x10000312eda8 "address not mapped to object") at debug/debug.c:829
#3  ucs_handle_error (message=0x10000312eda8 "address not mapped to object")
    at debug/debug.c:992
#4  0x000010000310e9e8 in ucs_debug_handle_error_signal (signo=11, 
    cause=0x10000312eda8 "address not mapped to object", 
    fmt=0x10000312ee98 " at address %p") at debug/debug.c:941
#5  0x000010000310ed44 in ucs_error_signal_handler (signo=<optimized out>, 
    info=0x1000031c4778, context=<optimized out>) at debug/debug.c:963
#6  <signal handler called>
#7  ucp_rma_basic_progress_put (self=0x1000e4b1aa8) at rma/rma_basic.c:28
#8  0x00001000211ed840 in uct_dc_mlx5_iface_dci_do_common_pending_tx (
    ep=0x1000e3984b0, elem=0x1000e4b1ab0) at dc/dc_mlx5_ep.c:1069
#9  0x00001000211eda1c in uct_dc_mlx5_iface_dci_do_dcs_pending_tx (
    arbiter=<optimized out>, elem=<optimized out>, arg=<optimized out>)
    at dc/dc_mlx5_ep.c:1103
#10 0x0000100003100a14 in ucs_arbiter_dispatch_nonempty (
    arbiter=0x1000e332710, per_group=<optimized out>, 
    cb=0x1000211ed9e0 <uct_dc_mlx5_iface_dci_do_dcs_pending_tx>, cb_arg=0x0)
    at datastruct/arbiter.c:286
#11 0x00001000211f600c in ucs_arbiter_dispatch (per_group=1, cb_arg=0x0, cb=<optimized out>, arbiter=<optimized out>)
    at /smpi_dev/tjcw/workspace/ibm-smpi-toucan/ompibase/opensrc/myucx/src/ucs/datastruct/arbiter.h:356
#12 uct_dc_mlx5_iface_progress_pending (iface=<optimized out>)
    at dc/dc_mlx5_ep.h:309
#13 uct_dc_mlx5_poll_tx (iface=0x1000e329600) at dc/dc_mlx5.c:225
#14 uct_dc_mlx5_iface_progress (arg=0x1000e329600) at dc/dc_mlx5.c:238
#15 0x0000100003005904 in ucs_callbackq_dispatch (cbq=<optimized out>)
    at /smpi_dev/tjcw/workspace/ibm-smpi-toucan/ompibase/opensrc/myucx/src/ucs/datastruct/callbackq.h:211
#16 uct_worker_progress (worker=<optimized out>)
    at /smpi_dev/tjcw/workspace/ibm-smpi-toucan/ompibase/opensrc/myucx/src/uct/api/uct.h:2203
#17 ucp_worker_progress (worker=0x1000213b0010) at core/ucp_worker.c:1892
#18 0x0000100002cdfa9c in LibColl::Adapter::UCXContext::advance (
    this=0x1000e3aea00) at ../../adapter/ucx/UCXContext.h:334
#19 0x0000100002cc9318 in LIBCOLL_Advance (ctxt=<optimized out>)
    at libcoll.cc:161
#20 0x0000100002ef2748 in blocking_coll (context=0x1000e3aea00, 
    coll=0x3ffffa066e30, active=0x3ffffa078940) at ../collsel/init_util.h:130
#21 0x0000100002f0275c in measure_regular_collective (iters=1, 
    bench=<synthetic pointer>, colls=0x3ffffa066e30, context=0x1000e3aea00)
    at ../collsel/Benchmark.h:2229
#22 measure_collective (bench=<synthetic pointer>, colls=0x3ffffa066e30, 
    context=0x1000e3aea00) at ../collsel/Benchmark.h:2411
#23 LibColl::AdvisorTable::generate (this=0x1000e5cfdc0, 
    filename=0x1000e4baf40 "0.xml", params=<optimized out>, 
    ops=0x100002c017b0 <external_geometry_ops>, mode=<optimized out>)
    at ../collsel/AdvisorTable.h:999
#24 0x0000100002ef59f0 in LibColl::CollselExtension::Collsel_table_generate (
    advisor=<optimized out>, filename=0x1000e4baf40 "0.xml", 
    params=<optimized out>, ops=<optimized out>, mode=<optimized out>)
    at ../collsel/CollselExtension.cc:95
#25 0x0000100002ef5a48 in Collsel_table_generate (advisor=<optimized out>, 
    filename=<optimized out>, params=<optimized out>, ops=<optimized out>, 
    mode=<optimized out>) at CollselExtension.cc:26
#26 0x0000100002bdc4f0 in mca_coll_ibm_tune ()
   from /smpi_dev/tjcw/workspace/ibm-smpi-toucan/ompibase/exports/optimized/lib/spectrum_mpi/mca_coll_ibm.so
#27 0x0000100002bb4550 in mca_coll_ibm_hook_at_mpi_init_bottom ()
   from /smpi_dev/tjcw/workspace/ibm-smpi-toucan/ompibase/exports/optimized/lib/spectrum_mpi/mca_coll_ibm.so
#28 0x00001000001cb120 in ompi_hook_base_mpi_init_bottom ()
   from /smpi_dev/tjcw/workspace/ibm-smpi-toucan/ompibase/exports/optimized/lib/libmpi_ibm.so.3
#29 0x0000100000149ebc in ompi_mpi_init ()
   from /smpi_dev/tjcw/workspace/ibm-smpi-toucan/ompibase/exports/optimized/lib/libmpi_ibm.so.3
#30 0x000010000017f8f8 in PMPI_Init ()
   from /smpi_dev/tjcw/workspace/ibm-smpi-toucan/ompibase/exports/optimized/lib/libmpi_ibm.so.3
#31 0x0000000010000888 in main ()
(gdb)

Looking in the UCX logs for 0x451d287067bdf51c shows that it is the msg_id of an active message which is in progress; it occurs in the data field of a 'reply' active message sent to the failing process, and it is in the first active message which gets processed by the RDMA code path. With more testing, the code doesn't fail when running 20 ranks on 1 node or 40 ranks on 2 nodes, but it does fail running 80 ranks on 4 nodes. It isn't always the first use of the RDMA code path which fails.

tjcw commented 4 years ago

Adding trace to ucp_rma_basic_progress_put

    ucs_trace("req=%p rkey=%p ep=%p", req, rkey, ep) ;

shows that this value is in the 'rkey', i.e. req->send.rma.rkey. The req is from the mpool and has been used before, but I think it has been correctly allocated and freed, and I don't think my code is overwriting this value.

tjcw commented 4 years ago

I've found the immediate cause of this problem; it's that my code has

    sptr = ucp_put_nb(ep, iovec[1].buffer,iovec[1].length,
        rdma_reply_hdr->address+iovec[0].length,unfinished->rkey,
        ucp_am_rdma_completion_callback) ;
    if (sptr == UCS_OK)
    {
..
    }
    else
    {
        ucs_trace("AM RDMA rdma issued, sptr=%p", sptr ) ;
        ucs_assert(!UCS_PTR_IS_ERR(sptr)) ;
        req2 = ((ucp_request_t *) sptr)-1 ;
        req2->send.am.message_id = unfinished->msg_id ;
    }

i.e. I'm writing into the 'send.am' field of a req when the req is of a 'send.rma' type. I need to pass a msg_id (64-bit value) to the completion callback of ucp_put_nb; is there a suitable field in a 'req' to do this?
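
If no field in the request is suitable, one generic alternative (not what I ended up doing; see the next comment) would be a small side table on my side that maps the request pointer returned by ucp_put_nb to the msg_id and is consulted in the completion callback; a sketch with illustrative names:

    /* Sketch: associate a 64-bit msg_id with an in-flight request without
     * touching UCX's request structure. MAX_INFLIGHT and the linear search
     * are illustrative; a real implementation would use a hash table. */
    #include <stddef.h>
    #include <stdint.h>

    #define MAX_INFLIGHT 128

    static struct {
        void     *request;   /* pointer returned by ucp_put_nb */
        uint64_t  msg_id;
    } inflight[MAX_INFLIGHT];

    static void remember_msg_id(void *request, uint64_t msg_id)
    {
        for (int i = 0; i < MAX_INFLIGHT; ++i) {
            if (inflight[i].request == NULL) {
                inflight[i].request = request;
                inflight[i].msg_id  = msg_id;
                return;
            }
        }
    }

    static uint64_t lookup_and_forget_msg_id(void *request)
    {
        for (int i = 0; i < MAX_INFLIGHT; ++i) {
            if (inflight[i].request == request) {
                inflight[i].request = NULL;
                return inflight[i].msg_id;
            }
        }
        return 0;   /* not found */
    }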

tjcw commented 4 years ago

I introduced a 'message_id' into the 'send.rma' field of the request structure; i.e. now I have

                struct {
                    uint64_t      remote_addr; /* Remote address */
                    ucp_rkey_h    rkey;     /* Remote memory key */
                    uint64_t      message_id;  /* Message ID of the active message this RMA serves */
                } rma;

and made appropriate changes in my code to use it; now the AM-over-RDMA seems to be working properly for 120 processes on 6 nodes. I don't think this extra field bloats the 'request' structure, since the rma struct is shorter than some of the other structs in the union. I will check again whether ucp_put_nb is still broken for lengths less than 65536 bytes within a node, but I don't think I've changed anything relevant here, so I presume it is still broken.
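
With that field in place, the put path from the earlier comment becomes the following (still a sketch using my own helper names such as unfinished and ucp_am_rdma_completion_callback):

    sptr = ucp_put_nb(ep, iovec[1].buffer, iovec[1].length,
                      rdma_reply_hdr->address + iovec[0].length,
                      unfinished->rkey, ucp_am_rdma_completion_callback);
    if (sptr == UCS_OK)
    {
        /* the put completed immediately; there is no request to annotate */
    }
    else
    {
        ucs_trace("AM RDMA rdma issued, sptr=%p", sptr);
        ucs_assert(!UCS_PTR_IS_ERR(sptr));
        req2 = ((ucp_request_t *) sptr) - 1;
        /* store the message id in the rma part of the union, which is the
         * part actually in use for a ucp_put_nb request */
        req2->send.rma.message_id = unfinished->msg_id;
    }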

tjcw commented 4 years ago

Now, with trace off and using RDMA for 65536 bytes and longer, I hit another assertion failure

[1,34]<stderr>:[c712f6n02:107237:0:107237]      ucp_am.c:1378 Assertion `payload_data_last == unfinished->iovec_1_last_byte' failed

The implication of this failure is that a knem 'put' for 65536 bytes was partial, i.e. the first byte was transferred but the last byte was not. Alternatively, the completion callback for the ucp_put_nb may have been driven 'early', i.e. while data was still being copied; in that case the application may update the source buffer, which would confuse the destination.
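
For clarity, the check that trips is of this general form: when the RDMA is issued, the receive side records the first and last bytes the sender reports, and the completion callback compares them with what actually landed in the destination buffer (a simplified sketch, not the exact code in my branch):

    #include <stddef.h>
    #include <stdint.h>
    #include <ucs/debug/assert.h>

    /* Simplified sketch of the consistency check behind the assertion:
     * compare the first and last bytes of the received payload with the
     * values recorded when the RDMA was issued. */
    typedef struct {
        uint8_t iovec_1_first_byte;   /* expected first byte of the payload */
        uint8_t iovec_1_last_byte;    /* expected last byte of the payload */
    } am_rdma_check_t;

    static void check_payload(const uint8_t *payload, size_t length,
                              const am_rdma_check_t *expected)
    {
        uint8_t payload_data_first = payload[0];
        uint8_t payload_data_last  = payload[length - 1];

        ucs_assert(payload_data_first == expected->iovec_1_first_byte);
        ucs_assert(payload_data_last  == expected->iovec_1_last_byte);
    }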

tjcw commented 4 years ago

Yes, I still have the problem of knem copies of 4096 bytes failing.

tjcw commented 4 years ago

I have a version of my code which does an RDMA 'get' rather than a 'put' in branch https://github.com/tjcw/ucx/tree/tjcw-am-rdma-get . I find that sometimes the callback for the ucp_get_nb is driven with code -80, UCS_ERR_ENDPOINT_TIMEOUT. @yosefe Why is this happening? Is it possible for the application to recover?
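
For reference, the shape of the completion callback involved is below; the real code asserts that status == UCS_OK, which is what turns the timeout into a fatal error. Whether it is safe to simply re-issue the get after UCS_ERR_ENDPOINT_TIMEOUT is exactly what I'm asking (a sketch with an illustrative name, not the code in my branch):

    #include <stdio.h>
    #include <ucp/api/ucp.h>

    /* Sketch of a ucp_get_nb completion callback that reports errors instead
     * of asserting on them. */
    static void am_rdma_get_completion_cb(void *request, ucs_status_t status)
    {
        if (status == UCS_OK) {
            /* the data has arrived; hand the payload to the AM layer */
        } else {
            /* e.g. UCS_ERR_ENDPOINT_TIMEOUT (-80): the transport exhausted
             * its retries; recovery would presumably mean re-issuing the
             * transfer, possibly on a fresh endpoint */
            fprintf(stderr, "get completed with status %s\n",
                    ucs_status_string(status));
        }
        /* the request itself is released by the code that issued the get */
        (void)request;
    }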

tjcw commented 4 years ago

I'm doing some more testing, this time with 8 ranks on a single POWER9 node. Most of the time the test case passes, but occasionally it hangs or gets a UCS_ERR_ENDPOINT_TIMEOUT. I wasn't expecting the test case to use IB (since it is all on a single node). The last few lines of UCX trace on the failing rank are

[1569928070.563076] [f8n02:26330:0]       ucp_rkey.c:256  UCX  TRACE rkey[1] for remote md 6 is 0xffffffff000c1557
[1569928070.563078] [f8n02:26330:0]          ib_md.c:619  UCX  TRACE unpacked rkey 0xffffffff00001a32: direct 0x1a32 indirect 0xffffffff
[1569928070.563079] [f8n02:26330:0]       ucp_rkey.c:256  UCX  TRACE rkey[2] for remote md 7 is 0xffffffff00001a32
[1569928070.563080] [f8n02:26330:0]          ib_md.c:619  UCX  TRACE unpacked rkey 0xffffffff0004162a: direct 0x4162a indirect 0xffffffff
[1569928070.563081] [f8n02:26330:0]       ucp_rkey.c:256  UCX  TRACE rkey[3] for remote md 8 is 0xffffffff0004162a
[1569928070.563083] [f8n02:26330:0]       ucp_rkey.c:256  UCX  TRACE rkey[4] for remote md 10 is 0xdeadbeef
[1569928070.563084] [f8n02:26330:0]        knem_md.c:219  UCX  TRACE unpacked rkey: key 0x4ce71d60 cookie 0x8a7bd36c4805523e address 3ff96a80
[1569928070.563086] [f8n02:26330:0]       ucp_rkey.c:256  UCX  TRACE rkey[5] for remote md 11 is 0x4ce71d60
[1569928070.563087] [f8n02:26330:0]       ucp_rkey.c:487  UCX  TRACE rkey 0x4ce70470 ep 0x200022990150 @ cfg[2] basic_rma: lane[0] rkey 0xffffffff00080732 sw_amo: lane[3] rkey 0xffffffffffffffff
[1569928070.563089] [f8n02:26330:0]       rma_send.c:311  UCX  REQ   get_nb buffer 0x4d073550 length 524288 remote_addr 3ff96a80 rkey 0x4ce70470 from f8n02:26324 cb 0x200002fcbd50
[1569928070.563091] [f8n02:26330:0]       rma_send.c:184  UCX  REQ   allocated request 0x4ccc0780
[1569928070.563092] [f8n02:26330:0]         ucp_mm.c:124  UCX  TRACE registered address 0x4d073550 length 524288 on md[5] memh[0]=0x4c70eb78
[1569928070.563093] [f8n02:26330:0]    ucp_request.c:227  UCX  REQ   req 0x4ccc0780: mem reg md_map 0x20/0x20
[1569928070.563095] [f8n02:26330:0]          rma.inl:30   UCX  REQ   returning request 0x4ccc0780, status Success
[1569928087.197050] [f8n02:26330:0]    ib_mlx5_log.c:139  UCX  DEBUG Transport retry count exceeded on mlx5_1:1/IB (synd 0x15 vend 0x81 hw_synd 0/0)
[1569928087.197050] [f8n02:26330:0]    ib_mlx5_log.c:139  UCX  DEBUG RC QP 0x4370 wqe[268]: opcode RDMA_READ
[1569928087.197088] [f8n02:26330:0]    ucp_request.c:195  UCX  REQ   req 0x4ccc0780: mem dereg buffer 0/1 md_map 0x20
[1569928087.197090] [f8n02:26330:0]         ucp_mm.c:71   UCX  TRACE de-registering memh[0]=0x4c70eb78 from md[5]
[1569928087.197092] [f8n02:26330:0]  ucp_request.inl:96   UCX  REQ   completing send request 0x4ccc0780 (0x4ccc0868) ----c-- Endpoint timeout
[1569928087.197094] [f8n02:26330:0]  ucp_request.inl:98   UCX  REQ   Driving callback
[1569928087.197095] [f8n02:26330:0]         ucp_am.c:1258 UCX  TRACE AM RDMA ucp_am_rdma_get_completion_callback request=0x4ccc0868 req=0x4ccc0780 status=-80

and the backtrace is

#0  0x000020000061f568 in pause () from /lib64/libc.so.6
#1  0x00002000030eaae4 in ucs_debug_freeze () at debug/debug.c:710
#2  0x00002000030ee5f0 in ucs_error_freeze (message=0x7fffe8504610 "Assertion `status == UCS_OK' failed") at debug/debug.c:829
#3  ucs_handle_error (message=0x7fffe8504610 "Assertion `status == UCS_OK' failed") at debug/debug.c:992
#4  0x00002000030ea898 in ucs_fatal_error_message (file=0x200003039108 "core/ucp_am.c", line=<optimized out>, 
    function=<optimized out>, message_buf=0x7fffe8504610 "Assertion `status == UCS_OK' failed") at debug/assert.c:33
#5  0x00002000030eaa64 in ucs_fatal_error_format (file=0x200003039108 "core/ucp_am.c", line=<optimized out>, 
    function=0x200003038ee0 <__FUNCTION__.14321> "ucp_am_rdma_get_completion_callback", format=<optimized out>) at debug/assert.c:49
#6  0x0000200002fcc5d8 in ucp_am_rdma_get_completion_callback (request=0x4ccc0868, status=<optimized out>) at core/ucp_am.c:1259
#7  0x0000200002fedea8 in ucp_request_complete_send (status=UCS_ERR_ENDPOINT_TIMEOUT, req=0x4ccc0780)
    at /smpi_dev/tjcw/workspace/ibm-smpi-toucan/ompibase/opensrc/myucx/src/ucp/core/ucp_request.inl:98
#8  ucp_rma_request_zcopy_completion (self=0x4ccc0810, status=<optimized out>) at rma/rma_send.c:117
#9  0x0000200015cf776c in uct_invoke_completion (status=<optimized out>, comp=0x4ccc0810)
    at /smpi_dev/tjcw/workspace/ibm-smpi-toucan/ompibase/opensrc/myucx/src/uct/base/uct_iface.h:646
#10 uct_rc_txqp_purge_outstanding (txqp=0x4ce076b0, status=<optimized out>, is_log=<optimized out>) at rc/base/rc_ep.c:329
#11 0x0000200015d120f0 in uct_rc_mlx5_ep_handle_failure (ep=0x4ce076a0, status=<optimized out>) at rc/accel/rc_mlx5_ep.c:969
#12 0x0000200015d13b78 in uct_rc_mlx5_iface_handle_failure (ib_iface=0x4c839f20, arg=0x200016887840, status=<optimized out>)
    at rc/accel/rc_mlx5_iface.c:202
#13 0x0000200015cf3a08 in uct_ib_mlx5_check_completion (iface=0x4c839f20, cq=<optimized out>, cqe=0x200016887840)
    at mlx5/ib_mlx5.c:340
#14 0x0000200015d15608 in uct_ib_mlx5_poll_cq (cq=0x4c8425f8, iface=0x4c839f20)
    at /smpi_dev/tjcw/workspace/ibm-smpi-toucan/ompibase/opensrc/myucx/src/uct/ib/mlx5/ib_mlx5.inl:38
#15 uct_rc_mlx5_iface_poll_tx (iface=0x4c839f20) at rc/accel/rc_mlx5_iface.c:89
#16 uct_rc_mlx5_iface_progress (arg=0x4c839f20) at rc/accel/rc_mlx5_iface.c:124
#17 0x0000200002fe48d4 in ucs_callbackq_dispatch (cbq=<optimized out>)
    at /smpi_dev/tjcw/workspace/ibm-smpi-toucan/ompibase/opensrc/myucx/src/ucs/datastruct/callbackq.h:211
#18 uct_worker_progress (worker=<optimized out>)
    at /smpi_dev/tjcw/workspace/ibm-smpi-toucan/ompibase/opensrc/myucx/src/uct/api/uct.h:2203
#19 ucp_worker_progress (worker=0x200015fd0010) at core/ucp_worker.c:1892
#20 0x0000200002cbfa9c in LibColl::Adapter::UCXContext::advance (this=0x4cbaf600) at ../../adapter/ucx/UCXContext.h:334

Why is the transfer being done with IB? I was expecting knem.

tjcw commented 4 years ago

I have another similar failure, this time with 40 ranks on a single POWER9 node and UCX trace turned off. The backtrace is

(gdb) where
#0  0x000020000061f568 in pause () from /lib64/libc.so.6
#1  0x000020000310b224 in ucs_debug_freeze () at debug/debug.c:710
#2  0x000020000310ed50 in ucs_error_freeze (
    message=0x7fffd763a8f0 "Assertion `status == UCS_OK' failed")
    at debug/debug.c:829
#3  ucs_handle_error (
    message=0x7fffd763a8f0 "Assertion `status == UCS_OK' failed")
    at debug/debug.c:992
#4  0x000020000310afd8 in ucs_fatal_error_message (
    file=0x200003049238 "core/ucp_am.c", line=<optimized out>, 
    function=<optimized out>, 
    message_buf=0x7fffd763a8f0 "Assertion `status == UCS_OK' failed")
    at debug/assert.c:33
#5  0x000020000310b1a4 in ucs_fatal_error_format (
    file=0x200003049238 "core/ucp_am.c", line=<optimized out>, 
    function=0x200003048fb0 <__FUNCTION__.14466> "ucp_am_rendezvous_get_completion_callback", format=<optimized out>) at debug/assert.c:49
#6  0x0000200002fdc940 in ucp_am_rendezvous_get_completion_callback (
    request=0x4e852668, status=<optimized out>) at core/ucp_am.c:1322
#7  0x0000200002ffdfe8 in ucp_request_complete_send (
    status=UCS_ERR_ENDPOINT_TIMEOUT, req=0x4e852580)
    at /smpi_dev/tjcw/workspace/ibm-smpi-toucan/ompibase/opensrc/myucx/src/ucp/core/ucp_request.inl:98
#8  ucp_rma_request_zcopy_completion (self=0x4e852610, status=<optimized out>)
    at rma/rma_send.c:117
#9  0x00002000276a79ac in uct_invoke_completion (status=<optimized out>, 
    comp=0x4e852610)
    at /smpi_dev/tjcw/workspace/ibm-smpi-toucan/ompibase/opensrc/myucx/src/uct/base/uct_iface.h:646
#10 uct_rc_txqp_purge_outstanding (txqp=0x4e89eec0, status=<optimized out>, 
    is_log=<optimized out>) at rc/base/rc_ep.c:329
#11 0x00002000276c2110 in uct_rc_mlx5_ep_handle_failure (ep=0x4e89eeb0, 
    status=<optimized out>) at rc/accel/rc_mlx5_ep.c:969
#12 0x00002000276c3d18 in uct_rc_mlx5_iface_handle_failure (
    ib_iface=0x4e38d420, arg=0x20002b1feac0, status=<optimized out>)
    at rc/accel/rc_mlx5_iface.c:212
#13 0x00002000276a3c38 in uct_ib_mlx5_check_completion (iface=0x4e38d420, 
    cq=<optimized out>, cqe=0x20002b1feac0) at mlx5/ib_mlx5.c:340
#14 0x00002000276c57a8 in uct_ib_mlx5_poll_cq (cq=0x4e395af8, iface=0x4e38d420)
    at /smpi_dev/tjcw/workspace/ibm-smpi-toucan/ompibase/opensrc/myucx/src/uct/ib/mlx5/ib_mlx5.inl:38
#15 uct_rc_mlx5_iface_poll_tx (iface=0x4e38d420) at rc/accel/rc_mlx5_iface.c:98
#16 uct_rc_mlx5_iface_progress (arg=0x4e38d420) at rc/accel/rc_mlx5_iface.c:133
#17 0x0000200002ff4a14 in ucs_callbackq_dispatch (cbq=<optimized out>)
    at /smpi_dev/tjcw/workspace/ibm-smpi-toucan/ompibase/opensrc/myucx/src/ucs/datastruct/callbackq.h:211

tjcw commented 4 years ago

I have another error with 40 ranks on a single IBM POWER9 node, with UCX trace on. The trace leading up to the error is

[1570193296.810368] [f8n02:66989:0]         ucp_mm.c:124  UCX  TRACE registered address 0x20003fcd0080 length 5242880 on md[7] memh[3]=0x40ba2188
[1570193296.810370] [f8n02:66989:0]         ucp_mm.c:124  UCX  TRACE registered address 0x20003fcd0080 length 5242880 on md[8] memh[4]=0x40be3f48
[1570193296.810372] [f8n02:66989:0]         ucp_mm.c:124  UCX  TRACE registered address 0x20003fcd0080 length 5242880 on md[10] memh[5]=0xdeadbeef
[1570193296.811068] [f8n02:66989:0]         ucp_mm.c:124  UCX  TRACE registered address 0x20003fcd0080 length 5242880 on md[11] memh[6]=0x412326f0
[1570193296.811070] [f8n02:66989:0]         ucp_mm.c:341  UCX  DEBUG mapped buffer 0x20003fcd0080 length 5242880 memh 0x4138c870 md_map 0xde4
[1570193296.811072] [f8n02:66989:0]       ucp_rkey.c:200  UCX  TRACE ep 0x2000344c0f50: unpacking rkey with md_map 0xde4
[1570193296.811074] [f8n02:66989:0]          ib_md.c:631  UCX  TRACE unpacked rkey 0xffffffff0008a64d: direct 0x8a64d indirect 0xffffffff
[1570193296.811076] [f8n02:66989:0]       ucp_rkey.c:257  UCX  TRACE rkey[0] for remote md 5 is 0xffffffff0008a64d
[1570193296.811077] [f8n02:66989:0]          ib_md.c:631  UCX  TRACE unpacked rkey 0xffffffff000c4257: direct 0xc4257 indirect 0xffffffff
[1570193296.811079] [f8n02:66989:0]       ucp_rkey.c:257  UCX  TRACE rkey[1] for remote md 6 is 0xffffffff000c4257
[1570193296.811080] [f8n02:66989:0]          ib_md.c:631  UCX  TRACE unpacked rkey 0xffffffff0000bcce: direct 0xbcce indirect 0xffffffff
[1570193296.811082] [f8n02:66989:0]       ucp_rkey.c:257  UCX  TRACE rkey[2] for remote md 7 is 0xffffffff0000bcce
[1570193296.811083] [f8n02:66989:0]          ib_md.c:631  UCX  TRACE unpacked rkey 0xffffffff000448d4: direct 0x448d4 indirect 0xffffffff
[1570193296.811085] [f8n02:66989:0]       ucp_rkey.c:257  UCX  TRACE rkey[3] for remote md 8 is 0xffffffff000448d4
[1570193296.811086] [f8n02:66989:0]       ucp_rkey.c:257  UCX  TRACE rkey[4] for remote md 10 is 0xdeadbeef
[1570193296.811088] [f8n02:66989:0]        knem_md.c:219  UCX  TRACE unpacked rkey: key 0x412327e0 cookie 0x722707b4805523e address 2000437d0080
[1570193296.811089] [f8n02:66989:0]       ucp_rkey.c:257  UCX  TRACE rkey[5] for remote md 11 is 0x412327e0
[1570193296.811091] [f8n02:66989:0]       ucp_rkey.c:488  UCX  TRACE rkey 0x4144fb40 ep 0x2000344c0f50 @ cfg[2] basic_rma: lane[0] rkey 0xffffffff000c4257 sw_amo: lane[3] rkey 0xffffffffffffffff
[1570193296.811094] [f8n02:66989:0]       rma_send.c:311  UCX  REQ   get_nb buffer 0x20003fcd0080 length 5242880 remote_addr 2000437d0080 rkey 0x4144fb40 from f8n02:66971 cb 0x200002fdbea0
[1570193296.811095] [f8n02:66989:0]       rma_send.c:184  UCX  REQ   allocated request 0x410d6280
[1570193296.811097] [f8n02:66989:0]         ucp_mm.c:124  UCX  TRACE registered address 0x20003fcd0080 length 5242880 on md[6] memh[0]=0x40b603c8
[1570193296.811098] [f8n02:66989:0]    ucp_request.c:227  UCX  REQ   req 0x410d6280: mem reg md_map 0x40/0x40
[1570193296.811100] [f8n02:66989:0]          rma.inl:30   UCX  REQ   returning request 0x410d6280, status Success
[1570193296.820385] [f8n02:66989:0]    ib_mlx5_log.c:139  UCX  DEBUG Local protection on mlx5_3:1/IB (synd 0x4 vend 0x36 hw_synd 0/0)
[1570193296.820385] [f8n02:66989:0]    ib_mlx5_log.c:139  UCX  DEBUG RC QP 0x120d8 wqe[42]: opcode RDMA_READ
[1570193296.820412] [f8n02:66989:0]    ucp_request.c:195  UCX  REQ   req 0x410d6280: mem dereg buffer 0/1 md_map 0x40
[1570193296.820414] [f8n02:66989:0]         ucp_mm.c:71   UCX  TRACE de-registering memh[0]=0x40b603c8 from md[6]
[1570193296.820416] [f8n02:66989:0]  ucp_request.inl:96   UCX  REQ   completing send request 0x410d6280 (0x410d6368) ----c-- Input/output error
[1570193296.820418] [f8n02:66989:0]  ucp_request.inl:98   UCX  REQ   Driving callback
[1570193296.820420] [f8n02:66989:0]         ucp_am.c:1339 UCX  TRACE AM RENDEZVOUS ucp_am_rendezvous_get_completion_callback request=0x410d6368 req=0x410d6280 status=-3

and the backtrace is

(gdb) where
#0  0x000020000061f568 in pause () from /lib64/libc.so.6
#1  0x000020000310b224 in ucs_debug_freeze () at debug/debug.c:710
#2  0x000020000310ed50 in ucs_error_freeze (
    message=0x7fffd2984370 "Assertion `status == UCS_OK' failed")
    at debug/debug.c:829
#3  ucs_handle_error (
    message=0x7fffd2984370 "Assertion `status == UCS_OK' failed")
    at debug/debug.c:992
#4  0x000020000310afd8 in ucs_fatal_error_message (
    file=0x200003049238 "core/ucp_am.c", line=<optimized out>, 
    function=<optimized out>, 
    message_buf=0x7fffd2984370 "Assertion `status == UCS_OK' failed")
    at debug/assert.c:33
#5  0x000020000310b1a4 in ucs_fatal_error_format (
    file=0x200003049238 "core/ucp_am.c", line=<optimized out>, 
    function=0x200003048fb0 <__FUNCTION__.14476> "ucp_am_rendezvous_get_completion_callback", format=<optimized out>) at debug/assert.c:49
#6  0x0000200002fdc900 in ucp_am_rendezvous_get_completion_callback (
    request=0x410d6368, status=<optimized out>) at core/ucp_am.c:1340
#7  0x0000200002ffdfe8 in ucp_request_complete_send (status=UCS_ERR_IO_ERROR, 
    req=0x410d6280)
    at /smpi_dev/tjcw/workspace/ibm-smpi-toucan/ompibase/opensrc/myucx/src/ucp/core/ucp_request.inl:98
#8  ucp_rma_request_zcopy_completion (self=0x410d6310, status=<optimized out>)
    at rma/rma_send.c:117
#9  0x00002000276a79ac in uct_invoke_completion (status=<optimized out>, 
    comp=0x410d6310)
    at /smpi_dev/tjcw/workspace/ibm-smpi-toucan/ompibase/opensrc/myucx/src/uct/base/uct_iface.h:646
#10 uct_rc_txqp_purge_outstanding (txqp=0x414576d0, status=<optimized out>, 
    is_log=<optimized out>) at rc/base/rc_ep.c:329
#11 0x00002000276c2110 in uct_rc_mlx5_ep_handle_failure (ep=0x414576c0, 
    status=<optimized out>) at rc/accel/rc_mlx5_ep.c:969
#12 0x00002000276c3d18 in uct_rc_mlx5_iface_handle_failure (
    ib_iface=0x40d2df20, arg=0x20002b24e6c0, status=<optimized out>)
    at rc/accel/rc_mlx5_iface.c:212
#13 0x00002000276a3c38 in uct_ib_mlx5_check_completion (iface=0x40d2df20, 
    cq=<optimized out>, cqe=0x20002b24e6c0) at mlx5/ib_mlx5.c:340
#14 0x00002000276c57a8 in uct_ib_mlx5_poll_cq (cq=0x40d365f8, iface=0x40d2df20)
    at /smpi_dev/tjcw/workspace/ibm-smpi-toucan/ompibase/opensrc/myucx/src/uct/ib/mlx5/ib_mlx5.inl:38
#15 uct_rc_mlx5_iface_poll_tx (iface=0x40d2df20) at rc/accel/rc_mlx5_iface.c:98
#16 uct_rc_mlx5_iface_progress (arg=0x40d2df20) at rc/accel/rc_mlx5_iface.c:133
#17 0x0000200002ff4a14 in ucs_callbackq_dispatch (cbq=<optimized out>)
    at /smpi_dev/tjcw/workspace/ibm-smpi-toucan/ompibase/opensrc/myucx/src/ucs/datastruct/callbackq.h:211

Again the transfer seems to be using IB, whereas I was expecting knem. This error is repeatable. The application performs various collectives over the 40 ranks, with various algorithms and increasing data sizes. The error occurs on an Allgather collective with a binomial algorithm where each rank has 262144 bytes to transfer; this results in RDMA 'get' requests for 5242880 bytes. Several ranks fail at the same place.

tjcw commented 4 years ago

I'm doing some more testing, this time with 120 ranks on 6 nodes of POWER8. Now I'm getting a hang, with or without trace. I think the trace is being buffered, with the result that I can't see the trace immediately leading up to the hang. Is there a way of forcing all the trace lines out, e.g. by putting an 'fflush' after each line of trace is written?
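
What I have in mind is the usual pattern of flushing the stream right after each record is written, e.g. (an illustration of the idea only, not the actual UCX logging code):

    #include <stdio.h>

    /* Flush the trace stream after every record so the last lines before a
     * hang are not left sitting in stdio buffers. */
    static void trace_line(FILE *fp, const char *msg)
    {
        fprintf(fp, "%s\n", msg);
        fflush(fp);   /* force the line out immediately */
    }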

tjcw commented 4 years ago

I added an 'fflush' to flush the trace. Now I can see the hang reported here: https://github.com/openucx/ucx/issues/4036#issuecomment-539997107 . One of the ranks does a ucp_get_nb for a 4-byte region on a different node. The return value is a pointer (not UCS_OK or an error indication), but the completion callback for this request is never driven. A relevant section of the trace on this rank is

[1570700219.437898] [c712f6n02:5417 :0]       rma_send.c:311  UCX  REQ   get_nb buffer 0x1000460bb00 length 4 remote_addr 3fffc4546788 rkey 0x100045e6080 from c712f6n06:16846 cb 0x100002fabef0
[1570700219.437903] [c712f6n02:5417 :0]       rma_send.c:184  UCX  REQ   allocated request 0x100044c0f00
[1570700219.437908] [c712f6n02:5417 :0]          rma.inl:30   UCX  REQ   returning request 0x100044c0f00, status Success
[1570700219.437913] [c712f6n02:5417 :0]         ucp_am.c:1235 UCX  TRACE AM RENDEZVOUS ucp_get_nb returns 0x100044c0fe8
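
For reference, the completion pattern I expect for that get is sketched below, assuming the worker, endpoint, buffer, remote address, rkey and callback shown in the trace above; in the hang, the request's status never leaves UCS_INPROGRESS and the callback is never driven:

    #include <ucp/api/ucp.h>

    /* Expected completion pattern for the 4-byte get (sketch). */
    static void get_and_wait(ucp_worker_h worker, ucp_ep_h ep, void *buffer,
                             uint64_t remote_addr, ucp_rkey_h rkey,
                             ucp_send_callback_t cb)
    {
        ucs_status_ptr_t sptr = ucp_get_nb(ep, buffer, 4, remote_addr, rkey, cb);

        if (UCS_PTR_IS_PTR(sptr)) {
            /* a request was allocated; progress the worker until it completes */
            while (ucp_request_check_status(sptr) == UCS_INPROGRESS) {
                ucp_worker_progress(worker);
            }
            ucp_request_free(sptr);
        }
        /* UCS_OK would mean immediate completion; an error pointer would mean
         * the get could not be posted at all. */
    }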

tjcw commented 4 years ago

I have adjusted my UCP code so that active messages transferring less than 65536 bytes flow over the old protocol. Now, with my 120-rank 6-node test case, the send callback for ucp_am_send_nb is driven with status=-3 (UCS_ERR_IO_ERROR). UCX trace shows the error occurring while UCX is wiring up connections on behalf of the application; the last several lines from the trace are

[1570702875.300909] [c712f6n06:133141:0]      wireup_ep.c:658  UCX  TRACE ep 0x3fff71380480: wireup ep 0x1001ac7cca0 is remote-connected
[1570702875.300915] [c712f6n06:133141:0]      wireup_ep.c:77   UCX  TRACE ep 0x3fff71380480: switching wireup_ep 0x1001ac7cca0 to ready state
[1570702875.300920] [c712f6n06:133141:0]      wireup_ep.c:376  UCX  DEBUG ep 0x3fff71380480: destroy wireup ep 0x1001ac7cca0
[1570702875.300965] [c712f6n06:133141:0]        mpool.inl:81   UCX  TRACE Returning obj=0x3fff54ffcfc8 to mpool=0x1001aae6fc8
[1570702875.300970] [c712f6n06:133141:0]        mpool.inl:82   UCX  TRACE Returning obj=0x3fff54ffcfc8 to mpool name=rc_send_desc
[1570702875.300976] [c712f6n06:133141:0]        mpool.inl:81   UCX  TRACE Returning obj=0x3fff54ffaf48 to mpool=0x1001aae6fc8
[1570702875.300980] [c712f6n06:133141:0]        mpool.inl:82   UCX  TRACE Returning obj=0x3fff54ffaf48 to mpool name=rc_send_desc
[1570702875.300986] [c712f6n06:133141:0]        mpool.inl:81   UCX  TRACE Returning obj=0x3fff54ff8ec8 to mpool=0x1001aae6fc8
[1570702875.300991] [c712f6n06:133141:0]        mpool.inl:82   UCX  TRACE Returning obj=0x3fff54ff8ec8 to mpool name=rc_send_desc
[1570702875.300996] [c712f6n06:133141:0]        mpool.inl:81   UCX  TRACE Returning obj=0x3fff54ff6e48 to mpool=0x1001aae6fc8
[1570702875.301001] [c712f6n06:133141:0]        mpool.inl:82   UCX  TRACE Returning obj=0x3fff54ff6e48 to mpool name=rc_send_desc
[1570702875.301006] [c712f6n06:133141:0]        mpool.inl:81   UCX  TRACE Returning obj=0x3fff54ff4dc8 to mpool=0x1001aae6fc8
[1570702875.301011] [c712f6n06:133141:0]        mpool.inl:82   UCX  TRACE Returning obj=0x3fff54ff4dc8 to mpool name=rc_send_desc
[1570702875.301637] [c712f6n06:133141:0]    ib_mlx5_log.c:139  UCX  DEBUG Local protection on mlx5_0:1/IB (synd 0x4 vend 0x56 hw_synd 0/0)
[1570702875.301637] [c712f6n06:133141:0]    ib_mlx5_log.c:139  UCX  DEBUG DCI QP 0xc3b9 wqe[0]: opcode SEND
[1570702875.301660] [c712f6n06:133141:0]    ucp_request.c:195  UCX  REQ   req 0x1001ac71a80: mem dereg buffer 0/2 md_map 0x40
[1570702875.301665] [c712f6n06:133141:0]         ucp_mm.c:71   UCX  TRACE de-registering memh[0]=0x1001a95dc28 from md[6]
[1570702875.301673] [c712f6n06:133141:0]    ucp_request.c:195  UCX  REQ   req 0x1001ac71a80: mem dereg buffer 1/2 md_map 0x40
[1570702875.301678] [c712f6n06:133141:0]         ucp_mm.c:71   UCX  TRACE de-registering memh[0]=0x1001a95dc28 from md[6]
[1570702875.301685] [c712f6n06:133141:0]  ucp_request.inl:96   UCX  REQ   completing send request 0x1001ac71a80 (0x1001ac71b68) ----c-- Input/output error
[1570702875.301690] [c712f6n06:133141:0]  ucp_request.inl:98   UCX  REQ   Driving callback

I think the application is trying to send the same data buffer to multiple ranks, i.e. there are several ucp_am_send_nb calls with iovecs which refer to the same areas of memory. Should UCP be able to handle this?
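
For reference, the pattern in question is roughly the following: one data buffer, described by the same iovec, is handed to several ucp_am_send_nb calls, one per destination endpoint (a sketch with illustrative names; AM_ID and the callback are placeholders):

    #include <ucp/api/ucp.h>

    #define AM_ID 1   /* illustrative active-message id */

    /* Sketch: send the same memory region to several endpoints. The buffer
     * must stay unmodified until every send has completed. */
    static void send_same_buffer(ucp_ep_h *eps, int num_eps, void *buffer,
                                 size_t length, ucp_send_callback_t cb)
    {
        ucp_dt_iov_t iov;
        int i;

        iov.buffer = buffer;
        iov.length = length;

        for (i = 0; i < num_eps; ++i) {
            /* every call refers to the same area of memory */
            ucs_status_ptr_t sptr = ucp_am_send_nb(eps[i], AM_ID, &iov, 1,
                                                   ucp_dt_make_iov(), cb, 0);
            if (UCS_PTR_IS_ERR(sptr)) {
                /* posting this send failed outright */
            }
            /* otherwise: UCS_OK (completed immediately) or a request pointer
             * (in flight; completion is reported via cb) */
        }
    }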