
DAOS: crash in rxm_ep_send_common()/rxm_ep_msg_normal_send()/fi_send() #6743

Closed: bfaccini closed this issue 1 year ago

bfaccini commented 3 years ago

Running DAOS v1.2.0 with libfabric v1.12.0, we have triggered a crash/SEGV with the following backtrace:

#0  0x00007fb3963b437f in fi_send (context=<optimized out>, dest_addr=0, desc=0x7fb37838a7c0, len=124, buf=0x2aaaabebdda0, ep=0x0) at include/rdma/fi_endpoint.h:291
#1  rxm_ep_msg_normal_send (rxm_conn=0x7fb37b1356a0, context=<optimized out>, desc=0x7fb37838a7c0, pkt_size=124, tx_pkt=0x2aaaabebdda0) at prov/rxm/src/rxm_ep.c:1133
#2  rxm_ep_send_common (rxm_ep=rxm_ep@entry=0x7fb37804a7f0, rxm_conn=0x7fb37b1356a0, iov=iov@entry=0x48b0260, desc=desc@entry=0x48b0248, count=count@entry=1, context=context@entry=0x7fb3781a35c8, data=data@entry=0, flags=83886080, 
    tag=tag@entry=1, op=op@entry=1 '\001') at prov/rxm/src/rxm_ep.c:1606
#3  0x00007fb3963b4e55 in rxm_ep_tsend (ep_fid=0x7fb37804a7f0, buf=<optimized out>, len=<optimized out>, desc=0x7fb378048910, dest_addr=<optimized out>, tag=1, context=0x7fb3781a35c8) at prov/rxm/src/rxm_ep.c:2255
#4  0x00007fb39b7ae46a in fi_tsend (context=0x7fb3781a35c8, tag=<optimized out>, dest_addr=<optimized out>, desc=<optimized out>, len=<optimized out>, buf=<optimized out>, ep=<optimized out>) at /usr/include/rdma/fi_tagged.h:114
#5  na_ofi_cq_process_retries (context=0x7fb378042fd0) at /usr/src/debug/mercury-2.0.1rc1/src/na/na_ofi.c:3287
#6  na_ofi_progress (na_class=0x7fb37802bde0, context=0x7fb378042fd0, timeout=0) at /usr/src/debug/mercury-2.0.1rc1/src/na/na_ofi.c:5051
#7  0x00007fb39b7a5b89 in NA_Progress (na_class=<optimized out>, context=<optimized out>, timeout=<optimized out>) at /usr/src/debug/mercury-2.0.1rc1/src/na/na.c:1170
#8  0x00007fb39b9ce2d0 in hg_core_progress_na (na_class=0x7fb37802bde0, na_context=0x7fb378042fd0, timeout=0, progressed_ptr=progressed_ptr@entry=0x48b0900 "") at /usr/src/debug/mercury-2.0.1rc1/src/mercury_core.c:3894
#9  0x00007fb39b9d00df in hg_core_poll (progressed_ptr=<synthetic pointer>, timeout=<optimized out>, context=0x7fb3780487d0) at /usr/src/debug/mercury-2.0.1rc1/src/mercury_core.c:3836
#10 hg_core_progress (context=0x7fb3780487d0, timeout=0) at /usr/src/debug/mercury-2.0.1rc1/src/mercury_core.c:3691
#11 0x00007fb39b9d531b in HG_Core_progress (context=<optimized out>, timeout=<optimized out>) at /usr/src/debug/mercury-2.0.1rc1/src/mercury_core.c:5057
#12 0x00007fb39b9c7cf2 in HG_Progress (context=<optimized out>, timeout=<optimized out>) at /usr/src/debug/mercury-2.0.1rc1/src/mercury.c:2020
#13 0x00007fb39e47c061 in crt_hg_progress () from /lib64/libcart.so.4
#14 0x00007fb39e43f765 in crt_progress () from /lib64/libcart.so.4
#15 0x0000000000421c35 in dss_srv_handler ()
#16 0x00007fb39d3cac6a in ABTD_ythread_func_wrapper () from /lib64/libabt.so.1
#17 0x00007fb39d3cae11 in make_fcontext () from /lib64/libabt.so.1
#18 0x0000000000000000 in ?? ()

The direct cause of the SEGV is that the "msg_ep" field of "struct rxm_conn" is unexpectedly NULL:

$14 = {handle = {cmap = 0x7fb37805b220, state = 0x4, key = 0x1380006a, remote_key = 0x100001, fi_addr = 0x135, peer = 0x0}, msg_ep = 0x0, inject_pkt = 0x0, inject_data_pkt = 0x0, tinject_pkt = 0x0, tinject_data_pkt = 0x0, 
  deferred_conn_entry = {next = 0x7fb37804a9e8, prev = 0x7fb37b165c78}, deferred_tx_queue = {next = 0x7fb37b135708, prev = 0x7fb37b135708}, sar_rx_msg_list = {next = 0x7fb37b135718, prev = 0x7fb37b135718}, sar_deferred_rx_msg_list = {
    next = 0x7fb37b135728, prev = 0x7fb37b135728}, rndv_tx_credits = 0x10}

even though this is, after all, expected for a cmap handle in RXM_CMAP_SHUTDOWN state!

It looks like PR #5424 may not have handled all the cleanup needs.
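
For reference, fi_send() (frame #0 above) is a thin static-inline wrapper that dereferences the endpoint without any NULL check, so a NULL msg_ep faults immediately. A paraphrased sketch of the wrapper in include/rdma/fi_endpoint.h (illustrative, not verbatim source):

    /* ep arrives here as rxm_conn->msg_ep; with msg_ep == NULL the
     * ep->msg load below is the fault seen in frame #0. */
    static inline ssize_t
    fi_send(struct fid_ep *ep, const void *buf, size_t len, void *desc,
            fi_addr_t dest_addr, void *context)
    {
            return ep->msg->send(ep, buf, len, desc, dest_addr, context);
    }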

bfaccini commented 3 years ago

BTW, looking at the source code, I wonder whether the rxm_conn_close() call before rxm_flush_msg_cq() in rxm_conn_handle_notify() (prov/rxm/src/rxm_conn.c) is necessary, since rxm_conn_close() will be called again later from rxm_conn_free():

   1160 static int rxm_conn_handle_notify(struct fi_eq_entry *eq_entry)
   1161 {
   1162         struct rxm_cmap *cmap;
   1163         struct rxm_cmap_handle *handle;
   1164 
   1165         FI_INFO(&rxm_prov, FI_LOG_EP_CTRL, "notify event %" PRIu64 "\n",
   1166                 eq_entry->data);
   1167 
   1168         if ((enum rxm_cmap_signal) eq_entry->data != RXM_CMAP_FREE)
   1169                 return -FI_EOTHER;
   1170 
   1171         handle = eq_entry->context;
   1172         assert(handle->state == RXM_CMAP_SHUTDOWN);
   1173         FI_DBG(&rxm_prov, FI_LOG_EP_CTRL, "freeing handle: %p\n", handle);
   1174         cmap = handle->cmap;
   1175 
   1176         rxm_conn_close(handle);  <<<<<<<<<<< ????
   1177 
   1178         // after closing the connection, we need to flush any dangling references to the
   1179         // handle from msg_cq entries that have not been cleaned up yet, otherwise we
   1180         // could run into problems during CQ cleanup.  these entries will be errored so
   1181         // keep reading through EAVAIL.
   1182         rxm_flush_msg_cq(cmap->ep);
   1183 
   1184         if (handle->peer) {
   1185                 dlist_remove(&handle->peer->entry);
   1186                 free(handle->peer);
   1187                 handle->peer = NULL;
   1188         } else {
   1189                 cmap->handles_av[handle->fi_addr] = NULL;
   1190         }
   1191         rxm_conn_free(handle);
   1192         return 0;
   1193 }
   1194 
"prov/rxm/src/rxm_conn.c" line 1176 of 1588 --74%-- col 2-9
shefty commented 3 years ago

rxm_conn_close() calls fi_close() on the msg endpoint. That ends up completing any outstanding operations (including posted receives) as canceled. After rxm_conn_close(), rxm_flush_msg_cq() should remove those completions, as indicated in the comment. So that part of the flow is needed.
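
To illustrate the flush step: the drain loop must keep reading through -FI_EAVAIL, because operations canceled by fi_close() surface as error completions rather than normal ones, and fi_cq_read() alone would stop at the first of them. A minimal sketch of that pattern (drain_msg_cq is a hypothetical helper, not the rxm implementation):

    #include <string.h>
    #include <rdma/fi_cq.h>

    static void drain_msg_cq(struct fid_cq *cq)
    {
            struct fi_cq_data_entry comp;
            struct fi_cq_err_entry err;
            ssize_t ret;

            for (;;) {
                    ret = fi_cq_read(cq, &comp, 1);
                    if (ret > 0)
                            continue;       /* stale completion: discard */
                    if (ret == -FI_EAVAIL) {
                            /* canceled ops are reported as error entries */
                            memset(&err, 0, sizeof err);
                            if (fi_cq_readerr(cq, &err, 0) > 0)
                                    continue;
                    }
                    break;                  /* -FI_EAGAIN: queue drained */
            }
    }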

bfaccini commented 3 years ago

Well, OK, so since you think my assumption was wrong, does anybody from the libfabric team have an idea about what could go wrong to generate this kind of crash/SEGV?

shefty commented 3 years ago

The CM code in rxm is extremely complex. PR #6833 replaces the implementation and simplifies the code. Those changes aren't quite ready, and I haven't had time to debug the issues there yet. The locking in #6833 is much easier to follow, with proper reference counting on peer addresses. Honestly, I think continuing down that path and getting it into shape to merge is the ideal approach.

There were several fixes applied at the tcp layer to handle shutdown better and avoid reporting completions for internal transfers up to the app. Those are already in tree.

nikhilnanal commented 2 years ago

The CM code in rxm has been significantly modified. Does this issue still occur with the latest libfabric main (master) branch?

bfaccini commented 2 years ago

Do you know which libfabric version will integrate (or already integrates) the changes from #6833?

shefty commented 2 years ago

It is in v1.13.2rc1 and will be in v1.13.2 or v1.14.0 (November release).

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 360 days with no activity. Remove stale label or comment, otherwise it will be closed in 7 days.