Closed: soumagne closed this issue 7 years ago.
This is an issue with Verbs (and perhaps GNI), which use sockets to set up the connection and then switch to native communication once it is established.
Is adding volatile to the CCI connection valid? I thought volatile was for memory that the underlying hardware may change, not something intended for threads sharing memory.
I believe the correct fix is to change CCI to cache early receives and queue them until the ACCEPT is ready and then move them to the completion queue after the ACCEPT event.
Can you try this uncompiled and untested patch?
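Scott's actual patch is not shown here. Purely to illustrate the queue-until-ACCEPT idea described above, a minimal sketch with hypothetical names (not CCI's internal structures): receives that arrive before the ACCEPT completes are parked on a per-connection list and only completed once the ACCEPT event has been handled.

```c
#include <sys/queue.h>
#include <pthread.h>
#include <stdbool.h>

struct early_rx {
	void *data;
	TAILQ_ENTRY(early_rx) entry;
};

struct conn_state {
	bool accepted;                    /* set once the ACCEPT event fires        */
	TAILQ_HEAD(, early_rx) early_rxs; /* receives that beat the ACCEPT          */
	pthread_mutex_t lock;             /* (head/lock assumed initialized earlier) */
};

/* Called for each incoming receive: park it if the ACCEPT has not completed. */
static void deliver_or_queue(struct conn_state *c, struct early_rx *rx,
                             void (*complete)(struct early_rx *))
{
	bool queued = false;

	pthread_mutex_lock(&c->lock);
	if (!c->accepted) {
		TAILQ_INSERT_TAIL(&c->early_rxs, rx, entry); /* too early: park it */
		queued = true;
	}
	pthread_mutex_unlock(&c->lock);
	if (!queued)
		complete(rx); /* normal path */
}

/* Called when the ACCEPT event is handled: drain anything that arrived early. */
static void on_accept(struct conn_state *c, void (*complete)(struct early_rx *))
{
	TAILQ_HEAD(, early_rx) pending = TAILQ_HEAD_INITIALIZER(pending);
	struct early_rx *rx;

	pthread_mutex_lock(&c->lock);
	c->accepted = true;
	TAILQ_CONCAT(&pending, &c->early_rxs, entry);
	pthread_mutex_unlock(&c->lock);

	while ((rx = TAILQ_FIRST(&pending)) != NULL) {
		TAILQ_REMOVE(&pending, rx, entry);
		complete(rx); /* now safe to move to the completion queue */
	}
}
```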
Like Scott said, volatile should not be used to provide any sort of concurrency guarantee; see the discussion at http://stackoverflow.com/questions/4557979/when-to-use-volatile-with-multi-threading. tl;dr: volatile is a weird keyword.
My guess is that volatile is being used here to keep the compiler from optimizing out the redundant !na_cci_addr->cci_addr check in the added while loop; the hack could be removed by rearranging some code.
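For what it's worth, a small sketch of the pattern under discussion (hypothetical names, not the actual na_cci code): C11 atomics, or a mutex/condition variable, are the portable way to wait for the ACCEPT handler to publish the address; volatile only forces the reload and gives no ordering guarantees.

```c
#include <stdatomic.h>
#include <sched.h>

struct lookup_addr {
	_Atomic(void *) cci_addr; /* published by the thread handling the ACCEPT */
};

static void *wait_for_accept(struct lookup_addr *a)
{
	void *p;

	/* Without atomics (or locking), the compiler may hoist this load out
	 * of the loop, which is what the volatile hack was trying to prevent. */
	while ((p = atomic_load_explicit(&a->cci_addr, memory_order_acquire)) == NULL)
		sched_yield(); /* spin until the ACCEPT handler sets it */
	return p;
}
```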
I've observed a couple errors that look like this:
NA: Error in /home/carns/working/src/sds/mercury/src/na/na_cci.c:1697
# handle_recv_unexpected(): peer verbs://172.23.100.194:44619 refcnt 1
Is that related or something different? My line numbers probably differ from master as well, but that is the very first safety check at the top of handle_recv_unexpected(). I'm just trying to figure out if it aligns with this issue.
Yes, it is most likely related.
na_cci_addr->cci_addr is not set yet; it only gets set when the ACCEPT event is handled.
Regarding volatile, I agree; I said exactly the same thing to Xuezhao, but ignore that for now. The patch is not correct anyway; he only sent it so that we can figure out what the issue is.
Hi,
Thanks for looking into the problem!
Yes, the
NA: Error in /home/carns/working/src/sds/mercury/src/na/na_cci.c:1697
# handle_recv_unexpected(): peer verbs://172.23.100.194:44619 refcnt 1
error is exactly the problem caused by the race.
Regarding the volatile, it is not really needed in our testing; you can ignore it. Volatile can sometimes be useful when sharing a variable without lock protection, and it is fine to avoid it if it causes confusion.
The patch I sent is only a quick temporary solution, as we need a fix for next week's DAOS demo.
BTW, NA CCI currently busy-polls, which can cause 100% CPU usage for the progress thread (over IB verbs). Are you planning to optimize it? Maybe this is not the right place to ask, as it is another topic...
On Jun 23, 2016, at 3:13 AM, Liu Xuezhao notifications@github.com wrote:
Hi,
Thanks for looking into the problem!
Yes, the
NA: Error in /home/carns/working/src/sds/mercury/src/na/na_cci.c:1697 # handle_recv_unexpected(): peer verbs://172.23.100.194:44619 refcnt 1
error is exactly the problem caused by the race.
Regarding the volatile, it is not really needed in our testing; you can ignore it. Volatile can sometimes be useful when sharing a variable without lock protection, and it is fine to avoid it if it causes confusion.
That is abusing the semantics.
The patch I sent is only a quick temporary solution, as we need a fix for next week's DAOS demo.
Did you try my CCI patch?
BTW, NA CCI currently busy-polls, which can cause 100% CPU usage for the progress thread (over IB verbs). Are you planning to optimize it? Maybe this is not the right place to ask, as it is another topic…
Is this after the above patch or related to a Mercury patch? Can you provide more info?
"Did you try my CCI patch?" Sorry which patch? I cannot find it on this page.
"Is this after the above patch or related to a Mercury patch? Can you provide more info?" It is not related with any extra patch. Shouldn't it be the behaviour of current mercury master? In na_cci_progress, it polls CCI by calling cci_get_event() which is unblocking and just return CCI_EAGAIN when no event, and continue the polling entill timed value reached. If the progress thread just repeatly call the progress with a timeout then will eat all the cycle, no?
Apparently, GitHub does not like .patch as a file extension.
Re: the busy polling issue, you can find an experimental solution to this by using these branches of CCI and Mercury:
In my testing that is sufficient to prevent busy polling with both the verbs and sm transports in CCI when used with Mercury. The tcp transport in CCI has an orthogonal problem.
Just a word of warning, that Mercury fork/branch is just for experimental purposes; it combines work in progress from multiple pull requests that will probably be integrated in a different order at some point. That said, it works for me :)
Hi, thanks for providing the patch; I was thinking it might be good to fix this in the CCI library.
I tested with the CCI patch just now; our DAOS IOR test hangs, with one rank blocked in: (gdb) bt
Hi Carns,
Good to know about the busy-polling work; our platform is now ready for next week's demo and it is better not to change it too much. I'll take a look at that branch after the demo.
thx
Is this reproducible?
Did the patch fix the issue with early arriving messages?
On Jun 23, 2016, at 11:10 PM, Liu Xuezhao notifications@github.com wrote:
Hi, thanks for providing the patch; I was thinking it might be good to fix this in the CCI library.
I tested with the CCI patch just now; our DAOS IOR test hangs, with one rank blocked in: (gdb) bt
```
#0  0x0000003fece0e7cd in write () from /lib64/libpthread.so.0
#1  0x00007fab4c14ae58 in rdma_get_cm_event () from /usr/lib64/librdmacm.so.1
#2  0x00007fab371c4283 in verbs_get_cm_event (ep=0x2696b10) at ctp_verbs_api.c:2948
#3  0x00007fab371c77f8 in verbs_progress_ep (ep=0x2696b10) at ctp_verbs_api.c:3784
#4  0x00007fab371c8418 in ctp_verbs_get_event (endpoint=0x2696b18, event=0x7fff80017a28) at ctp_verbs_api.c:3819
#5  0x00007fab535f690c in na_cci_progress (na_class=0x243ecb0, context=<optimized out>, timeout=<optimized out>)
```
We mark the vep->rdma_channel as non-blocking, so this should not happen. Can you give more details about when the blocking occurs? Are you sure it is hung in the write?
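For context, an rdma_cm event channel is normally made non-blocking by setting O_NONBLOCK on its fd; a sketch of that (assumed to be roughly what the verbs transport does, not copied from it):

```c
#include <fcntl.h>
#include <rdma/rdma_cma.h>

/* With O_NONBLOCK set on channel->fd, rdma_get_cm_event() is expected to
 * return -1 with errno EAGAIN when no CM event is pending, instead of
 * blocking the caller. */
static int make_channel_nonblocking(struct rdma_event_channel *channel)
{
	int flags = fcntl(channel->fd, F_GETFL, 0);

	if (flags == -1)
		return -1;
	return fcntl(channel->fd, F_SETFL, flags | O_NONBLOCK);
}
```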
Sorry for the late response, I just saw this message.
We can still observe the hang after updating to the latest CCI with your patch from this page merged. This hang seems to be a different problem from the original one in this issue.
The hang happens when the client connects to the server; at that moment the client-side backtrace is: (gdb) bt
The server-side trace is: (gdb) bt
at /scratch/DAOS_OPT/src/mercury_github/src/na/na_cci.c:1886
I am not familiar with CCI internals; any findings or suspects from the above info? Thanks.
The hang happens occasionally in our testing (a DAOS internal test, daosbench; roughly 1 run in 10 hits the hang when using 8 DAOS servers and 16 clients, each server with 8 service threads, each with a separate na_class). When the client connects to a server it calls NA_Addr_lookup and then NA_Trigger/NA_Progress to wait for the ACK of the connect request, but it never succeeds.
"We mark the vep->rdma_channel as non-blocking, so this should not happen." In the above backtrace I think it is not actually blocking in write(); the client just calls NA_Progress repeatedly, so that trace shows up. The server side also calls progress() repeatedly.
Ok.
Can you export CCI_DEBUG=conn,msg,info and run again?
Do you think CCI is losing a message and/or event?
Scott
After enabling the CCI debug messages, I have not observed the same hang. After several rounds of testing (about 15~20 times) my login node rebooted, but I think that is not related to CCI or Mercury; it may be caused by the many messages that have to be redirected by ompi (we currently use an experimental ompi version, not a stable release).
But with CCI_DEBUG turned off it usually hits the hang within fewer rounds of testing.
Yes, I suspect it is some kind of message/event loss, but I cannot get a log that proves it yet. The CCI debug messages seem to change the timing of the race that may cause the hang.
On Jul 1, 2016, at 1:31 PM, Liu Xuezhao notifications@github.com wrote:
After enabling the CCI debug messages, I have not observed the same hang. After several rounds of testing (about 15~20 times) my login node rebooted, but I think that is not related to CCI or Mercury; it may be caused by the many messages that have to be redirected by ompi (we currently use an experimental ompi version, not a stable release).
But with CCI_DEBUG turned off it usually hits the hang within fewer rounds of testing.
Can you try with just CCI_DEBUG=msg?
Yes, I suspect it is some kind of message/event loss, but I cannot get a log that proves it yet. The CCI debug messages seem to change the timing of the race that may cause the hang.
When it hangs, does it hang during connection setup or after some communication (MSGs to RMA)?
Hi,
I changed CCI to write its log to a local file to avoid the reboot problem that may be caused by ompi stdout/stderr redirection; now I can get a full log when the hang happens.
As the log is very large, I put it in "/scratch/log" on the boro cluster; I am not sure if you can log in to boro (Jerome can, I think).
The hang happens when the client (log: 1.client.boro-51.cci.log) connects to the server (verbs://192.168.1.37:22775). The server has 8 na_classes with listening addresses 192.168.1.37:[22768, 22775]. The log contains about 7 rounds of testing, so maybe only the tail is useful, but I kept it all there. CCI_DEBUG is "conn,msg,info".
"When it hangs, does it hang during connection setup or after some communication (MSGs to RMA)?" From the application's view, it looks like it hangs during connection setup.
Hit the hang on another run; this time it hangs when the client connects to server verbs://192.168.1.38:22775.
I attached the shortened logs (keeping only the last part) here. From the tail of the client-side log it seems the connection is established normally, but the NA layer did not get the event. The full logs are in "/scratch/log" on the boro cluster.
3.client.boro-52.cci.log.short.txt 3.server.boro-38.cci.log.short.txt
On the client, I do not see any error messages. The last item is a connecting message.
On the server, I find:
cci:63768:ignoring RDMA_CM_EVENT_UNREACHABLE event
and then it marks ~112 connections as closed.
This error should only happen on an “active” side of a connection (i.e. a client calling cci_connect()):
RDMA_CM_EVENT_UNREACHABLE Generated on the active side to notify the user that the remote server is not reachable or unable to respond to a connection request. If this event is generated in response to a UD QP resolution request over InfiniBand, the event status field will contain an errno, if negative, or the status result carried in the IB CM SIDR REP message.
Can you give me more info about this test setup?
Scott
The "marks closed ~112 connections" seems is normal case that the client program finished executing or killed by ctrl+c. "cci:63768:ignoring RDMA_CM_EVENT_UNREACHABLE event" looks abnormal... I reproduced another time, it shows that the "ignoring RDMA_CM_EVENT_UNREACHABLE event" and "marking vconn 0x7fad58200520 closed" log msg only shows after I killed the client-side processes, that say the hang happen before that abnormal log msg. (I did a "sync" before killing the client and observe there is no those msg).
For the test setup, we have 8 daos_server processes, now the only scenario of server calling cci_connect is the rank 0 daos_server will send RPC to other servers to create the target pool. It happens at the very beginning of server startup, and the log is not in the shorted log. But the rank 0 is not boro-38 which is the node server log generated. Or say the boro-38 server did not call the cci_connect. And in the testing the client nodes are completely different nodes against server nodes.
One interesting thing is, other servers' node does not has the "ignoring RDMA_CM_EVENT_UNREACHABLE event" msg logged. And every time the hang happen, the server which the client cannot connect with will log that message once.
Can you try this patch? It does a few things:
I applied the patch and reproduced it. This time the client still hangs (client boro-52 cannot connect to boro-39, verbs://192.168.1.39:22770).
Client got "cci:22686:ignoring RDMA_CM_EVENT_CONNECT_ERROR event" log.
The server-side segfault is:
```
Program terminated with signal 11, Segmentation fault.
1820	na_cci_op_id_t *na_cci_op_id = na_cci_addr->na_cci_op_id;
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.166.el6_7.7.x86_64 libcmocka-1.0.1-1.el6.x86_64 libcxgb3-1.3.1-3.el6.x86_64 libibverbs-1.1.8-4.el6.x86_64 libipathverbs-1.3-3.el6.x86_64 libmlx4-1.0.6-7.el6.x86_64 libmlx5-1.0.2-1.el6.x86_64 libmthca-1.0.6-4.el6.x86_64 libnes-1.1.4-2.el6.x86_64 libnl-1.1.4-2.el6.x86_64 librdmacm-1.0.19.1-1.1.el6_7.x86_64 libuuid-2.17.2-12.18.el6.x86_64 numactl-2.0.9-2.el6.x86_64 openssl-1.0.1e-42.el6_7.4.x86_64 zlib-1.2.3-29.el6.x86_64
(gdb) bt
(gdb) f 0
1820	na_cci_op_id_t *na_cci_op_id = na_cci_addr->na_cci_op_id;
(gdb) p na_cci_addr
$1 = (na_cci_addr_t *) 0x0
```
The shortened logs are attached: 7.client.boro-52.cci.log.shorted.txt 7.server.boro-39.cci.log.shorted.txt
On Jul 3, 2016, at 9:26 PM, Liu Xuezhao notifications@github.com wrote:
I applied the patch and reproduced it. This time the client still hangs (client boro-52 cannot connect to boro-39, verbs://192.168.1.39:22770).
Client got "cci:22686:ignoring RDMA_CM_EVENT_CONNECT_ERROR event" log.
Ok, I will add a handler for this as well. Patch coming soon.
The server-side segfault is:
```
Program terminated with signal 11, Segmentation fault.
#0  handle_connect (na_class=0x7efdd0004950, context=0x7efdd0105bb0, timeout=<optimized out>) at /scratch/DAOS_OPT/src/mercury_github/src/na/na_cci.c:1820
1820	na_cci_op_id_t *na_cci_op_id = na_cci_addr->na_cci_op_id;
(gdb) bt
#0  handle_connect (na_class=0x7efdd0004950, context=0x7efdd0105bb0, timeout=<optimized out>) at /scratch/DAOS_OPT/src/mercury_github/src/na/na_cci.c:1820
#1  na_cci_progress (na_class=0x7efdd0004950, context=0x7efdd0105bb0, timeout=<optimized out>) at /scratch/DAOS_OPT/src/mercury_github/src/na/na_cci.c:1909
#2  0x00007efdefd67972 in NA_Progress (na_class=0x7efdd0004950, context=0x7efdd0105bb0, timeout=<optimized out>) at /scratch/DAOS_OPT/src/mercury_github/src/na/na.c:1495
#3  0x00007efdeff852af in hg_core_progress_na (context=0x7efdd01062e0, timeout=<optimized out>) at /scratch/DAOS_OPT/src/mercury_github/src/mercury_core.c:1926
#4  hg_core_progress (context=0x7efdd01062e0, timeout=<optimized out>) at /scratch/DAOS_OPT/src/mercury_github/src/mercury_core.c:1982
#5  0x00007efdeff85441 in HG_Core_progress (context=0x7efdd01062e0, timeout=1000) at /scratch/DAOS_OPT/src/mercury_github/src/mercury_core.c:2931
#6  0x00007efdf0c2416a in dtp_hg_progress (hg_ctx=<optimized out>, timeout=<optimized out>) at src/dtp/dtp_hg.c:977
#7  0x00007efdf0c21eeb in dtp_progress (dtp_ctx=0x7efdd00048f0, timeout=-1, cond_cb=0x403910, arg=0x0) at src/dtp/dtp_context.c:229
#8  0x0000000000403b90 in dss_srv_handler (arg=0x863770) at src/server/srv.c:146
#9  0x000000386b007aa1 in start_thread () from /lib64/libpthread.so.0
#10 0x000000386ace893d in clone () from /lib64/libc.so.6
(gdb) f 0
#0  handle_connect (na_class=0x7efdd0004950, context=0x7efdd0105bb0, timeout=<optimized out>) at /scratch/DAOS_OPT/src/mercury_github/src/na/na_cci.c:1820
1820	na_cci_op_id_t *na_cci_op_id = na_cci_addr->na_cci_op_id;
(gdb) p na_cci_addr
$1 = (na_cci_addr_t *) 0x0
```
Can you trigger this again and use the debugger to print the CCI event (p *event)? The event->connect.context is not set. I am curious about the rest of the event.
The shorted log attached. 7.client.boro-52.cci.log.shorted.txt 7.server.boro-39.cci.log.shorted.txt
Can you try to add `case RDMA_CM_EVENT_CONNECT_ERROR:` just above `case RDMA_CM_EVENT_UNREACHABLE:` so it just falls through?
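A sketch of what that fall-through looks like (paraphrased, not the actual CCI source; the real handler dispatches into CCI's connection cleanup rather than just printing):

```c
#include <stdio.h>
#include <rdma/rdma_cma.h>

static void handle_cm_event_sketch(struct rdma_cm_event *cm_evt)
{
	switch (cm_evt->event) {
	case RDMA_CM_EVENT_CONNECT_ERROR:  /* newly added case ...           */
	case RDMA_CM_EVENT_UNREACHABLE:    /* ... falls through to this path */
		/* fail the pending connect and surface the error to the app */
		break;
	default:
		fprintf(stderr, "ignoring %s event\n", rdma_event_str(cm_evt->event));
		break;
	}
}
```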
Hi, I changed it as you suggested (added case RDMA_CM_EVENT_CONNECT_ERROR) and reproduced it again; this time client boro-49 cannot connect to server boro-35 (verbs://192.168.1.35:22775).
The server-side segfault:
```
(gdb) p na_cci_addr
$1 = (na_cci_addr_t *) 0x0
(gdb) p *event
$2 = {type = CCI_EVENT_CONNECT, send = {type = CCI_EVENT_CONNECT, status = CCI_ECONNREFUSED, connection = 0x0, context = 0x0}, recv = {type = CCI_EVENT_CONNECT, len = 111, ptr = 0x0, connection = 0x0}, connect = {type = CCI_EVENT_CONNECT, status = CCI_ECONNREFUSED, context = 0x0, connection = 0x0}, request = {type = CCI_EVENT_CONNECT, data_len = 111, data_ptr = 0x0, attribute = CCI_CONN_ATTR_RO}, accept = {type = CCI_EVENT_CONNECT, status = CCI_ECONNREFUSED, context = 0x0, connection = 0x0}, keepalive = {type = CCI_EVENT_CONNECT, connection = 0x0}, dev_failed = {type = CCI_EVENT_CONNECT, endpoint = 0x0}}
```
And the client gets an NA error message:
```
NA: Error in /scratch/DAOS_OPT/src/mercury_github/src/na/na_cci.c:1855
# handle_connect(): connect to verbs://192.168.1.35:22775 failed with No route to host
```
It looks like there is no route between those two nodes, but the IB-level benchmarks work. And that error only happens occasionally for the same test.
The logs are attached: 9.client.boro-49.cci.log.shorted.txt 9.server.boro-35.cci.log.shorted.txt
Can you remove the last patch and apply this one?
```diff
diff --git a/src/plugins/ctp/verbs/ctp_verbs.h b/src/plugins/ctp/verbs/ctp_verbs.h
index c4fa95b..7ede350 100644
--- a/src/plugins/ctp/verbs/ctp_verbs.h
+++ b/src/plugins/ctp/verbs/ctp_verbs.h
@@ -220,7 +220,7 @@ typedef struct verbs_tx {
 	void *buffer;		/* registered send buffer */
 	uint16_t len;		/* length of buffer */
 	int32_t rdma_slot;	/* slot when using RDMA or -1 */
 	TAILQ_ENTRY(verbs_tx) entry;	/* hang on vep->idle_txs, vdev->queued, vdev->pending */
 	struct verbs_rma_op *rma_op;	/* owning RMA if remote completion msg */
 	struct verbs_tx_pool *tx_pool;	/* owning tx pool */
diff --git a/src/plugins/ctp/verbs/ctp_verbs_api.c b/src/plugins/ctp/verbs/ctp_verbs_api.c
index 5a29ac3..2feda3b 100644
--- a/src/plugins/ctp/verbs/ctp_verbs_api.c
+++ b/src/plugins/ctp/verbs/ctp_verbs_api.c
@@ -1759,6 +1759,8 @@ static int ctp_verbs_accept(cci_event_t * event, const void *context)
 	tx = verbs_get_tx(ep);
 	if (!tx) {
 	verbs_rx_t *rx = NULL;
 	void *context = NULL;
 	conn = cm_evt->id->context;
 	assert(conn);
 	verbs_queue_evt(ep, &rx->evt);
 out:
 	return ret;
@@ -2940,6 +2950,40 @@ verbs_handle_disconnected(cci__ep_t * ep, struct rdma_cm_event *cm_evt)
 	return ret;
 }
+static int
+verbs_handle_conn_unreachable(cci__ep_t * ep, struct rdma_cm_event *cm_evt)
+{
 		break;
 	default:
 		debug(CCI_DB_CONN, "ignoring %s event", rdma_event_str(cm_evt->event));
@@ -3927,11 +3978,15 @@ static int ctp_verbs_return_event(cci_event_t * event)
 	}
 }
 	ret = verbs_post_rx(ep, rx);
```
Thanks for looking into this problem. After applying the patch, I still get a client hang and a server segfault. This time client boro-58 cannot connect to boro-39.
Client-side error messages:
```
NA: Error in /scratch/DAOS_OPT/src/mercury_github/src/na/na_cci.c:1855
src/dtp/dtp_hg.c:46011:38:na_addr_lookup_cb() Return from callback with NA_PROTOCOL_ERROR error
```
Server segfault:
```
Core was generated by `./daos_server -c 8'.
Program terminated with signal 11, Segmentation fault.
1841	na_cci_op_id_t *na_cci_op_id = na_cci_addr->na_cci_op_id;
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.166.el6_7.7.x86_64 libcmocka-1.0.1-1.el6.x86_64 libcxgb3-1.3.1-3.el6.x86_64 libibverbs-1.1.8-4.el6.x86_64 libipathverbs-1.3-3.el6.x86_64 libmlx4-1.0.6-7.el6.x86_64 libmlx5-1.0.2-1.el6.x86_64 libmthca-1.0.6-4.el6.x86_64 libnes-1.1.4-2.el6.x86_64 libnl-1.1.4-2.el6.x86_64 librdmacm-1.0.19.1-1.1.el6_7.x86_64 libuuid-2.17.2-12.18.el6.x86_64 numactl-2.0.9-2.el6.x86_64 openssl-1.0.1e-42.el6_7.4.x86_64 zlib-1.2.3-29.el6.x86_64
(gdb) bt
(gdb) f 0
1841	na_cci_op_id_t *na_cci_op_id = na_cci_addr->na_cci_op_id;
(gdb) p na_cci_addr
$1 = (na_cci_addr_t *) 0x0
(gdb) p *event
$2 = {type = CCI_EVENT_CONNECT, send = {type = CCI_EVENT_CONNECT, status = CCI_ECONNREFUSED, connection = 0x0, context = 0x0}, recv = {type = CCI_EVENT_CONNECT, len = 111, ptr = 0x0, connection = 0x0}, connect = {type = CCI_EVENT_CONNECT, status = CCI_ECONNREFUSED, context = 0x0, connection = 0x0}, request = {type = CCI_EVENT_CONNECT, data_len = 111, data_ptr = 0x0, attribute = CCI_CONN_ATTR_RO}, accept = {type = CCI_EVENT_CONNECT, status = CCI_ECONNREFUSED, context = 0x0, connection = 0x0}, keepalive = {type = CCI_EVENT_CONNECT, connection = 0x0}, dev_failed = {type = CCI_EVENT_CONNECT, endpoint = 0x0}}
```
The logs are attached; there is no special output related to the debug messages added in the patch. 10.client.boro-58.cci.log.txt 10.server.boro-39.cci.log.txt
Can you send the complete client and server logs? You can email them to me (atchleyes at ornl dot gov) if they are too big for the list. Please tar and compress them.
Also, in ctp_verbs_api.c in ctp_verbs_create_endpoint_at(), can you comment out:
```c
	ret = rdma_create_id(vep->rdma_channel, &vep->id_ud, ep, RDMA_PS_UDP);
	if (ret == -1) {
		ret = errno;
		goto out;
	}
```
and
```c
	ret = rdma_bind_addr(vep->id_ud, (struct sockaddr *)&vep->sin);
	if (ret == -1) {
		ret = errno;
		goto out;
	}
```
I want to rule out that we are receiving a stray UD message.
These are not currently used and are safe to remove.
I reproduced it and sent the logs to you by email, thanks.
Hi Xuezhao,
Can you remove all previous patches and try this one? I have tested that it works with two single-threaded hosts. This may not solve your issue, but it fixes how CCI cleans up the RDMA connection.
A process may initiate the tear-down by calling rdma_disconnect(). In this case, it should wait for a RDMA_CM_EVENT_DISCONNECTED event before destroying the QP, etc.
Alternately, a process may get a RDMA_CM_EVENT_DISCONNECTED event (because the peer called rdma_disconnect()). In this case, it then calls rdma_disconnect(), which does not generate a new RDMA_CM_EVENT_DISCONNECTED event, and the process can immediately destroy the QP, etc.
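A sketch of the two teardown orderings described above, using standard librdmacm calls (illustrative only, not CCI's actual cleanup code):

```c
#include <rdma/rdma_cma.h>

/* Path 1: we initiate the disconnect, so wait for our own DISCONNECTED
 * event before destroying the QP and the cm_id. */
static void initiate_teardown(struct rdma_cm_id *id, struct rdma_event_channel *ch)
{
	struct rdma_cm_event *ev = NULL;

	rdma_disconnect(id);
	while (rdma_get_cm_event(ch, &ev) == 0) {
		int done = (ev->event == RDMA_CM_EVENT_DISCONNECTED);

		rdma_ack_cm_event(ev);
		if (done)
			break;
	}
	rdma_destroy_qp(id);
	rdma_destroy_id(id);
}

/* Path 2: the peer disconnected first (we already received DISCONNECTED).
 * Calling rdma_disconnect() here does not generate another DISCONNECTED
 * event, so the QP can be destroyed immediately. */
static void passive_teardown(struct rdma_cm_id *id)
{
	rdma_disconnect(id);
	rdma_destroy_qp(id);
	rdma_destroy_id(id);
}
```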
If you still hit a bug, please send me the full logs via email.
Scott
Thanks Scott, this time it still fails.
The server got these error messages:
cci:72466:ignoring RDMA_CM_EVENT_UNREACHABLE event
cci:72466:ignoring RDMA_CM_EVENT_TIMEWAIT_EXIT event
And the client got: cci:195507:ignoring RDMA_CM_EVENT_CONNECT_ERROR event
I'll send the full logs to you via email.
On Jul 6, 2016, at 9:49 PM, Liu Xuezhao notifications@github.com wrote:
Thanks Scott, this time it still fails.
The server got this error message: cci:72466:ignoring RDMA_CM_EVENT_UNREACHABLE event
We need to figure this one out. The manpage has:
RDMA_CM_EVENT_UNREACHABLE Generated on the active side to notify the user that the remote server is not reachable or unable to respond to a connection request.
Assuming the targeted server is reachable, we need to understand why it is not accepting the connection. Have you verified this server can ping the other server using the ib0 interface?
cci:72466:ignoring RDMA_CM_EVENT_TIMEWAIT_EXIT event
I believe these are harmless:
RDMA_CM_EVENT_TIMEWAIT_EXIT The QP associated with a connection has exited its timewait state and is now ready to be re-used. After a QP has been disconnected, it is maintained in a timewait state to allow any in flight packets to exit the network. After the timewait state has completed, the rdma_cm will report this event.
And client got: cci:195507:ignoring RDMA_CM_EVENT_CONNECT_ERROR event
And we need to understand this one:
RDMA_CM_EVENT_CONNECT_ERROR Indicates that an error has occurred trying to establish a connection. May be generated on the active or passive side of a connection.
I will generate a patch to try to print more info from the CM event.
I'll send the full logs to you via email.
Did either the client or server segfault, or did they simply fail to connect? The 14.log does not show a stack trace.
Also, 14.log only has:
src/dtp/dtp_hg.c:195507:120:dtp_na_addr_lookup_wait() =======================Could not connect to verbs://192.168.1.40:22797 within 20 second, myrank 0, hostname boro-44.boro.hpdd.intel.com. src/dtp/dtp_hg.c:195507:120:dtp_na_addr_lookup_wait() =======================Could not connect to verbs://192.168.1.40:22797 within 20 second, myrank 0, hostname boro-44.boro.hpdd.intel.com.
I cannot find any connection attempts to verbs://192.168.1.40:22797 in the client log. The log is not complete.
For future logs, please only use CCI_DEBUG=conn. This will greatly reduce the logs.
Argh, bad regex. It is right above the connect error message.
While waiting for the cluster access, can you try this patch? You will need to remove the old patch first.
Also, only use CCI_DEBUG=conn. We do not need the other (msg and info) messages.
```diff
diff --git a/src/plugins/ctp/verbs/ctp_verbs.h b/src/plugins/ctp/verbs/ctp_verbs.h
index c4fa95b..3e4bf67 100644
--- a/src/plugins/ctp/verbs/ctp_verbs.h
+++ b/src/plugins/ctp/verbs/ctp_verbs.h
@@ -392,7 +392,9 @@ typedef struct verbs_conn {
 	uint32_t last;		/* last slot used */
 	uint32_t *slots;	/* pointers to buffer headers to poll */
 	uint32_t cci_disconn :1;	/* still needs cci_disconnect */
 	/* for RO connections when using both SendRecv and RDMA */
 	uint16_t seqno;		/* last seqno sent */
diff --git a/src/plugins/ctp/verbs/ctp_verbs_api.c b/src/plugins/ctp/verbs/ctp_verbs_api.c
index 5a29ac3..3f06333 100644
--- a/src/plugins/ctp/verbs/ctp_verbs_api.c
+++ b/src/plugins/ctp/verbs/ctp_verbs_api.c
@@ -2034,6 +2034,7 @@ verbs_insert_conn(cci__conn_t *conn)
 	pthread_rwlock_unlock(&vep->conn_tree_lock);
 	pthread_mutex_lock(&ep->lock);
 	TAILQ_INSERT_TAIL(&vep->conns, vconn, entry);
 	vconn->cci_disconn = 1;
 	pthread_mutex_unlock(&ep->lock);
 	debug(CCI_DB_CONN, "%s [%s]: added conn %p qp_num %u", __func__, ep->uri, (void*)conn, vconn->qp_num);
@@ -2345,20 +2346,23 @@ out:
 static const char *verbs_conn_state_str(verbs_conn_state_t state);
-static int ctp_verbs_disconnect(cci_connection_t * connection)
+static void verbs_destroy_conn(cci__ep_t *ep, cci__conn_t *conn)
 {
 		__func__, (void*) conn, (void*)vconn, vconn->qp_num, conn->uri);
 	}
 	pthread_rwlock_wrlock(&vep->conn_tree_lock);
 	tdelete(&vconn->qp_num, &vep->conn_tree, verbs_compare_u32);
@@ -2377,13 +2381,6 @@ static int ctp_verbs_disconnect(cci_connection_t * connection)
 		free(vconn->conn_req->ptr);
 	}
 	if (vconn->rbuf) {
 		if (vconn->rmr) {
 			ret = ibv_dereg_mr(vconn->rmr);
@@ -2399,12 +2396,49 @@ static int ctp_verbs_disconnect(cci_connection_t * connection)
 		free(vconn->rbuf);
 	}
 	free((void *)conn->uri);
 	free(vconn);
 	free(conn);
 	verbs_destroy_conn(ep, conn);
 }
 	CCI_EXIT;
 	return ret;
 }
+static void
+verbs_handle_timewait_exit(cci__ep_t * ep, struct rdma_cm_event *cm_evt)
+{
```
Actually, please try this one. It prints out some info from the failed connect event.
```diff
diff --git a/src/plugins/ctp/verbs/ctp_verbs.h b/src/plugins/ctp/verbs/ctp_verbs.h
index c4fa95b..3e4bf67 100644
--- a/src/plugins/ctp/verbs/ctp_verbs.h
+++ b/src/plugins/ctp/verbs/ctp_verbs.h
@@ -392,7 +392,9 @@ typedef struct verbs_conn {
 	uint32_t last;		/* last slot used */
 	uint32_t *slots;	/* pointers to buffer headers to poll */
 	uint32_t cci_disconn :1;	/* still needs cci_disconnect */
 	/* for RO connections when using both SendRecv and RDMA */
 	uint16_t seqno;		/* last seqno sent */
diff --git a/src/plugins/ctp/verbs/ctp_verbs_api.c b/src/plugins/ctp/verbs/ctp_verbs_api.c
index 5a29ac3..aec4a61 100644
--- a/src/plugins/ctp/verbs/ctp_verbs_api.c
+++ b/src/plugins/ctp/verbs/ctp_verbs_api.c
@@ -2034,6 +2034,7 @@ verbs_insert_conn(cci__conn_t *conn)
 	pthread_rwlock_unlock(&vep->conn_tree_lock);
 	pthread_mutex_lock(&ep->lock);
 	TAILQ_INSERT_TAIL(&vep->conns, vconn, entry);
 	vconn->cci_disconn = 1;
 	pthread_mutex_unlock(&ep->lock);
 	debug(CCI_DB_CONN, "%s [%s]: added conn %p qp_num %u", __func__, ep->uri, (void*)conn, vconn->qp_num);
@@ -2345,20 +2346,23 @@ out:
 static const char *verbs_conn_state_str(verbs_conn_state_t state);
-static int ctp_verbs_disconnect(cci_connection_t * connection)
+static void verbs_destroy_conn(cci__ep_t *ep, cci__conn_t *conn)
 {
 		__func__, (void*) conn, (void*)vconn, vconn->qp_num, conn->uri);
 	}
 	pthread_rwlock_wrlock(&vep->conn_tree_lock);
 	tdelete(&vconn->qp_num, &vep->conn_tree, verbs_compare_u32);
@@ -2377,13 +2381,6 @@ static int ctp_verbs_disconnect(cci_connection_t * connection)
 		free(vconn->conn_req->ptr);
 	}
 	if (vconn->rbuf) {
 		if (vconn->rmr) {
 			ret = ibv_dereg_mr(vconn->rmr);
@@ -2399,12 +2396,49 @@ static int ctp_verbs_disconnect(cci_connection_t * connection)
 		free(vconn->rbuf);
 	}
 	free((void *)conn->uri);
 	free(vconn);
 	free(conn);
 	verbs_destroy_conn(ep, conn);
 }
 	CCI_EXIT;
 	return ret;
 }
+static void
+verbs_handle_timewait_exit(cci__ep_t * ep, struct rdma_cm_event *cm_evt)
+{
```
"Have you verified this server can ping the other server using the ib0 interface?" Yes, the ping works and the ib level benchmark also works. So the hardware link should be fine.
Today the nodes are used by other person to test, I'll wait one day. Just to confirm that should remove all previous patches and only apply the last one patch above?
Thanks.
Yes, thanks for the update.
Xuezhao provided another set of logs offline. The client is still getting a RDMA_CM_EVENT_CONNECT_ERROR and the server is seeing RDMA_CM_EVENT_UNREACHABLE. The patch should have printed more information about the first error, but it did not.
Xuezhao will be traveling and they have not been able to give me external access yet. I am working on a multi-threaded test app that will open N endpoints with T threads per endpoint and C concurrent connections per endpoint. It will send a few messages and/or RMAs, then close the connection. It will then initiate a new connection to maintain C open connections per endpoint. It will run for 10,000 connections or a requested amount of time.
I hope that this can trigger the issue. If not, I will wait for external access.
Xuezhao reported the following issue (must investigate):
0001-handle-the-possible-race-in-connection-establish.txt