Closed: soumagne closed this issue 7 years ago.
This is an issue with Verbs (and perhaps GNI), which use sockets to set up the connection and then switch to native communication once it is established.
Is adding volatile to the CCI connection valid? I thought volatile was for memory that the underlying hardware may change, not something intended for threads sharing memory.
I believe the correct fix is to change CCI to cache early receives and queue them until the ACCEPT is ready and then move them to the completion queue after the ACCEPT event.
Can you try this uncompiled and untested patch?
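Scott's actual patch is not shown here. Purely to illustrate the queue-until-ACCEPT idea described above, a minimal sketch with hypothetical names (not CCI's internal structures): receives that arrive before the ACCEPT completes are parked on a per-connection list and only completed once the ACCEPT event has been handled.

```c
#include <sys/queue.h>
#include <pthread.h>
#include <stdbool.h>

struct early_rx {
	void *data;
	TAILQ_ENTRY(early_rx) entry;
};

struct conn_state {
	bool accepted;                    /* set once the ACCEPT event fires        */
	TAILQ_HEAD(, early_rx) early_rxs; /* receives that beat the ACCEPT          */
	pthread_mutex_t lock;             /* (head/lock assumed initialized earlier) */
};

/* Called for each incoming receive: park it if the ACCEPT has not completed. */
static void deliver_or_queue(struct conn_state *c, struct early_rx *rx,
                             void (*complete)(struct early_rx *))
{
	bool queued = false;

	pthread_mutex_lock(&c->lock);
	if (!c->accepted) {
		TAILQ_INSERT_TAIL(&c->early_rxs, rx, entry); /* too early: park it */
		queued = true;
	}
	pthread_mutex_unlock(&c->lock);
	if (!queued)
		complete(rx); /* normal path */
}

/* Called when the ACCEPT event is handled: drain anything that arrived early. */
static void on_accept(struct conn_state *c, void (*complete)(struct early_rx *))
{
	TAILQ_HEAD(, early_rx) pending = TAILQ_HEAD_INITIALIZER(pending);
	struct early_rx *rx;

	pthread_mutex_lock(&c->lock);
	c->accepted = true;
	TAILQ_CONCAT(&pending, &c->early_rxs, entry);
	pthread_mutex_unlock(&c->lock);

	while ((rx = TAILQ_FIRST(&pending)) != NULL) {
		TAILQ_REMOVE(&pending, rx, entry);
		complete(rx); /* now safe to move to the completion queue */
	}
}
```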
Like Scott said, volatile should not be used to provide any sort of concurrency guarantee; see the discussion at http://stackoverflow.com/questions/4557979/when-to-use-volatile-with-multi-threading. tl;dr: volatile is a weird keyword.
My guess is that volatile is being used here to keep the compiler from optimizing out the redundant !na_cci_addr->cci_addr check in the added while loop; the hack could be removed by rearranging some code.
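For what it's worth, a small sketch of the pattern under discussion (hypothetical names, not the actual na_cci code): C11 atomics, or a mutex/condition variable, are the portable way to wait for the ACCEPT handler to publish the address; volatile only forces the reload and gives no ordering guarantees.

```c
#include <stdatomic.h>
#include <sched.h>

struct lookup_addr {
	_Atomic(void *) cci_addr; /* published by the thread handling the ACCEPT */
};

static void *wait_for_accept(struct lookup_addr *a)
{
	void *p;

	/* Without atomics (or locking), the compiler may hoist this load out
	 * of the loop, which is what the volatile hack was trying to prevent. */
	while ((p = atomic_load_explicit(&a->cci_addr, memory_order_acquire)) == NULL)
		sched_yield(); /* spin until the ACCEPT handler sets it */
	return p;
}
```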
I've observed a couple errors that look like this:
NA: Error in /home/carns/working/src/sds/mercury/src/na/na_cci.c:1697
# handle_recv_unexpected(): peer verbs://172.23.100.194:44619 refcnt 1
Is that related or something different? My line numbers probably differ from master as well, but that is the very first safety check at the top of handle_recv_unexpected(). I'm just trying to figure out if it aligns with this issue.
Yes, it is most likely related.
na_cci_addr->cci_addr is not set yet; it only gets set when the ACCEPT event is handled.
Regarding volatile, I agree; I said exactly the same thing to Xuezhao, but ignore that for now. The patch is not correct anyway; he only sent it so that we can figure out what the issue is.
Hi,
Thanks for looking into the problem!
Yes, the
NA: Error in /home/carns/working/src/sds/mercury/src/na/na_cci.c:1697
# handle_recv_unexpected(): peer verbs://172.23.100.194:44619 refcnt 1
error is exactly the problem caused by the race.
Regarding the volatile, it is not really needed in our testing; you can ignore it. Volatile can sometimes be useful when sharing a variable without lock protection, and it is fine to avoid it if it causes confusion.
The patch I sent is only a quick temporary solution, as we need a fix for next week's DAOS demo.
BTW, NA CCI currently busy-polls, which can cause 100% CPU usage for the progress thread (over IB verbs). Are you planning to optimize it? Maybe this is not the right place to ask, as it is another topic...
On Jun 23, 2016, at 3:13 AM, Liu Xuezhao notifications@github.com wrote:
Hi,
Thanks for looking into the problem!
Yes, the
NA: Error in /home/carns/working/src/sds/mercury/src/na/na_cci.c:1697 # handle_recv_unexpected(): peer verbs://172.23.100.194:44619 refcnt 1
error is exactly the problem caused by the race.
Regarding the volatile, it is not really needed in our testing; you can ignore it. Volatile can sometimes be useful when sharing a variable without lock protection, and it is fine to avoid it if it causes confusion.
That is abusing the semantics.
The patch I sent is only a quick temporary solution, as we need a fix for next week's DAOS demo.
Did you try my CCI patch?
BTW, NA CCI currently busy-polls, which can cause 100% CPU usage for the progress thread (over IB verbs). Are you planning to optimize it? Maybe this is not the right place to ask, as it is another topic…
Is this after the above patch or related to a Mercury patch? Can you provide more info?
"Did you try my CCI patch?" Sorry which patch? I cannot find it on this page.
"Is this after the above patch or related to a Mercury patch? Can you provide more info?" It is not related with any extra patch. Shouldn't it be the behaviour of current mercury master? In na_cci_progress, it polls CCI by calling cci_get_event() which is unblocking and just return CCI_EAGAIN when no event, and continue the polling entill timed value reached. If the progress thread just repeatly call the progress with a timeout then will eat all the cycle, no?
Apparently, GitHub does not like .patch as a file extension.
Re: the busy polling issue, you can find an experimental solution to this by using these branches of CCI and Mercury:
In my testing that is sufficient to prevent busy polling with both the verbs and sm transports in CCI when used with Mercury. The tcp transport in CCI has an orthogonal problem.
Just a word of warning, that Mercury fork/branch is just for experimental purposes; it combines work in progress from multiple pull requests that will probably be integrated in a different order at some point. That said, it works for me :)
Hi, thanks for providing the patch; I was thinking it might be good to fix this in the CCI library.
I tested with the CCI patch just now; our DAOS IOR test hangs, with one rank blocked in: (gdb) bt
Hi Carns,
Good to know about the busy-polling work; our platform is now ready for next week's demo and it is better not to change it too much. I'll take a look at that branch after the demo.
thx
Is this reproducible?
Did the patch fix the issue with early arriving messages?
On Jun 23, 2016, at 11:10 PM, Liu Xuezhao notifications@github.com wrote:
Hi, thanks for providing the patch; I was thinking it might be good to fix this in the CCI library.
I tested with the CCI patch just now; our DAOS IOR test hangs, with one rank blocked in: (gdb) bt
```
#0  0x0000003fece0e7cd in write () from /lib64/libpthread.so.0
#1  0x00007fab4c14ae58 in rdma_get_cm_event () from /usr/lib64/librdmacm.so.1
#2  0x00007fab371c4283 in verbs_get_cm_event (ep=0x2696b10) at ctp_verbs_api.c:2948
#3  0x00007fab371c77f8 in verbs_progress_ep (ep=0x2696b10) at ctp_verbs_api.c:3784
#4  0x00007fab371c8418 in ctp_verbs_get_event (endpoint=0x2696b18, event=0x7fff80017a28) at ctp_verbs_api.c:3819
#5  0x00007fab535f690c in na_cci_progress (na_class=0x243ecb0, context=<optimized out>, timeout=<optimized out>)
```
We mark the vep->rdma_channel as non-blocking, so this should not happen. Can you give more details about when the blocking occurs? Are you sure it is hung in the write?
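For context, an rdma_cm event channel is normally made non-blocking by setting O_NONBLOCK on its fd; a sketch of that (assumed to be roughly what the verbs transport does, not copied from it):

```c
#include <fcntl.h>
#include <rdma/rdma_cma.h>

/* With O_NONBLOCK set on channel->fd, rdma_get_cm_event() is expected to
 * return -1 with errno EAGAIN when no CM event is pending, instead of
 * blocking the caller. */
static int make_channel_nonblocking(struct rdma_event_channel *channel)
{
	int flags = fcntl(channel->fd, F_GETFL, 0);

	if (flags == -1)
		return -1;
	return fcntl(channel->fd, F_SETFL, flags | O_NONBLOCK);
}
```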
Sorry for the late response, I just saw this message.
We can still observe the hang after updating to the latest CCI with your patch from this page merged. This hang seems to be a different problem from the original one in this issue.
The hang happens when the client connects to the server; at that moment the client-side backtrace is: (gdb) bt
The server-side trace is: (gdb) bt
at /scratch/DAOS_OPT/src/mercury_github/src/na/na_cci.c:1886
I am not familiar with CCI internals; any findings or suspects from the above info? Thanks.
The hang happens occasionally in our testing (a DAOS internal test, daosbench; roughly 1 run in 10 hits the hang when using 8 DAOS servers and 16 clients, each server with 8 service threads, each with a separate na_class). When the client connects to a server it calls NA_Addr_lookup and then NA_Trigger/NA_Progress to wait for the ACK of the connect request, but it never succeeds.
"We mark the vep->rdma_channel as non-blocking, so this should not happen." In the above backtrace I think it is not actually blocking in write(); the client just calls NA_Progress repeatedly, so that trace shows up. The server side also calls progress() repeatedly.
Ok.
Can you export CCI_DEBUG=conn,msg,info and run again?
Do you think CCI is losing a message and/or event?
Scott
After enabling the CCI debug messages, I have not observed the same hang. After several rounds of testing (about 15~20 times) my login node rebooted, but I think that is not related to CCI or Mercury; it may be caused by the many messages that have to be redirected by ompi (we currently use an experimental ompi version, not a stable release).
But with CCI_DEBUG turned off it usually hits the hang within fewer rounds of testing.
Yes, I suspect it is some kind of message/event loss, but I cannot get a log that proves it yet. The CCI debug messages seem to change the timing of the race that may cause the hang.
On Jul 1, 2016, at 1:31 PM, Liu Xuezhao notifications@github.com wrote:
After enabling the CCI debug messages, I have not observed the same hang. After several rounds of testing (about 15~20 times) my login node rebooted, but I think that is not related to CCI or Mercury; it may be caused by the many messages that have to be redirected by ompi (we currently use an experimental ompi version, not a stable release).
But with CCI_DEBUG turned off it usually hits the hang within fewer rounds of testing.
Can you try with just CCI_DEBUG=msg?
Yes, I suspect it is some kind of message/event loss, but I cannot get a log that proves it yet. The CCI debug messages seem to change the timing of the race that may cause the hang.
When it hangs, does it hang during connection setup or after some communication (MSGs to RMA)?
Hi,
I changed CCI to write its log to a local file to avoid the reboot problem that may be caused by ompi stdout/stderr redirection; now I can get a full log when the hang happens.
As the log is very large, I put it in "/scratch/log" on the boro cluster; I am not sure if you can log in to boro (Jerome can, I think).
The hang happens when the client (log: 1.client.boro-51.cci.log) connects to the server (verbs://192.168.1.37:22775). The server has 8 na_classes with listening addresses 192.168.1.37:[22768, 22775]. The log contains about 7 rounds of testing, so maybe only the tail is useful, but I kept it all there. CCI_DEBUG is "conn,msg,info".
"When it hangs, does it hang during connection setup or after some communication (MSGs to RMA)?" From the application's view, it looks like it hangs during connection setup.
Hit the hang on another run; this time it hangs when the client connects to server verbs://192.168.1.38:22775.
I attached the shortened logs (keeping only the last part) here. From the tail of the client-side log it seems the connection is established normally, but the NA layer did not get the event. The full logs are in "/scratch/log" on the boro cluster.
3.client.boro-52.cci.log.short.txt 3.server.boro-38.cci.log.short.txt
On the client, I do not see any error messages. The last item is a connecting message.
On the server, I find:
cci:63768:ignoring RDMA_CM_EVENT_UNREACHABLE event
and then it marks ~112 connections as closed.
This error should only happen on an “active” side of a connection (i.e. a client calling cci_connect()):
RDMA_CM_EVENT_UNREACHABLE Generated on the active side to notify the user that the remote server is not reachable or unable to respond to a connection request. If this event is generated in response to a UD QP resolution request over InfiniBand, the event status field will contain an errno, if negative, or the status result carried in the IB CM SIDR REP message.
Can you give me more info about this test setup?
Scott
The "marks closed ~112 connections" seems is normal case that the client program finished executing or killed by ctrl+c. "cci:63768:ignoring RDMA_CM_EVENT_UNREACHABLE event" looks abnormal... I reproduced another time, it shows that the "ignoring RDMA_CM_EVENT_UNREACHABLE event" and "marking vconn 0x7fad58200520 closed" log msg only shows after I killed the client-side processes, that say the hang happen before that abnormal log msg. (I did a "sync" before killing the client and observe there is no those msg).
For the test setup, we have 8 daos_server processes, now the only scenario of server calling cci_connect is the rank 0 daos_server will send RPC to other servers to create the target pool. It happens at the very beginning of server startup, and the log is not in the shorted log. But the rank 0 is not boro-38 which is the node server log generated. Or say the boro-38 server did not call the cci_connect. And in the testing the client nodes are completely different nodes against server nodes.
One interesting thing is, other servers' node does not has the "ignoring RDMA_CM_EVENT_UNREACHABLE event" msg logged. And every time the hang happen, the server which the client cannot connect with will log that message once.
Can you try this patch? It does a few things:
I applied the patch and reproduced it. This time the client still hangs (client boro-52 cannot connect to boro-39, verbs://192.168.1.39:22770).
Client got "cci:22686:ignoring RDMA_CM_EVENT_CONNECT_ERROR event" log.
The server-side segfault is:
```
Program terminated with signal 11, Segmentation fault.
1820	na_cci_op_id_t *na_cci_op_id = na_cci_addr->na_cci_op_id;
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.166.el6_7.7.x86_64 libcmocka-1.0.1-1.el6.x86_64 libcxgb3-1.3.1-3.el6.x86_64 libibverbs-1.1.8-4.el6.x86_64 libipathverbs-1.3-3.el6.x86_64 libmlx4-1.0.6-7.el6.x86_64 libmlx5-1.0.2-1.el6.x86_64 libmthca-1.0.6-4.el6.x86_64 libnes-1.1.4-2.el6.x86_64 libnl-1.1.4-2.el6.x86_64 librdmacm-1.0.19.1-1.1.el6_7.x86_64 libuuid-2.17.2-12.18.el6.x86_64 numactl-2.0.9-2.el6.x86_64 openssl-1.0.1e-42.el6_7.4.x86_64 zlib-1.2.3-29.el6.x86_64
(gdb) bt
(gdb) f 0
1820	na_cci_op_id_t *na_cci_op_id = na_cci_addr->na_cci_op_id;
(gdb) p na_cci_addr
$1 = (na_cci_addr_t *) 0x0
```
The shortened logs are attached: 7.client.boro-52.cci.log.shorted.txt 7.server.boro-39.cci.log.shorted.txt
On Jul 3, 2016, at 9:26 PM, Liu Xuezhao notifications@github.com wrote:
I applied the patch and reproduced it. This time the client still hangs (client boro-52 cannot connect to boro-39, verbs://192.168.1.39:22770).
Client got "cci:22686:ignoring RDMA_CM_EVENT_CONNECT_ERROR event" log.
Ok, I will add a handler for this as well. Patch coming soon.
The server-side segfault is:
```
Program terminated with signal 11, Segmentation fault.
#0  handle_connect (na_class=0x7efdd0004950, context=0x7efdd0105bb0, timeout=<optimized out>) at /scratch/DAOS_OPT/src/mercury_github/src/na/na_cci.c:1820
1820	na_cci_op_id_t *na_cci_op_id = na_cci_addr->na_cci_op_id;
(gdb) bt
#0  handle_connect (na_class=0x7efdd0004950, context=0x7efdd0105bb0, timeout=<optimized out>) at /scratch/DAOS_OPT/src/mercury_github/src/na/na_cci.c:1820
#1  na_cci_progress (na_class=0x7efdd0004950, context=0x7efdd0105bb0, timeout=<optimized out>) at /scratch/DAOS_OPT/src/mercury_github/src/na/na_cci.c:1909
#2  0x00007efdefd67972 in NA_Progress (na_class=0x7efdd0004950, context=0x7efdd0105bb0, timeout=<optimized out>) at /scratch/DAOS_OPT/src/mercury_github/src/na/na.c:1495
#3  0x00007efdeff852af in hg_core_progress_na (context=0x7efdd01062e0, timeout=<optimized out>) at /scratch/DAOS_OPT/src/mercury_github/src/mercury_core.c:1926
#4  hg_core_progress (context=0x7efdd01062e0, timeout=<optimized out>) at /scratch/DAOS_OPT/src/mercury_github/src/mercury_core.c:1982
#5  0x00007efdeff85441 in HG_Core_progress (context=0x7efdd01062e0, timeout=1000) at /scratch/DAOS_OPT/src/mercury_github/src/mercury_core.c:2931
#6  0x00007efdf0c2416a in dtp_hg_progress (hg_ctx=<optimized out>, timeout=<optimized out>) at src/dtp/dtp_hg.c:977
#7  0x00007efdf0c21eeb in dtp_progress (dtp_ctx=0x7efdd00048f0, timeout=-1, cond_cb=0x403910, arg=0x0) at src/dtp/dtp_context.c:229
#8  0x0000000000403b90 in dss_srv_handler (arg=0x863770) at src/server/srv.c:146
#9  0x000000386b007aa1 in start_thread () from /lib64/libpthread.so.0
#10 0x000000386ace893d in clone () from /lib64/libc.so.6
(gdb) f 0
#0  handle_connect (na_class=0x7efdd0004950, context=0x7efdd0105bb0, timeout=<optimized out>) at /scratch/DAOS_OPT/src/mercury_github/src/na/na_cci.c:1820
1820	na_cci_op_id_t *na_cci_op_id = na_cci_addr->na_cci_op_id;
(gdb) p na_cci_addr
$1 = (na_cci_addr_t *) 0x0
```
Can you trigger this again and use the debugger to print the CCI event (p *event)? The event->connect.context is not set. I am curious about the rest of the event.
The shorted log attached. 7.client.boro-52.cci.log.shorted.txt 7.server.boro-39.cci.log.shorted.txt
Can you try to add `case RDMA_CM_EVENT_CONNECT_ERROR:` just above `case RDMA_CM_EVENT_UNREACHABLE:` so it just falls through?
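A sketch of what that fall-through looks like (paraphrased, not the actual CCI source; the real handler dispatches into CCI's connection cleanup rather than just printing):

```c
#include <stdio.h>
#include <rdma/rdma_cma.h>

static void handle_cm_event_sketch(struct rdma_cm_event *cm_evt)
{
	switch (cm_evt->event) {
	case RDMA_CM_EVENT_CONNECT_ERROR:  /* newly added case ...           */
	case RDMA_CM_EVENT_UNREACHABLE:    /* ... falls through to this path */
		/* fail the pending connect and surface the error to the app */
		break;
	default:
		fprintf(stderr, "ignoring %s event\n", rdma_event_str(cm_evt->event));
		break;
	}
}
```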
Hi, I changed it as you suggested (added case RDMA_CM_EVENT_CONNECT_ERROR) and reproduced it again; this time client boro-49 cannot connect to server boro-35 (verbs://192.168.1.35:22775).
The server-side segfault:
```
(gdb) p na_cci_addr
$1 = (na_cci_addr_t *) 0x0
(gdb) p *event
$2 = {type = CCI_EVENT_CONNECT, send = {type = CCI_EVENT_CONNECT, status = CCI_ECONNREFUSED, connection = 0x0, context = 0x0}, recv = {type = CCI_EVENT_CONNECT, len = 111, ptr = 0x0, connection = 0x0}, connect = {type = CCI_EVENT_CONNECT, status = CCI_ECONNREFUSED, context = 0x0, connection = 0x0}, request = {type = CCI_EVENT_CONNECT, data_len = 111, data_ptr = 0x0, attribute = CCI_CONN_ATTR_RO}, accept = {type = CCI_EVENT_CONNECT, status = CCI_ECONNREFUSED, context = 0x0, connection = 0x0}, keepalive = {type = CCI_EVENT_CONNECT, connection = 0x0}, dev_failed = {type = CCI_EVENT_CONNECT, endpoint = 0x0}}
```
And the client gets an NA error message:
```
NA: Error in /scratch/DAOS_OPT/src/mercury_github/src/na/na_cci.c:1855
# handle_connect(): connect to verbs://192.168.1.35:22775 failed with No route to host
```
It looks like there is no route between those two nodes, but the IB-level benchmarks work. And that error only happens occasionally for the same test.
The logs are attached: 9.client.boro-49.cci.log.shorted.txt 9.server.boro-35.cci.log.shorted.txt
Can you remove the last patch and apply this one?
```diff
diff --git a/src/plugins/ctp/verbs/ctp_verbs.h b/src/plugins/ctp/verbs/ctp_verbs.h
index c4fa95b..7ede350 100644
--- a/src/plugins/ctp/verbs/ctp_verbs.h
+++ b/src/plugins/ctp/verbs/ctp_verbs.h
@@ -220,7 +220,7 @@ typedef struct verbs_tx {
 	void *buffer;		/* registered send buffer */
 	uint16_t len;		/* length of buffer */
 	int32_t rdma_slot;	/* slot when using RDMA or -1 */
 	TAILQ_ENTRY(verbs_tx) entry;	/* hang on vep->idle_txs, vdev->queued, vdev->pending */
 	struct verbs_rma_op *rma_op;	/* owning RMA if remote completion msg */
 	struct verbs_tx_pool *tx_pool;	/* owning tx pool */
diff --git a/src/plugins/ctp/verbs/ctp_verbs_api.c b/src/plugins/ctp/verbs/ctp_verbs_api.c
index 5a29ac3..2feda3b 100644
--- a/src/plugins/ctp/verbs/ctp_verbs_api.c
+++ b/src/plugins/ctp/verbs/ctp_verbs_api.c
@@ -1759,6 +1759,8 @@ static int ctp_verbs_accept(cci_event_t * event, const void *context)
 	tx = verbs_get_tx(ep);
 	if (!tx) {
 	verbs_rx_t *rx = NULL;
 	void *context = NULL;
 	conn = cm_evt->id->context;
 	assert(conn);
 	verbs_queue_evt(ep, &rx->evt);
 out:
 	return ret;
@@ -2940,6 +2950,40 @@ verbs_handle_disconnected(cci__ep_t * ep, struct rdma_cm_event *cm_evt)
 	return ret;
 }
+static int
+verbs_handle_conn_unreachable(cci__ep_t * ep, struct rdma_cm_event *cm_evt)
+{
 		break;
 	default:
 		debug(CCI_DB_CONN, "ignoring %s event", rdma_event_str(cm_evt->event));
@@ -3927,11 +3978,15 @@ static int ctp_verbs_return_event(cci_event_t * event)
 	}
 }
 	ret = verbs_post_rx(ep, rx);
```
Thanks for looking into this problem. After applying the patch, I still get a client hang and a server segfault. This time client boro-58 cannot connect to boro-39.
Client-side error messages:
```
NA: Error in /scratch/DAOS_OPT/src/mercury_github/src/na/na_cci.c:1855
src/dtp/dtp_hg.c:46011:38:na_addr_lookup_cb() Return from callback with NA_PROTOCOL_ERROR error
```
Server segfault:
```
Core was generated by `./daos_server -c 8'.
Program terminated with signal 11, Segmentation fault.
1841	na_cci_op_id_t *na_cci_op_id = na_cci_addr->na_cci_op_id;
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.166.el6_7.7.x86_64 libcmocka-1.0.1-1.el6.x86_64 libcxgb3-1.3.1-3.el6.x86_64 libibverbs-1.1.8-4.el6.x86_64 libipathverbs-1.3-3.el6.x86_64 libmlx4-1.0.6-7.el6.x86_64 libmlx5-1.0.2-1.el6.x86_64 libmthca-1.0.6-4.el6.x86_64 libnes-1.1.4-2.el6.x86_64 libnl-1.1.4-2.el6.x86_64 librdmacm-1.0.19.1-1.1.el6_7.x86_64 libuuid-2.17.2-12.18.el6.x86_64 numactl-2.0.9-2.el6.x86_64 openssl-1.0.1e-42.el6_7.4.x86_64 zlib-1.2.3-29.el6.x86_64
(gdb) bt
(gdb) f 0
1841	na_cci_op_id_t *na_cci_op_id = na_cci_addr->na_cci_op_id;
(gdb) p na_cci_addr
$1 = (na_cci_addr_t *) 0x0
(gdb) p *event
$2 = {type = CCI_EVENT_CONNECT, send = {type = CCI_EVENT_CONNECT, status = CCI_ECONNREFUSED, connection = 0x0, context = 0x0}, recv = {type = CCI_EVENT_CONNECT, len = 111, ptr = 0x0, connection = 0x0}, connect = {type = CCI_EVENT_CONNECT, status = CCI_ECONNREFUSED, context = 0x0, connection = 0x0}, request = {type = CCI_EVENT_CONNECT, data_len = 111, data_ptr = 0x0, attribute = CCI_CONN_ATTR_RO}, accept = {type = CCI_EVENT_CONNECT, status = CCI_ECONNREFUSED, context = 0x0, connection = 0x0}, keepalive = {type = CCI_EVENT_CONNECT, connection = 0x0}, dev_failed = {type = CCI_EVENT_CONNECT, endpoint = 0x0}}
```
The logs are attached; there is no special output related to the debug messages added in the patch. 10.client.boro-58.cci.log.txt 10.server.boro-39.cci.log.txt
Can you send the complete client and server logs? You can email them to me (atchleyes at ornl dot gov) if they are too big for the list. Please tar and compress them.
Also, in ctp_verbs_api.c in ctp_verbs_create_endpoint_at(), can you comment out:
```c
	ret = rdma_create_id(vep->rdma_channel, &vep->id_ud, ep, RDMA_PS_UDP);
	if (ret == -1) {
		ret = errno;
		goto out;
	}
```
and
```c
	ret = rdma_bind_addr(vep->id_ud, (struct sockaddr *)&vep->sin);
	if (ret == -1) {
		ret = errno;
		goto out;
	}
```
I want to rule out that we are receiving a stray UD message.
These are not currently used and are safe to remove.
I reproduced it and sent the logs to you by email, thanks.
Hi Xuezhao,
Can you remove all previous patches and try this one? I have tested that it works with two single-threaded hosts. This may not solve your issue, but it fixes how CCI cleans up the RDMA connection.
A process may initiate the tear-down by calling rdma_disconnect(). In this case, it should wait for a RDMA_CM_EVENT_DISCONNECTED event before destroying the QP, etc.
Alternately, a process may get a RDMA_CM_EVENT_DISCONNECTED event (because the peer called rdma_disconnect()). In this case, it then calls rdma_disconnect(), which does not generate a new RDMA_CM_EVENT_DISCONNECTED event, and the process can immediately destroy the QP, etc.
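A sketch of the two teardown orderings described above, using standard librdmacm calls (illustrative only, not CCI's actual cleanup code):

```c
#include <rdma/rdma_cma.h>

/* Path 1: we initiate the disconnect, so wait for our own DISCONNECTED
 * event before destroying the QP and the cm_id. */
static void initiate_teardown(struct rdma_cm_id *id, struct rdma_event_channel *ch)
{
	struct rdma_cm_event *ev = NULL;

	rdma_disconnect(id);
	while (rdma_get_cm_event(ch, &ev) == 0) {
		int done = (ev->event == RDMA_CM_EVENT_DISCONNECTED);

		rdma_ack_cm_event(ev);
		if (done)
			break;
	}
	rdma_destroy_qp(id);
	rdma_destroy_id(id);
}

/* Path 2: the peer disconnected first (we already received DISCONNECTED).
 * Calling rdma_disconnect() here does not generate another DISCONNECTED
 * event, so the QP can be destroyed immediately. */
static void passive_teardown(struct rdma_cm_id *id)
{
	rdma_disconnect(id);
	rdma_destroy_qp(id);
	rdma_destroy_id(id);
}
```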
If you still hit a bug, please send me the full logs via email.
Scott
Thanks Scott, this time it still fails.
The server got these error messages:
cci:72466:ignoring RDMA_CM_EVENT_UNREACHABLE event
cci:72466:ignoring RDMA_CM_EVENT_TIMEWAIT_EXIT event
And the client got: cci:195507:ignoring RDMA_CM_EVENT_CONNECT_ERROR event
I'll send the full logs to you via email.
On Jul 6, 2016, at 9:49 PM, Liu Xuezhao notifications@github.com wrote:
Thanks Scott, this time it still fails.
The server got this error message: cci:72466:ignoring RDMA_CM_EVENT_UNREACHABLE event
We need to figure this one out. The manpage has:
RDMA_CM_EVENT_UNREACHABLE Generated on the active side to notify the user that the remote server is not reachable or unable to respond to a connection request.
Assuming the targeted server is reachable, we need to understand why it is not accepting the connection. Have you verified this server can ping the other server using the ib0 interface?
cci:72466:ignoring RDMA_CM_EVENT_TIMEWAIT_EXIT event
I believe these are harmless:
RDMA_CM_EVENT_TIMEWAIT_EXIT The QP associated with a connection has exited its timewait state and is now ready to be re-used. After a QP has been disconnected, it is maintained in a timewait state to allow any in flight packets to exit the network. After the timewait state has completed, the rdma_cm will report this event.
And client got: cci:195507:ignoring RDMA_CM_EVENT_CONNECT_ERROR event
And we need to understand this one:
RDMA_CM_EVENT_CONNECT_ERROR Indicates that an error has occurred trying to establish a connection. May be generated on the active or passive side of a connection.
I will generate a patch to try to print more info from the CM event.
I'll send the full logs to you via email.
Did either the client or server segfault, or did they simply fail to connect? The 14.log does not show a stack trace.
Also, 14.log only has:
src/dtp/dtp_hg.c:195507:120:dtp_na_addr_lookup_wait() =======================Could not connect to verbs://192.168.1.40:22797 within 20 second, myrank 0, hostname boro-44.boro.hpdd.intel.com. src/dtp/dtp_hg.c:195507:120:dtp_na_addr_lookup_wait() =======================Could not connect to verbs://192.168.1.40:22797 within 20 second, myrank 0, hostname boro-44.boro.hpdd.intel.com.
I cannot find any connection attempts to verbs://192.168.1.40:22797 in the client log. The log is not complete.
For future logs, please only use CCI_DEBUG=conn. This will greatly reduce the logs.
Argh, bad regex. It is right above the connect error message.
While waiting for the cluster access, can you try this patch? You will need to remove the old patch first.
Also, only use CCI_DEBUG=conn. We do not need the other (msg and info) messages.
```diff
diff --git a/src/plugins/ctp/verbs/ctp_verbs.h b/src/plugins/ctp/verbs/ctp_verbs.h
index c4fa95b..3e4bf67 100644
--- a/src/plugins/ctp/verbs/ctp_verbs.h
+++ b/src/plugins/ctp/verbs/ctp_verbs.h
@@ -392,7 +392,9 @@ typedef struct verbs_conn {
 	uint32_t last;		/* last slot used */
 	uint32_t *slots;	/* pointers to buffer headers to poll */
 	uint32_t cci_disconn :1;	/* still needs cci_disconnect */
 	/* for RO connections when using both SendRecv and RDMA */
 	uint16_t seqno;		/* last seqno sent */
diff --git a/src/plugins/ctp/verbs/ctp_verbs_api.c b/src/plugins/ctp/verbs/ctp_verbs_api.c
index 5a29ac3..3f06333 100644
--- a/src/plugins/ctp/verbs/ctp_verbs_api.c
+++ b/src/plugins/ctp/verbs/ctp_verbs_api.c
@@ -2034,6 +2034,7 @@ verbs_insert_conn(cci__conn_t *conn)
 	pthread_rwlock_unlock(&vep->conn_tree_lock);
 	pthread_mutex_lock(&ep->lock);
 	TAILQ_INSERT_TAIL(&vep->conns, vconn, entry);
 	vconn->cci_disconn = 1;
 	pthread_mutex_unlock(&ep->lock);
 	debug(CCI_DB_CONN, "%s [%s]: added conn %p qp_num %u", __func__, ep->uri, (void*)conn, vconn->qp_num);
@@ -2345,20 +2346,23 @@ out:
 static const char *verbs_conn_state_str(verbs_conn_state_t state);
-static int ctp_verbs_disconnect(cci_connection_t * connection)
+static void verbs_destroy_conn(cci__ep_t *ep, cci__conn_t *conn)
 {
 		__func__, (void*) conn, (void*)vconn, vconn->qp_num, conn->uri);
 	}
 	pthread_rwlock_wrlock(&vep->conn_tree_lock);
 	tdelete(&vconn->qp_num, &vep->conn_tree, verbs_compare_u32);
@@ -2377,13 +2381,6 @@ static int ctp_verbs_disconnect(cci_connection_t * connection)
 		free(vconn->conn_req->ptr);
 	}
 	if (vconn->rbuf) {
 		if (vconn->rmr) {
 			ret = ibv_dereg_mr(vconn->rmr);
@@ -2399,12 +2396,49 @@ static int ctp_verbs_disconnect(cci_connection_t * connection)
 		free(vconn->rbuf);
 	}
 	free((void *)conn->uri);
 	free(vconn);
 	free(conn);
 	verbs_destroy_conn(ep, conn);
 }
 	CCI_EXIT;
 	return ret;
 }
+static void
+verbs_handle_timewait_exit(cci__ep_t * ep, struct rdma_cm_event *cm_evt)
+{
```
Actually, please try this one. It prints out some info from the failed connect event.
```diff
diff --git a/src/plugins/ctp/verbs/ctp_verbs.h b/src/plugins/ctp/verbs/ctp_verbs.h
index c4fa95b..3e4bf67 100644
--- a/src/plugins/ctp/verbs/ctp_verbs.h
+++ b/src/plugins/ctp/verbs/ctp_verbs.h
@@ -392,7 +392,9 @@ typedef struct verbs_conn {
 	uint32_t last;		/* last slot used */
 	uint32_t *slots;	/* pointers to buffer headers to poll */
 	uint32_t cci_disconn :1;	/* still needs cci_disconnect */
 	/* for RO connections when using both SendRecv and RDMA */
 	uint16_t seqno;		/* last seqno sent */
diff --git a/src/plugins/ctp/verbs/ctp_verbs_api.c b/src/plugins/ctp/verbs/ctp_verbs_api.c
index 5a29ac3..aec4a61 100644
--- a/src/plugins/ctp/verbs/ctp_verbs_api.c
+++ b/src/plugins/ctp/verbs/ctp_verbs_api.c
@@ -2034,6 +2034,7 @@ verbs_insert_conn(cci__conn_t *conn)
 	pthread_rwlock_unlock(&vep->conn_tree_lock);
 	pthread_mutex_lock(&ep->lock);
 	TAILQ_INSERT_TAIL(&vep->conns, vconn, entry);
 	vconn->cci_disconn = 1;
 	pthread_mutex_unlock(&ep->lock);
 	debug(CCI_DB_CONN, "%s [%s]: added conn %p qp_num %u", __func__, ep->uri, (void*)conn, vconn->qp_num);
@@ -2345,20 +2346,23 @@ out:
 static const char *verbs_conn_state_str(verbs_conn_state_t state);
-static int ctp_verbs_disconnect(cci_connection_t * connection)
+static void verbs_destroy_conn(cci__ep_t *ep, cci__conn_t *conn)
 {
 		__func__, (void*) conn, (void*)vconn, vconn->qp_num, conn->uri);
 	}
 	pthread_rwlock_wrlock(&vep->conn_tree_lock);
 	tdelete(&vconn->qp_num, &vep->conn_tree, verbs_compare_u32);
@@ -2377,13 +2381,6 @@ static int ctp_verbs_disconnect(cci_connection_t * connection)
 		free(vconn->conn_req->ptr);
 	}
 	if (vconn->rbuf) {
 		if (vconn->rmr) {
 			ret = ibv_dereg_mr(vconn->rmr);
@@ -2399,12 +2396,49 @@ static int ctp_verbs_disconnect(cci_connection_t * connection)
 		free(vconn->rbuf);
 	}
 	free((void *)conn->uri);
 	free(vconn);
 	free(conn);
 	verbs_destroy_conn(ep, conn);
 }
 	CCI_EXIT;
 	return ret;
 }
+static void
+verbs_handle_timewait_exit(cci__ep_t * ep, struct rdma_cm_event *cm_evt)
+{
```
"Have you verified this server can ping the other server using the ib0 interface?" Yes, the ping works and the ib level benchmark also works. So the hardware link should be fine.
Today the nodes are used by other person to test, I'll wait one day. Just to confirm that should remove all previous patches and only apply the last one patch above?
Thanks.
Yes, thanks for the update.
Xuezhao provided another set of logs offline. The client is still getting a RDMA_CM_EVENT_CONNECT_ERROR and the server is seeing RDMA_CM_EVENT_UNREACHABLE. The patch should have printed more information about the first error, but it did not.
Xuezhao will be traveling and they have not been able to give me external access yet. I am working on a multi-threaded test app that will open N endpoints with T threads per endpoint and C concurrent connections per endpoint. It will send a few messages and/or RMAs, then close the connection. It will then initiate a new connection to maintain C open connections per endpoint. It will run for 10,000 connections or a requested amount of time.
I hope that this can trigger the issue. If not, I will wait for external access.
Xuezhao reported the following issue (must investigate):
0001-handle-the-possible-race-in-connection-establish.txt