openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org

Problem with client/server connection when atomics feature requested #4353

Open gmegan opened 5 years ago

gmegan commented 5 years ago

I am working on a BlueField development system running CentOS 7.6 with kernel version 5.2.1. When I try to run a simple server/client connection with the UCP atomics feature requested, I get errors on the client side saying the transport to the server has no support for atomics. I have tried different UCX_TLS options but so far have not found any that work.

I am using the latest master branches of the ucx and rdma-core git repos for this test. The system runs upstream kernel modules plus the CentOS inbox libraries for the RDMA stack.

To recreate this failure, I modify the ucp_client_server.c code from test/examples in the ucx repo, changing the feature request line as follows:

    ucp_params.features     = UCP_FEATURE_STREAM |
+      UCP_FEATURE_AMO32 | UCP_FEATURE_AMO64;
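(For reference, a minimal sketch of what the context setup looks like after this change; this is my reconstruction, not the exact code of the example, and the helper name init_context_with_amo is mine.)

    /* Sketch of the modified context initialization: request 32- and 64-bit
     * atomics in addition to the stream feature. */
    #include <string.h>
    #include <ucp/api/ucp.h>

    static ucs_status_t init_context_with_amo(ucp_context_h *ucp_context)
    {
        ucp_params_t ucp_params;

        memset(&ucp_params, 0, sizeof(ucp_params));
        ucp_params.field_mask = UCP_PARAM_FIELD_FEATURES;
        ucp_params.features   = UCP_FEATURE_STREAM |
                                UCP_FEATURE_AMO32  |  /* 32-bit atomics */
                                UCP_FEATURE_AMO64;    /* 64-bit atomics */

        /* NULL config: settings are taken from the environment (UCX_TLS, ...) */
        return ucp_init(&ucp_params, NULL, ucp_context);
    }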

This results in the following error when running this example in client mode:

select.c:453  UCX  ERROR no atomic operations on registered memory transport to 10.0.0.60:13337: Unsupported operation

Here are full logs of the server and client runs with debug logging enabled, which contain more information:

server.out.txt

client.out.txt

shamisp commented 5 years ago

Looking at the code: apparently, in order to enable AMO emulation you have to make sure the ucp_wireup_ep_params_is_err_mode_peer() check passes, which requires un-setting UCP_EP_PARAM_FIELD_ERR_HANDLING_MODE and UCP_ERR_HANDLING_MODE_PEER.

Example can be found here: test/examples/ucp_client_server.c:142

@yosefe it is not documented anywhere that AMO enablement depends on the error handling mode. I don't think the dependency makes much sense. No useful error is reported either ...
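(For clarity, a sketch of what the endpoint parameters look like once the peer error-handling mode is left out. This is a reconstruction rather than the example's exact code; connect_addr, err_cb, ucp_worker and client_ep stand in for the example's own variables.)

    /* Sketch: client endpoint parameters without
     * UCP_EP_PARAM_FIELD_ERR_HANDLING_MODE, so the wireup check mentioned
     * above does not disable AMO emulation. */
    ucp_ep_params_t ep_params;
    ucs_status_t    status;

    memset(&ep_params, 0, sizeof(ep_params));
    ep_params.field_mask         = UCP_EP_PARAM_FIELD_FLAGS     |
                                   UCP_EP_PARAM_FIELD_SOCK_ADDR |
                                   UCP_EP_PARAM_FIELD_ERR_HANDLER;
                                   /* ERR_HANDLING_MODE intentionally not set */
    ep_params.flags              = UCP_EP_PARAMS_FLAGS_CLIENT_SERVER;
    ep_params.sockaddr.addr      = (struct sockaddr*)&connect_addr; /* server address */
    ep_params.sockaddr.addrlen   = sizeof(connect_addr);
    ep_params.err_handler.cb     = err_cb;   /* error callback from the example */
    ep_params.err_handler.arg    = NULL;

    status = ucp_ep_create(ucp_worker, &ep_params, &client_ep);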

gmegan commented 5 years ago

Removing the peer error handling mode from the client connection will allow the client to create an ep to the server which uses atomics emulation.

However, due to the use of 'UCP_FORCE_MODE', the client will fail to close its ep to the server.

@@ -150,9 +150,8 @@ static int start_client(ucp_worker_h ucp_worker, const char *ip,
      */
     ep_params.field_mask       = UCP_EP_PARAM_FIELD_FLAGS       |
                                  UCP_EP_PARAM_FIELD_SOCK_ADDR   |
                                  UCP_EP_PARAM_FIELD_ERR_HANDLER |
-                                 UCP_EP_PARAM_FIELD_ERR_HANDLING_MODE;
-    ep_params.err_mode         = UCP_ERR_HANDLING_MODE_PEER;

With this change, the client run output is:

UCX_NET_DEVICES=mlx5_0:1 UCX_TLS=\\rc,\\rc_mlx5,\\ud,rdmacm,sm UCX_SHM_DEVICES=memory UCX_LOG_LEVEL=info ./ucp_client_server.x -a 10.0.0.60
[1572541859.331946] [blue:30630:0]     ucp_worker.c:1516 UCX  INFO  ep_cfg[2]: rma(rc_mlx5/mlx5_0:1 posix/memory sysv/memory); stream(posix/memory); 
stream_send_cb returned with status 0 (Success)

-----------------------------------------

Client sent message: 
UCX Client-Server Hello World.
length: 30

-----------------------------------------

failed to close ep 0xffff837bb000
brminich commented 5 years ago

With error handling, SW emulation is disabled because it implies message exchange between sender and receiver. Since we do not have any keep-alive mechanism yet, we would not be able to detect peer failure if the peer dies right after receiving the SW emulation request (without sending a reply back). So several protocols are not supported with error handling (including RNDV and sync send).

shamisp commented 5 years ago

@brminich the constraint is not documented anywhere and no sensible error is returned. Can we opt not to handle errors and use emulation? What are the side effects?

My guess is you don't want to "hang" the sender in case the receiver dies. But I guess in this case some error will eventually be signaled to the QP, socket, etc.?

shamisp commented 5 years ago

> Removing the peer error handling mode from the client connection will allow the client to create an ep to the server which uses atomics emulation.
>
> However, due to the use of 'UCP_FORCE_MODE', the client will fail to close its ep to the server.
> [...]
> failed to close ep 0xffff837bb000

It would be useful to add some tracing to see why exactly it fails.

brminich commented 5 years ago

Without error handling you will still get some indication when a process fails (in most cases), but you may not be able to recover. Can you avoid using the UCP_FORCE_MODE close mode and use FLUSH instead?

@evgeny-leksikov, @alinask is UCP_ERR_HANDLING_MODE_PEER required when connecting with a sock addr? Or is it just a limitation of our example?
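(A rough sketch of the suggested FLUSH-mode close; this is not the example's exact code, but it uses the standard UCP calls ucp_ep_close_nb, ucp_worker_progress and ucp_request_check_status.)

    /* Sketch: close the endpoint in FLUSH mode and progress the worker until
     * the close request completes, instead of using UCP_EP_CLOSE_MODE_FORCE. */
    #include <stdio.h>
    #include <ucp/api/ucp.h>

    static void ep_close_flush(ucp_worker_h worker, ucp_ep_h ep)
    {
        ucs_status_t status;
        void *request;

        request = ucp_ep_close_nb(ep, UCP_EP_CLOSE_MODE_FLUSH);
        if (request == NULL) {
            return;                              /* completed immediately */
        } else if (UCS_PTR_IS_ERR(request)) {
            fprintf(stderr, "failed to close ep %p: %s\n", (void*)ep,
                    ucs_status_string(UCS_PTR_STATUS(request)));
            return;
        }

        do {                                     /* wait for the flush to finish */
            ucp_worker_progress(worker);
            status = ucp_request_check_status(request);
        } while (status == UCS_INPROGRESS);

        ucp_request_free(request);
    }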

evgeny-leksikov commented 5 years ago

Without peer error handling mode and ep_close in force mode, UCP cleanup requires an out-of-band sync. Today it makes no difference how the connection was established, because we use RDMA_PS_UDP.

> My guess is you don't want to "hang" the sender in case the receiver dies.

@shamisp exactly

> But I guess in this case some error will eventually be signaled to the QP, socket, etc.?

No, there is a chance that the TX WQEs complete successfully, the receiver (or network) dies, and the initiator hangs waiting for an ACK while the network stays silent. A TCP socket signals an error thanks to its keep-alive protocol; we don't have that for IB transports yet.

brminich commented 5 years ago

So, as a conclusion: no need to use error handling, and use the FLUSH close mode instead of force?

evgeny-leksikov commented 5 years ago

That example does not have an out-of-band TCP connection the way hello_world does.

shamisp commented 5 years ago

@evgeny-leksikov if you close a connected QP, the other side will receive an event through the async event handler. For UD it is a different story, but since you already have all the retransmission logic and a progress thread, keep-alive should be somewhat simple...
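(For context, this is roughly how a fatal QP event would surface through the verbs async event channel. It is a sketch, not UCX code, and whether a remote close actually produces such an event without traffic is exactly what is debated below.)

    /* Sketch: consume device async events and react to IBV_EVENT_QP_FATAL.
     * ibv_get_async_event() blocks, so this normally runs on a dedicated
     * thread, or the device's async_fd is added to an event loop. */
    #include <stdio.h>
    #include <infiniband/verbs.h>

    static void poll_async_events(struct ibv_context *ctx)
    {
        struct ibv_async_event event;

        if (ibv_get_async_event(ctx, &event)) {
            return;  /* interrupted or error */
        }

        if (event.event_type == IBV_EVENT_QP_FATAL) {
            fprintf(stderr, "fatal event on QP %p\n", (void*)event.element.qp);
            /* mark the corresponding endpoint as failed */
        }

        ibv_ack_async_event(&event);
    }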

shamisp commented 5 years ago

@brminich - we can potentially use flush, but @gmegan is hitting some other error in flush mode (she can comment here). No RNDV, no AMO, no sync send - that is very constraining. We want to use client/server more and more, and I think Java/Python users will also mostly use the client/server mode.

evgeny-leksikov commented 5 years ago

> if you close a connected QP, the other side will receive an event through the async event handler.

What is the async event handler in this context? As far as I know, IB doesn't report such an error if there is no traffic.

shamisp commented 5 years ago

@evgeny-leksikov Do you close the QP?

shamisp commented 5 years ago

IBV_EVENT_QP_FATAL - but I think you are right about the fact that some WQE has to fail. Now I recall that in my MVAPICH version of similar logic I had retransmit on timeout.

What about rdma_disconnect()? If one side issues an rdmacm disconnect, does it generate an event on the other side?

evgeny-leksikov commented 5 years ago

Yep, this is what we are working on: using RDMA_PS_TCP + rdma_disconnect in UCT, plus a close protocol in UCP.
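(A sketch of the rdmacm side of that plan: with RDMA_PS_TCP, a peer's rdma_disconnect() is delivered as RDMA_CM_EVENT_DISCONNECTED on the other side's event channel. Illustration only, not the UCT implementation.)

    /* Sketch: drain the rdma_cm event channel and detect a peer disconnect. */
    #include <stdio.h>
    #include <rdma/rdma_cma.h>

    static void handle_cm_events(struct rdma_event_channel *channel)
    {
        struct rdma_cm_event *event;

        while (rdma_get_cm_event(channel, &event) == 0) {
            if (event->event == RDMA_CM_EVENT_DISCONNECTED) {
                fprintf(stderr, "peer disconnected on cm_id %p\n",
                        (void*)event->id);
                /* tear down the endpoint bound to event->id */
            }
            rdma_ack_cm_event(event);
        }
    }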

shamisp commented 5 years ago

@evgeny-leksikov makes sense. What about the use case where you don't have rdmacm?

evgeny-leksikov commented 5 years ago

@shamisp We need to add the same approach to UCT/sockcm.

gmegan commented 5 years ago

> Without error handling you will still get some indication when a process fails (in most cases), but you may not be able to recover. Can you avoid using the UCP_FORCE_MODE close mode and use FLUSH instead?

@brminich When changing to FLUSH mode, the client can close the ep. However, the server encounters an error trying to close its ep. In my tests, this error occurs consistently the second time a client connects. The error is only on the server, and both server and client attempt to close the ep in FLUSH mode.

UCX_NET_DEVICES=mlx5_0:1 UCX_TLS=\\rc,\\rc_mlx5,\\ud,rdmacm,sm UCX_SHM_DEVICES=memory UCX_LOG_LEVEL=info ./ucp_client_server.x
server is listening on IP 0.0.0.0 port 13337
Waiting for connection...
[1572619137.421345] [blue:25599:0]     ucp_worker.c:1516 UCX  INFO  ep_cfg[2]: rma(rc_mlx5/mlx5_0:1 posix/memory sysv/memory); stream(posix/memory); 
stream_recv_cb returned with status 0 (Success), length: 30
UCX data message was received

----- UCP TEST SUCCESS -------

UCX Client-Server Hello World

------------------------------

Waiting for another connection...
stream_recv_cb returned with status 0 (Success), length: 30
UCX data message was received

----- UCP TEST SUCCESS -------

UCX Client-Server Hello World

------------------------------

[blue:25599:0:25599]    ucp_ep.inl:137  Assertion `ep->flags & UCP_EP_FLAG_FLUSH_STATE_VALID' failed

/home/meggro01/src/osss-distro/BUILD/ucx/BUILD/../src/ucp/core/ucp_ep.inl: [ ucp_ep_flush_state() ]
      ...
      134 static UCS_F_ALWAYS_INLINE ucp_ep_flush_state_t* ucp_ep_flush_state(ucp_ep_h ep)
      135 {
      136     ucs_assert(ep->flags & UCP_EP_FLAG_FLUSH_STATE_VALID);
==>   137     ucs_assert(!(ep->flags & UCP_EP_FLAG_ON_MATCH_CTX));
      138     ucs_assert(!(ep->flags & UCP_EP_FLAG_LISTENER));
      139     return &ucp_ep_ext_gen(ep)->flush_state;
      140 }

==== backtrace (tid:  25599) ====
 0 0x0000000000041d38 ucp_ep_flush_state()  /home/meggro01/src/osss-distro/BUILD/ucx/BUILD/../src/ucp/core/ucp_ep.inl:137
 1 0x0000000000041d38 ucp_ep_flush_progress()  /home/meggro01/src/osss-distro/BUILD/ucx/BUILD/src/ucp/../../../src/ucp/rma/flush.c:111
 2 0x00000000000421b8 ucp_ep_flush_progress_pending()  /home/meggro01/src/osss-distro/BUILD/ucx/BUILD/src/ucp/../../../src/ucp/rma/flush.c:177
 3 0x0000000000080c38 ucp_request_try_send()  /home/meggro01/src/osss-distro/BUILD/ucx/BUILD/../src/ucp/core/ucp_request.inl:171
 4 0x0000000000080c38 ucp_request_send()  /home/meggro01/src/osss-distro/BUILD/ucx/BUILD/../src/ucp/core/ucp_request.inl:206
 5 0x0000000000080c38 ucp_wireup_ep_progress()  /home/meggro01/src/osss-distro/BUILD/ucx/BUILD/src/ucp/../../../src/ucp/wireup/wireup_ep.c:97
 6 0x0000000000061138 ucs_callbackq_slow_proxy()  /home/meggro01/src/osss-distro/BUILD/ucx/BUILD/src/ucs/../../../src/ucs/datastruct/callbackq.c:397
 7 0x000000000002bca8 ucs_callbackq_dispatch()  /home/meggro01/src/osss-distro/BUILD/ucx/BUILD/../src/ucs/datastruct/callbackq.h:211
 8 0x0000000000032534 uct_worker_progress()  /home/meggro01/src/osss-distro/BUILD/ucx/BUILD/../src/uct/api/uct.h:2204
 9 0x00000000004018ec flush_ep()  /home/meggro01/src/osss-distro/ucx-test/ucp_client_server.c:272
10 0x0000000000401934 ep_close()  /home/meggro01/src/osss-distro/ucx-test/ucp_client_server.c:291
11 0x0000000000401ea4 server_conn_handle_cb()  /home/meggro01/src/osss-distro/ucx-test/ucp_client_server.c:468
12 0x00000000000229b4 ucp_listener_conn_request_progress()  /home/meggro01/src/osss-distro/BUILD/ucx/BUILD/src/ucp/../../../src/ucp/core/ucp_listener.c:79
13 0x0000000000061138 ucs_callbackq_slow_proxy()  /home/meggro01/src/osss-distro/BUILD/ucx/BUILD/src/ucs/../../../src/ucs/datastruct/callbackq.c:397
14 0x000000000002bca8 ucs_callbackq_dispatch()  /home/meggro01/src/osss-distro/BUILD/ucx/BUILD/../src/ucs/datastruct/callbackq.h:211
15 0x0000000000032534 uct_worker_progress()  /home/meggro01/src/osss-distro/BUILD/ucx/BUILD/../src/uct/api/uct.h:2204
16 0x0000000000402250 main()  /home/meggro01/src/osss-distro/ucx-test/ucp_client_server.c:615
17 0x00000000000215d4 __libc_start_main()  :0
18 0x000000000040120c _start()  :0
=================================
[blue:25599:0:25599] Process frozen, press Enter to attach a debugger...

Attaching to process 25599
Reading symbols from /home-local/meggro01/src/osss-distro/ucx-test/ucp_client_server.x...done.
Reading symbols from /home/meggro01/opt/ucx/lib/libucp.so.0...done.
Loaded symbols for /home/meggro01/opt/ucx/lib/libucp.so.0
Reading symbols from /home/meggro01/opt/ucx/lib/libucs.so.0...done.
Loaded symbols for /home/meggro01/opt/ucx/lib/libucs.so.0
Reading symbols from /home/meggro01/opt/ucx/lib/libuct.so.0...done.
Loaded symbols for /home/meggro01/opt/ucx/lib/libuct.so.0
Reading symbols from /home/meggro01/opt/ucx/lib/libucm.so.0...done.
Loaded symbols for /home/meggro01/opt/ucx/lib/libucm.so.0
Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/libnuma.so.1...Reading symbols from /lib64/libnuma.so.1...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /lib64/libnuma.so.1
Reading symbols from /lib64/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libm.so.6
Reading symbols from /lib64/libpthread.so.0...(no debugging symbols found)...done.
[New LWP 25603]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /lib64/librt.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/librt.so.1
Reading symbols from /lib64/libz.so.1...Reading symbols from /lib64/libz.so.1...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /lib64/libz.so.1
Reading symbols from /lib64/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib/ld-linux-aarch64.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib/ld-linux-aarch64.so.1
Reading symbols from /lib64/libgcc_s.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libgcc_s.so.1
Reading symbols from /home/meggro01/opt/ucx/lib/ucx/libuct_ib.so.0...done.
Loaded symbols for /home/meggro01/opt/ucx/lib/ucx/libuct_ib.so.0
Reading symbols from /home/meggro01/opt/rdma-core/build/lib/libibverbs.so.1...done.
Loaded symbols for /home/meggro01/opt/rdma-core/build/lib/libibverbs.so.1
Reading symbols from /home/meggro01/opt/rdma-core/build/lib/libmlx5.so.1...done.
Loaded symbols for /home/meggro01/opt/rdma-core/build/lib/libmlx5.so.1
Reading symbols from /lib64/libnl-route-3.so.200...Reading symbols from /lib64/libnl-route-3.so.200...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /lib64/libnl-route-3.so.200
Reading symbols from /lib64/libnl-3.so.200...Reading symbols from /lib64/libnl-3.so.200...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /lib64/libnl-3.so.200
Reading symbols from /home/meggro01/opt/ucx/lib/ucx/libuct_rdmacm.so.0...done.
Loaded symbols for /home/meggro01/opt/ucx/lib/ucx/libuct_rdmacm.so.0
Reading symbols from /home/meggro01/opt/rdma-core/build/lib/librdmacm.so.1...done.
Loaded symbols for /home/meggro01/opt/rdma-core/build/lib/librdmacm.so.1
Reading symbols from /home/meggro01/opt/ucx/lib/ucx/libuct_cma.so.0...done.
Loaded symbols for /home/meggro01/opt/ucx/lib/ucx/libuct_cma.so.0
Reading symbols from /lib64/libnss_files.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libnss_files.so.2
0x0000ffff8b64638c in waitpid () from /lib64/libc.so.6
#0  0x0000ffff8b64638c in waitpid () from /lib64/libc.so.6
#1  0x0000ffff8b812dec in ucs_debugger_attach () at ../../../src/ucs/debug/debug.c:703
#2  0x0000ffff8b813294 in ucs_error_freeze (
    message=0xffffd93ec270 "Assertion `ep->flags & UCP_EP_FLAG_FLUSH_STATE_VALID' failed")
    at ../../../src/ucs/debug/debug.c:822
#3  0x0000ffff8b813a1c in ucs_handle_error (
    message=0xffffd93ec270 "Assertion `ep->flags & UCP_EP_FLAG_FLUSH_STATE_VALID' failed")
    at ../../../src/ucs/debug/debug.c:992
#4  0x0000ffff8b80ff20 in ucs_fatal_error_message (
    file=0xffff8ba15ea0 "/home/meggro01/src/osss-distro/BUILD/ucx/BUILD/../src/ucp/core/ucp_ep.inl", line=137, 
    function=0xffff8ba16298 <__FUNCTION__.30362> "ucp_ep_flush_state", 
    message_buf=0xffffd93ec270 "Assertion `ep->flags & UCP_EP_FLAG_FLUSH_STATE_VALID' failed")
    at ../../../src/ucs/debug/assert.c:33
#5  0x0000ffff8b810070 in ucs_fatal_error_format (
    file=0xffff8ba15ea0 "/home/meggro01/src/osss-distro/BUILD/ucx/BUILD/../src/ucp/core/ucp_ep.inl", line=137, 
    function=0xffff8ba16298 <__FUNCTION__.30362> "ucp_ep_flush_state", format=0xffff8ba15e88 "Assertion `%s' failed")
    at ../../../src/ucs/debug/assert.c:49
#6  0x0000ffff8b9bbd38 in ucp_ep_flush_state (ep=0xffff89c34068)
    at /home/meggro01/src/osss-distro/BUILD/ucx/BUILD/../src/ucp/core/ucp_ep.inl:137
#7  ucp_ep_flush_progress (req=0xeb96fc0) at ../../../src/ucp/rma/flush.c:111
#8  0x0000ffff8b9bc1b8 in ucp_ep_flush_progress_pending (self=0xeb97068) at ../../../src/ucp/rma/flush.c:177
#9  0x0000ffff8b9fac38 in ucp_request_try_send (pending_flags=0, req_status=0xffffd93ec97f, req=0xeb96fc0)
    at /home/meggro01/src/osss-distro/BUILD/ucx/BUILD/../src/ucp/core/ucp_request.inl:171
#10 ucp_request_send (pending_flags=0, req=0xeb96fc0)
    at /home/meggro01/src/osss-distro/BUILD/ucx/BUILD/../src/ucp/core/ucp_request.inl:206
#11 ucp_wireup_ep_progress (arg=0xeb8e930) at ../../../src/ucp/wireup/wireup_ep.c:97
#12 0x0000ffff8b808138 in ucs_callbackq_slow_proxy (arg=0xe916bc0) at ../../../src/ucs/datastruct/callbackq.c:397
#13 0x0000ffff8b9a5ca8 in ucs_callbackq_dispatch (cbq=0xe916bc0)
    at /home/meggro01/src/osss-distro/BUILD/ucx/BUILD/../src/ucs/datastruct/callbackq.h:211
#14 0x0000ffff8b9ac534 in uct_worker_progress (worker=0xe916bc0)
    at /home/meggro01/src/osss-distro/BUILD/ucx/BUILD/../src/uct/api/uct.h:2204
#15 ucp_worker_progress (worker=0xffff8a23f010) at ../../../src/ucp/core/ucp_worker.c:1923
#16 0x00000000004018ec in flush_ep (worker=0xffff8a23f010, ep=0xffff89c34068) at ucp_client_server.c:272
#17 0x0000000000401934 in ep_close (ucp_worker=0xffff8a23f010, ep=0xffff89c34068) at ucp_client_server.c:291
Missing separate debuginfos, use: debuginfo-install glibc-2.17-260.el7_6.5.aarch64 libgcc-4.8.5-36.el7_6.1.aarch64 libnl3-3.2.28-4.el7.aarch64 numactl-libs-2.0.9-7.el7.aarch64 zlib-1.2.7-18.el7.aarch64
---Type <return> to continue, or q <return> to quit---
#18 0x0000000000401ea4 in server_conn_handle_cb (conn_request=0xffff84050220, arg=0xffffd93ecd80)
    at ucp_client_server.c:468
#19 0x0000ffff8b99c9b4 in ucp_listener_conn_request_progress (arg=0xffff84050220)
    at ../../../src/ucp/core/ucp_listener.c:79
#20 0x0000ffff8b808138 in ucs_callbackq_slow_proxy (arg=0xe90ba10) at ../../../src/ucs/datastruct/callbackq.c:397
#21 0x0000ffff8b9a5ca8 in ucs_callbackq_dispatch (cbq=0xe90ba10)
    at /home/meggro01/src/osss-distro/BUILD/ucx/BUILD/../src/ucs/datastruct/callbackq.h:211
#22 0x0000ffff8b9ac534 in uct_worker_progress (worker=0xe90ba10)
    at /home/meggro01/src/osss-distro/BUILD/ucx/BUILD/../src/uct/api/uct.h:2204
#23 ucp_worker_progress (worker=0xffff8b058010) at ../../../src/ucp/core/ucp_worker.c:1923
#24 0x0000000000402250 in main (argc=1, argv=0xffffd93eced8) at ucp_client_server.c:615
557     ucp_cleanup(*ucp_context);
558 err:
559     return ret;
560 }
561 
562 
563 int main(int argc, char **argv)
564 {
565     ucx_server_ctx_t context;
566     char *server_addr = NULL;
(gdb) 
alinask commented 5 years ago

@gmegan Which changes did you make to the code? Can you please add the 'git diff' output here? Also, which ucx branch are you using?