openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org

ucx panic in func rdma_get_cm_event #8905

Closed: HehuaTang closed this issue 1 year ago

HehuaTang commented 1 year ago

Describe the bug

Client stack backtrace:

I0224 08:23:21.570152 67335 /orpc/orpc-dep/orpc/src/brpc/ucp_ctx.cpp:75] Running with ucp library version: 1.14.0
[wjw-roce-test231-m:67322:a:67355] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace (tid: 67355) ====
 0 /usr/local/ucx/lib/libucs.so.0(ucs_handle_error+0x124) [0x7fcd121fefd4]
 1 /usr/local/ucx/lib/libucs.so.0(+0x2e2fc) [0x7fcd121ff2fc]
 2 /usr/local/ucx/lib/libucs.so.0(+0x2e56b) [0x7fcd121ff56b]
 3 /lib64/libpthread.so.0(+0xf630) [0x7fcd11dc2630]
 4 /lib64/librdmacm.so.1(rdma_get_cm_event+0x39e) [0x7fcce66ff18e]
 5 /usr/local/ucx/lib/ucx/libuct_rdmacm.so.0(+0x5a7f) [0x7fcce6915a7f]
 6 /usr/local/ucx/lib/libucs.so.0(ucs_async_dispatch_handlers+0x160) [0x7fcd121e8710]
 7 /usr/local/ucx/lib/libucs.so.0(+0x1a1c3) [0x7fcd121eb1c3]
 8 /usr/local/ucx/lib/libucs.so.0(ucs_event_set_wait+0xa9) [0x7fcd12207fe9]
 9 /usr/local/ucx/lib/libucs.so.0(+0x1a4fc) [0x7fcd121eb4fc]
10 /lib64/libpthread.so.0(+0x7ea5) [0x7fcd11dbaea5]
11 /lib64/libc.so.6(clone+0x6d) [0x7fcd100c8b0d]

Segmentation fault

Server (the stack backtrace is the same as on the client):

[root@wjw-roce-test231-m bu]# cp ../*.pem ./
[root@wjw-roce-test231-m bu]# UCX_TLS=^tcp UCX_IB_GID_INDEX=3 UCX_NET_DEVICES=mlx5_19:1 ./multi_threaded_echo_server
I0224 11:20:01.514996 104434 /orpc/orpc-dep/orpc/src/brpc/ucp_ctx.cpp:75] Running with ucp library version: 1.14.0
I0224 11:20:02.259763 104434 /orpc/orpc-dep/orpc/src/brpc/ucp_acceptor.cpp:323] Ucp server is listening on IP 0.0.0.0 port 13339, idle connection check interval: -1s
I0224 11:20:02.259822 104434 /orpc/orpc-dep/orpc/src/brpc/server.cpp:1133] Server[example::EchoServiceImpl] is serving on port=8002.
I0224 11:20:02.260134 104434 /orpc/orpc-dep/orpc/src/brpc/server.cpp:1136] Check out http://wjw-roce-test231-m:8002 in web browser.

[wjw-roce-test231-m:104434:a:104538] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace (tid: 104538) ====
 0 /usr/local/ucx/lib/libucs.so.0(ucs_handle_error+0x124) [0x7f653a9c8374]
 1 /usr/local/ucx/lib/libucs.so.0(+0x2e69c) [0x7f653a9c869c]
 2 /usr/local/ucx/lib/libucs.so.0(+0x2e90b) [0x7f653a9c890b]
 3 /lib64/libpthread.so.0(+0xf630) [0x7f653a58b630]
 4 /usr/local/ucx/lib/ucx/libuct_rdmacm.so.0(+0x5c1b) [0x7f64ec3d1c1b]
 5 /usr/local/ucx/lib/libucs.so.0(ucs_async_dispatch_handlers+0x160) [0x7f653a9b18d0]
 6 /usr/local/ucx/lib/libucs.so.0(+0x1a563) [0x7f653a9b4563]
 7 /usr/local/ucx/lib/libucs.so.0(ucs_event_set_wait+0xa9) [0x7f653a9d1389]
 8 /usr/local/ucx/lib/libucs.so.0(+0x1a89c) [0x7f653a9b489c]
 9 /lib64/libpthread.so.0(+0x7ea5) [0x7f653a583ea5]
10 /lib64/libc.so.6(clone+0x6d) [0x7f6538891b0d]

Segmentation fault
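For context on where this sits: frame 4 of the client backtrace is rdma_get_cm_event, the librdmacm call that a progress/async thread uses to pick up connection-manager events from an event channel. The following is only a minimal sketch of that polling pattern, written for illustration (it is not the UCX or brpc code), to show which call is faulting:

/* Minimal sketch of the librdmacm event-polling pattern seen in the
 * backtrace (illustration only, not the actual UCX/brpc code).
 * Build with: gcc cm_poll.c -lrdmacm -o cm_poll */
#include <stdio.h>
#include <stdlib.h>
#include <rdma/rdma_cma.h>

int main(void)
{
    /* Event channel that connection-manager events are delivered to. */
    struct rdma_event_channel *ch = rdma_create_event_channel();
    if (!ch) {
        perror("rdma_create_event_channel");
        return EXIT_FAILURE;
    }

    struct rdma_cm_event *ev;
    /* rdma_get_cm_event() is frame 4 of the client backtrace above;
     * it blocks until librdmacm reports the next CM event. */
    while (rdma_get_cm_event(ch, &ev) == 0) {
        printf("CM event: %s\n", rdma_event_str(ev->event));
        rdma_ack_cm_event(ev);   /* every event must be acknowledged */
    }

    rdma_destroy_event_channel(ch);
    return EXIT_SUCCESS;
}

The fault address (nil) suggests that whatever rdma_get_cm_event dereferences internally is NULL at that point, which hints at librdmacm or device state rather than UCX itself.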

Steps to Reproduce

Setup and versions

3773- LMC: 0
3782- SM lid: 0
3794- Capability mask: 0x00010000
3824- Port GUID: 0x0000000000000000
3856- Link layer: Ethernet
3879:CA 'mlx5_19'
3892- CA type: MT4120
3909- Number of ports: 1
3929- Firmware version: 16.31.2006
3959- Hardware version: 0
3980- Node GUID: 0x0000000000000000

[root@wjw-roce-test231-m bu]# ibv_devinfo -vv | grep 5_19 -a5 -b5
58540- GID[ 0]: fe80:0000:0000:0000:f816:92ff:fec2:f696, RoCE v1
58603- GID[ 1]: fe80::f816:92ff:fec2:f696, RoCE v2
58652- GID[ 2]: 0000:0000:0000:0000:0000:ffff:0a26:9849, RoCE v1
58715- GID[ 3]: ::ffff:10.38.152.73, RoCE v2
58758-
58759:hca_id: mlx5_19
58775- transport: InfiniBand (0)
58804- fw_ver: 16.31.2006
58827- node_guid: 0000:0000:0000:0000
58861- sys_image_guid: b8ce:f603:000c:29c8
58900- vendor_id: 0x02c9

Additional information (depending on the issue)

Issue description: I run two RDMA apps, a client and a server, in pods. They are in the same namespace, in separate pods created by k8s. The RDMA device is an SR-IOV NIC. The crash happens as described above. If I run the same apps in two containers created with docker run, they work fine. The docker command I use is:

docker run --net=host --cap-add SYS_PTRACE --shm-size=8g --device=/dev/infiniband:/dev/infiniband:rw --name orpc-test-2 hub.xyz.com.orpc-rdma/orpc-rdma-depy:v1.0.0

yosefe commented 1 year ago

@HehuaTang the crash seems to be coming from rdma_get_cm_event in librdmacm. Can you try running a simple rdmacm test in the k8s containers?

server: ib_send_lat -R -x 3
client: ib_send_lat -R <server-rdma-ip> -x 3
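If perftest is not installed in the pod, a bare-bones librdmacm check can exercise the same rdma_cm path that the -R flag uses. The sketch below is only an illustration under assumptions (the default IP and port are placeholders, and this is not part of perftest or UCX): it resolves the destination and builds an rdma_cm endpoint, which already goes through librdmacm's device and route lookup.

/* Hypothetical standalone rdma_cm check (not from perftest or UCX).
 * Build with: gcc cm_resolve.c -lrdmacm -o cm_resolve
 * Usage: ./cm_resolve <server-rdma-ip> */
#include <stdio.h>
#include <stdlib.h>
#include <rdma/rdma_cma.h>

int main(int argc, char **argv)
{
    const char *server_ip = argc > 1 ? argv[1] : "10.38.152.73"; /* placeholder IP */
    struct rdma_addrinfo hints = { .ai_port_space = RDMA_PS_TCP };
    struct rdma_addrinfo *res;
    struct rdma_cm_id *id;

    /* Resolve the destination through librdmacm (the port is a placeholder). */
    if (rdma_getaddrinfo(server_ip, "18515", &hints, &res)) {
        perror("rdma_getaddrinfo");
        return EXIT_FAILURE;
    }

    /* Creating the endpoint resolves the address and route on the RDMA device. */
    if (rdma_create_ep(&id, res, NULL, NULL)) {
        perror("rdma_create_ep");
        rdma_freeaddrinfo(res);
        return EXIT_FAILURE;
    }

    printf("rdma_cm endpoint created, address/route resolution OK\n");
    rdma_destroy_ep(id);
    rdma_freeaddrinfo(res);
    return EXIT_SUCCESS;
}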

HehuaTang commented 1 year ago

I have tried it, and it failed to run on the client.

Server:

[root@wjw-roce-test231-m /]# ib_send_lat -R -x 3
 Port number 1 state is Down
 Couldn't set the link layer
 Couldn't get context for the device
[root@wjw-roce-test231-m /]#
[root@wjw-roce-test231-m /]# show_gids
DEV      PORT  INDEX  GID                                      IPv4           VER  DEV
mlx5_19  1     0      fe80:0000:0000:0000:f816:05ff:fe26:942c                 v1   eth0
mlx5_19  1     1      fe80:0000:0000:0000:f816:05ff:fe26:942c                 v2   eth0
mlx5_19  1     2      0000:0000:0000:0000:0000:ffff:0a26:9b74  10.38.155.116  v1   eth0
mlx5_19  1     3      0000:0000:0000:0000:0000:ffff:0a26:9b74  10.38.155.116  v2   eth0
n_gids_found=4
[root@wjw-roce-test231-m /]# ib_send_lat -d mlx5_19 -X 3
 Events must be enabled to select a completion vector
[root@wjw-roce-test231-m /]# ib_send_lat -d mlx5_19 -x 3


Client:

[root@wjw-roce-test220-m /]# show_gids
DEV      PORT  INDEX  GID                                      IPv4           VER  DEV
mlx5_14  1     0      fe80:0000:0000:0000:f816:92ff:fe5d:9dff                 v1   eth0
mlx5_14  1     1      fe80:0000:0000:0000:f816:92ff:fe5d:9dff                 v2   eth0
mlx5_14  1     2      0000:0000:0000:0000:0000:ffff:0a26:9b15  10.38.155.21   v1   eth0
mlx5_14  1     3      0000:0000:0000:0000:0000:ffff:0a26:9b15  10.38.155.21   v2   eth0
n_gids_found=4

[root@wjw-roce-test220-m /]# ib_send_lat -R 10.38.155.116 -x 3 -d mlx5_14
Segmentation fault

When the device is mounted into the container, you can see all the SR-IOV devices, but only one of them can work because of the container's isolated network. The docker device mapping is:

"Devices": [
    {
        "PathOnHost": "/dev/infiniband",
        "PathInContainer": "/dev/infiniband",
        "CgroupPermissions": "rwm"
    }
],
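To make that visibility point concrete, here is a small, hypothetical libibverbs listing (not part of the report) that prints every RDMA device the container can see; with /dev/infiniband mounted wholesale, all SR-IOV VFs show up, even though only the one attached to the pod's network namespace is usable:

/* Hypothetical helper: list the RDMA devices visible in this container.
 * Build with: gcc list_devs.c -libverbs -o list_devs */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num = 0;
    struct ibv_device **list = ibv_get_device_list(&num);

    if (!list) {
        perror("ibv_get_device_list");
        return 1;
    }

    printf("%d RDMA device(s) visible:\n", num);
    for (int i = 0; i < num; i++)
        printf("  %s\n", ibv_get_device_name(list[i]));

    ibv_free_device_list(list);
    return 0;
}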

PS: if I drop the -R option on both the client and the server, the ib_send_lat test runs successfully.

dmesg:

[416451.853362] ib_send_lat[95655]: segfault at 0 ip 00007f3b6ecc518e sp 00007fff234faa60 error 4 in librdmacm.so.1.3.43.0[7f3b6ecbc000+18000]

yosefe commented 1 year ago

So it seems to be an issue in rdma-core and not in UCX.

HehuaTang commented 1 year ago

Thank you for helping me find the root cause.

HehuaTang commented 1 year ago

Closing it, as it is not a UCX issue but a librdmacm issue.