Closed: HehuaTang closed this issue 1 year ago.
@HehuaTang the crash seems to be coming from `rdma_get_cm_event` in librdmacm. Can you try running a simple rdmacm test in the k8s container?

```
server: ib_send_lat -R -x 3
client: ib_send_lat -R <server-rdma-ip> -x 3
```
I have tried it, and it fails on the client.

Server:
```
[root@wjw-roce-test231-m /]# ib_send_lat -R -x 3
 Port number 1 state is Down
 Couldn't set the link layer
 Couldn't get context for the device
[root@wjw-roce-test231-m /]# show_gids
DEV      PORT  INDEX  GID                                      IPv4           VER  DEV
mlx5_19  1     0      fe80:0000:0000:0000:f816:05ff:fe26:942c                 v1   eth0
mlx5_19  1     1      fe80:0000:0000:0000:f816:05ff:fe26:942c                 v2   eth0
mlx5_19  1     2      0000:0000:0000:0000:0000:ffff:0a26:9b74  10.38.155.116  v1   eth0
mlx5_19  1     3      0000:0000:0000:0000:0000:ffff:0a26:9b74  10.38.155.116  v2   eth0
n_gids_found=4
[root@wjw-roce-test231-m /]# ib_send_lat -d mlx5_19 -X 3
 Events must be enabled to select a completion vector
[root@wjw-roce-test231-m /]# ib_send_lat -d mlx5_19 -x 3
```
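For reference, the `-x 3` / `UCX_IB_GID_INDEX=3` choice corresponds to the RoCE v2 GID entry that carries the IPv4 address in the `show_gids` table above. A small sketch (the helper name is mine, not from the issue; it parses the whitespace-separated `show_gids` row format shown here) that picks that index:

```python
def roce_v2_gid_index(show_gids_output):
    """Return the GID index of the first RoCE v2 entry with an IPv4
    address, i.e. the value to pass as `-x` / UCX_IB_GID_INDEX."""
    for line in show_gids_output.splitlines():
        parts = line.split()
        # A data row with an IPv4 column has 7 fields:
        # DEV PORT INDEX GID IPv4 VER NETDEV
        if len(parts) == 7 and parts[5] == "v2" and parts[4].count(".") == 3:
            return int(parts[2])
    return None

sample = """\
mlx5_19 1 0 fe80:0000:0000:0000:f816:05ff:fe26:942c v1 eth0
mlx5_19 1 1 fe80:0000:0000:0000:f816:05ff:fe26:942c v2 eth0
mlx5_19 1 2 0000:0000:0000:0000:0000:ffff:0a26:9b74 10.38.155.116 v1 eth0
mlx5_19 1 3 0000:0000:0000:0000:0000:ffff:0a26:9b74 10.38.155.116 v2 eth0"""
print(roce_v2_gid_index(sample))  # → 3
```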
Client:
```
[root@wjw-roce-test220-m /]# show_gids
DEV      PORT  INDEX  GID                                      IPv4          VER  DEV
mlx5_14  1     0      fe80:0000:0000:0000:f816:92ff:fe5d:9dff                v1   eth0
mlx5_14  1     1      fe80:0000:0000:0000:f816:92ff:fe5d:9dff                v2   eth0
mlx5_14  1     2      0000:0000:0000:0000:0000:ffff:0a26:9b15  10.38.155.21  v1   eth0
mlx5_14  1     3      0000:0000:0000:0000:0000:ffff:0a26:9b15  10.38.155.21  v2   eth0
n_gids_found=4
[root@wjw-roce-test220-m /]# ib_send_lat -R 10.38.155.116 -x 3 -d mlx5_14
Segmentation fault
```
We mount the device into the container; inside it you can see all the SR-IOV devices, but only one of them works, because of the container's isolated network:

```
"Devices": [
    {
        "PathOnHost": "/dev/infiniband",
        "PathInContainer": "/dev/infiniband",
        "CgroupPermissions": "rwm"
    }
],
```
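Since the whole `/dev/infiniband` directory is mounted, the container sees device nodes for every SR-IOV function even though only one matches its network namespace. As a quick sanity check, here is a sketch (the helper name and the parameterized path are mine, not from the issue) that verifies the `rdma_cm` node librdmacm needs and at least one `uverbs` node are present:

```python
import os

def check_infiniband_nodes(dev_dir="/dev/infiniband"):
    """Return (has_rdma_cm, uverbs_list) for the given device directory.

    librdmacm needs /dev/infiniband/rdma_cm; each RDMA device (e.g. the
    SR-IOV VF behind mlx5_19) needs its /dev/infiniband/uverbsN node.
    """
    if not os.path.isdir(dev_dir):
        return False, []
    entries = os.listdir(dev_dir)
    has_rdma_cm = "rdma_cm" in entries
    uverbs = sorted(e for e in entries if e.startswith("uverbs"))
    return has_rdma_cm, uverbs

if __name__ == "__main__":
    ok, uverbs = check_infiniband_nodes()
    print("rdma_cm present:", ok)
    print("uverbs nodes:", uverbs)
```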
PS: if I drop the `-R` option on both client and server, the `ib_send_lat` test runs successfully.
dmesg:
```
[416451.853362] ib_send_lat[95655]: segfault at 0 ip 00007f3b6ecc518e sp 00007fff234faa60 error 4 in librdmacm.so.1.3.43.0[7f3b6ecbc000+18000]
```
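That dmesg line is enough to locate the crash inside librdmacm: the faulting instruction pointer minus the library's load base gives the file offset, which can then be fed to `addr2line -e /lib64/librdmacm.so.1.3.43.0 <offset>`. A small parsing sketch (the helper name is mine; the regex is tailored to the kernel segfault-line format shown above):

```python
import re

def segfault_offset(dmesg_line):
    """Parse a kernel segfault line and return (library, hex offset of the
    faulting ip within the library's mapping), or None if it doesn't match."""
    m = re.search(r"ip ([0-9a-f]+) .* in (\S+)\[([0-9a-f]+)\+[0-9a-f]+\]",
                  dmesg_line)
    if not m:
        return None
    ip, lib, base = m.group(1), m.group(2), m.group(3)
    return lib, hex(int(ip, 16) - int(base, 16))

line = ("ib_send_lat[95655]: segfault at 0 ip 00007f3b6ecc518e "
        "sp 00007fff234faa60 error 4 in "
        "librdmacm.so.1.3.43.0[7f3b6ecbc000+18000]")
print(segfault_offset(line))  # → ('librdmacm.so.1.3.43.0', '0x918e')
```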
So it seems to be an issue in rdma-core, not in UCX.
Thank you for helping me find the root cause.
Closing, as this is not a UCX issue but a librdmacm issue.
Describe the bug
Client stack backtrace:
```
I0224 08:23:21.570152 67335 /orpc/orpc-dep/orpc/src/brpc/ucp_ctx.cpp:75] Running with ucp library version: 1.14.0
[wjw-roce-test231-m:67322:a:67355] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace (tid:  67355) ====
 0 /usr/local/ucx/lib/libucs.so.0(ucs_handle_error+0x124) [0x7fcd121fefd4]
 1 /usr/local/ucx/lib/libucs.so.0(+0x2e2fc) [0x7fcd121ff2fc]
 2 /usr/local/ucx/lib/libucs.so.0(+0x2e56b) [0x7fcd121ff56b]
 3 /lib64/libpthread.so.0(+0xf630) [0x7fcd11dc2630]
 4 /lib64/librdmacm.so.1(rdma_get_cm_event+0x39e) [0x7fcce66ff18e]
 5 /usr/local/ucx/lib/ucx/libuct_rdmacm.so.0(+0x5a7f) [0x7fcce6915a7f]
 6 /usr/local/ucx/lib/libucs.so.0(ucs_async_dispatch_handlers+0x160) [0x7fcd121e8710]
 7 /usr/local/ucx/lib/libucs.so.0(+0x1a1c3) [0x7fcd121eb1c3]
 8 /usr/local/ucx/lib/libucs.so.0(ucs_event_set_wait+0xa9) [0x7fcd12207fe9]
 9 /usr/local/ucx/lib/libucs.so.0(+0x1a4fc) [0x7fcd121eb4fc]
10 /lib64/libpthread.so.0(+0x7ea5) [0x7fcd11dbaea5]
11 /lib64/libc.so.6(clone+0x6d) [0x7fcd100c8b0d]
```
Segmentation fault

Server (the backtrace is the same shape as the client's):
```
[root@wjw-roce-test231-m bu]# cp ../*.pem ./
[root@wjw-roce-test231-m bu]# UCX_TLS=^tcp UCX_IB_GID_INDEX=3 UCX_NET_DEVICES=mlx5_19:1 ./multi_threaded_echo_server
I0224 11:20:01.514996 104434 /orpc/orpc-dep/orpc/src/brpc/ucp_ctx.cpp:75] Running with ucp library version: 1.14.0
I0224 11:20:02.259763 104434 /orpc/orpc-dep/orpc/src/brpc/ucp_acceptor.cpp:323] Ucp server is listening on IP 0.0.0.0 port 13339, idle connection check interval: -1s
I0224 11:20:02.259822 104434 /orpc/orpc-dep/orpc/src/brpc/server.cpp:1133] Server[example::EchoServiceImpl] is serving on port=8002.
I0224 11:20:02.260134 104434 /orpc/orpc-dep/orpc/src/brpc/server.cpp:1136] Check out http://wjw-roce-test231-m:8002 in web browser.
[wjw-roce-test231-m:104434:a:104538] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace (tid: 104538) ====
 0 /usr/local/ucx/lib/libucs.so.0(ucs_handle_error+0x124) [0x7f653a9c8374]
 1 /usr/local/ucx/lib/libucs.so.0(+0x2e69c) [0x7f653a9c869c]
 2 /usr/local/ucx/lib/libucs.so.0(+0x2e90b) [0x7f653a9c890b]
 3 /lib64/libpthread.so.0(+0xf630) [0x7f653a58b630]
 4 /usr/local/ucx/lib/ucx/libuct_rdmacm.so.0(+0x5c1b) [0x7f64ec3d1c1b]
 5 /usr/local/ucx/lib/libucs.so.0(ucs_async_dispatch_handlers+0x160) [0x7f653a9b18d0]
 6 /usr/local/ucx/lib/libucs.so.0(+0x1a563) [0x7f653a9b4563]
 7 /usr/local/ucx/lib/libucs.so.0(ucs_event_set_wait+0xa9) [0x7f653a9d1389]
 8 /usr/local/ucx/lib/libucs.so.0(+0x1a89c) [0x7f653a9b489c]
 9 /lib64/libpthread.so.0(+0x7ea5) [0x7f653a583ea5]
10 /lib64/libc.so.6(clone+0x6d) [0x7f6538891b0d]
Segmentation fault
```
Steps to Reproduce
Command line:
```
UCX_TLS=^tcp UCX_IB_GID_INDEX=3 UCX_NET_DEVICES=mlx5_19:1 ./multi_threaded_echo_client -server=x.x.x.x:13339 --use_ucp=true --thread_num=1 --brpc_ucp_worker_busy_poll=true --attachment_size=2048
```
UCX version used + UCX configure flags (can be checked by `ucx_info -v`): the ucx-1.14.x code was downloaded from the ucx-1.14.x branch tag on GitHub on 2/23/2023.

`ucx_info -v` output:
```
Library version: 1.14.0
Library path: /usr/lib64/libucs.so.0
API headers version: 1.14.0
Git branch '', revision f8877c5
Configured with: --build=x86_64-redhat-linux-gnu --host=x86_64-redhat-linux-gnu --program-prefix= --disable-dependency-tracking --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/libexec --localstatedir=/var --sharedstatedir=/var/lib --mandir=/usr/share/man --infodir=/usr/share/info --disable-optimizations --disable-logging --disable-debug --disable-assertions --enable-mt --disable-params-check --without-go --without-java --enable-cma --with-cuda --with-gdrcopy --with-verbs --with-knem --with-rdmacm --without-rocm --with-xpmem --without-fuse3 --without-ugni --with-cuda=/usr/local/cuda-11.7
```
Any UCX environment variables used: `UCX_TLS=^tcp UCX_IB_GID_INDEX=3 UCX_NET_DEVICES=mlx5_19:1` (as in the command line above).
Setup and versions
Output of `cat /etc/issue`, `cat /etc/redhat-release`, and `uname -a`:
```
[root@wjw-roce-test231-m bu]# cat /etc/issue
\S
Kernel \r on an \m
[root@wjw-roce-test231-m bu]# cat /etc/redhat-release
CentOS Linux release 7.6.1810 (Core)
[root@wjw-roce-test231-m bu]# uname -a
Linux wjw-roce-test231-m 3.10.0-957.27.2.el7.x86_64 #1 SMP Mon Jul 29 17:46:05 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
```
(`cat /etc/mlnx-release` identifies the software and firmware setup.)

For RDMA/IB/RoCE related issues:
Driver version (`rpm -q rdma-core` or `rpm -q libibverbs`, `ofed_info -s`):
```
[root@wjw-roce-test231-m bu]# rpm -q rmda-core
package rmda-core is not installed
[root@wjw-roce-test231-m bu]# rpm -qa | grep rdma
rdma-core-devel-58mlnx43-1.58112.x86_64
librdmacm-utils-58mlnx43-1.58112.x86_64
rdma-core-58mlnx43-1.58112.x86_64
librdmacm-58mlnx43-1.58112.x86_64
ucx-rdmacm-1.14.0-1.58112.x86_64
[root@wjw-roce-test231-m bu]# rpm -qa | grep libibverbs
libibverbs-58mlnx43-1.58112.x86_64
libibverbs-utils-58mlnx43-1.58112.x86_64
[root@wjw-roce-test231-m bu]# ofed_info -s
MLNX_OFED_LINUX-5.8-1.1.2.1:
```
HW information from `ibstat` or `ibv_devinfo -vv`:
```
[root@wjw-roce-test231-m bu]# show_gids
DEV      PORT  INDEX  GID                                      IPv4           VER  DEV
mlx5_19  1     0      fe80:0000:0000:0000:f816:05ff:fe26:942c                 v1   eth0
mlx5_19  1     1      fe80:0000:0000:0000:f816:05ff:fe26:942c                 v2   eth0
mlx5_19  1     2      0000:0000:0000:0000:0000:ffff:0a26:9b74  10.38.155.116  v1   eth0
mlx5_19  1     3      0000:0000:0000:0000:0000:ffff:0a26:9b74  10.38.155.116  v2   eth0
```
`ibstat` excerpt:
```
LMC: 0
SM lid: 0
Capability mask: 0x00010000
Port GUID: 0x0000000000000000
Link layer: Ethernet
CA 'mlx5_19'
CA type: MT4120
Number of ports: 1
Firmware version: 16.31.2006
Hardware version: 0
Node GUID: 0x0000000000000000
```
```
[root@wjw-roce-test231-m bu]# ibv_devinfo -vv | grep 5_19 -a5 -b5
GID[ 0]: fe80:0000:0000:0000:f816:92ff:fec2:f696, RoCE v1
GID[ 1]: fe80::f816:92ff:fec2:f696, RoCE v2
GID[ 2]: 0000:0000:0000:0000:0000:ffff:0a26:9849, RoCE v1
GID[ 3]: ::ffff:10.38.152.73, RoCE v2
hca_id: mlx5_19
transport: InfiniBand (0)
fw_ver: 16.31.2006
node_guid: 0000:0000:0000:0000
sys_image_guid: b8ce:f603:000c:29c8
vendor_id: 0x02c9
```
Peer-memory module: `lsmod | grep nv_peer_mem`, and/or gdrcopy: `lsmod | grep gdrdrv`.
`kubectl version`:
```
Client Version: version.Info{Major:"1", Minor:"12+", GitVersion:"v1.12.10-csp.10.9", GitCommit:"367a19c21ce71e1b0b6e99fc2dff3929b9f13bc8", GitTreeState:"clean", BuildDate:"2022-02-10T03:28:52Z", GoVersion:"go1.10.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"12+", GitVersion:"v1.12.10-csp.10.8", GitCommit:"ab4fd8847d5880ffcffaec92a73e4e7130ee49ca", GitTreeState:"clean", BuildDate:"2021-09-09T02:30:36Z", GoVersion:"go1.10.8", Compiler:"gc", Platform:"linux/amd64"}
```
Additional information (depending on the issue):
`ucx_info -d` to show transports and devices recognized by UCX.

Issue description: I run two RDMA apps, a client and a server, in pods. They are in the same namespace, in separate pods created by k8s. The RDMA device is an SR-IOV NIC. The crash happens as described above. If I run the same apps in two containers created directly with `docker run`, they work fine:

```
docker run --net=host --cap-add SYS_PTRACE --shm-size=8g --device=/dev/infiniband:/dev/infiniband:rw --name orpc-test-2 hub.xyz.com.orpc-rdma/orpc-rdma-depy:v1.0.0
```