redn-io / RedN

Arbitrary offloads for RDMA NICs
https://www.redn.io/

How to run linkedlist_bench? #2

Open LiYuTingxxn opened 2 years ago

LiYuTingxxn commented 2 years ago

Hello! I have been trying to run linkedlist_bench, but I get errors when I run it. Could you please tell me how to solve them? I have attached the error output below. I set OFFLOAD_COUNT=1 and LIST_SIZE=1. Thank you very much!

SERVER:
DEBUG[tid:15641][src/rdma/verbs.c:714]: POST --> SEND_ENABLE(WR#1) [master = 0] [worker = 3]
updating scur_post 0 by 1 (original size 2)
DEBUG[tid:15641][src/rdma/verbs.c:1112]: POST --> RECV (WR #99) [send_fd: 2, qp_num 5323]
DEBUG[tid:15641][src/rdma/verbs.c:1115]: ----------- sge0 [addr 7f46d8339129, length 3]
mlx5: fs240: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000002 00000000 00000000 00000000
00000000 9d003304 000014cb 00006ae2
COMPLETION FAILURE on sockfd 2 (SEND WR #99) status[4] = local protection error

CLIENT:
Starting benchmark ...
--> Send GET [key 1000]
DEBUG[tid:72599][src/rdma/verbs.c:1006]: POST --> RDMA_SEND_IMM (SEND WR 1) [send_fd: 2 batch_size: 1]
updating scur_post 0 by 1 (original size 2)
DEBUG[tid:72599][src/rdma/verbs.c:714]: POST --> SEND_ENABLE(WR#1) [master = 0] [worker = 2]
updating scur_post 0 by 1 (original size 2)
mlx5: fs237: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000002 00000000 00000000 00000000
00000000 00008914 0b000ded 000059d2
COMPLETION FAILURE on sockfd 2 (SEND WR #1) status[11] = remote operation error
DEBUG[tid:72602][src/rdma/connection.c:521]: received event[10]: RDMA_CM_EVENT_DISCONNECTED
DEBUG[tid:72602][src/rdma/connection.c:586]: trigger disconnection callback
Connection terminated [sockfd:2]
DEBUG[tid:72602][src/rdma/connection.c:1277]: terminating connection on socket #2
modify state for socket #2 from 2 to -1
DEBUG[tid:72602][src/rdma/connection.c:521]: received event[15]: RDMA_CM_EVENT_TIMEWAIT_EXIT
DEBUG[tid:72602][src/rdma/connection.c:1318]: clearing connection metadata for socket #2
DEBUG[tid:72602][src/rdma/connection.c:1343]: deregistering msg_send_mr[addr:7f39840a9000, len:304]
DEBUG[tid:72602][src/rdma/connection.c:1345]: deregistering msg_rcv_mr[addr:7f39840aa000, len:304]

ChengjunJia commented 2 years ago

I hope the following is helpful. When working with the RedN source code, prefer the hash test case; the other test cases may have problems. Anyone who wants to run RedN can use the hash micro-benchmark. The linkedlist code may have some bugs, and the memcached test cases cannot be run (their Makefiles are missing).

In hash_bench.c, the last command is IBV_RECEIVE_SG(client, recv_meta, worker_wq_mr->lkey); This command sets up the cross-channel communication from the client QP (which receives the message from the other server) to the worker QP (which holds the sequence of WQEs that perform the lookup, i.e. the core design in RedN). The client QP needs to modify some WQEs in the worker QP, so the key must grant access to the worker's wq_buffer. worker_wq_mr->lkey comes from register_wq(worker, worker), which calls ibv_reg_mr to register the buffer returned by ibv_ex_get_wq_buffer for the worker QP. As a result, the client QP holds a key that allows it to modify the worker QP's WQEs (i.e. a memory region).

The main problem with the linkedlist bench is memory region protection. It uses the key mr_local_key(worker, mr_get_sq_idx(worker)) for the client QP to modify the worker QP's WQEs, but that key belongs to the DATA region, not to the worker's WQ buffer! As a result, the first WQE in the client QP, whose sockfd is 2, fails with the local protection error.
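To make the key mismatch concrete, here is a minimal toy model of the check a NIC performs on each scatter/gather entry. It is only a sketch: `struct toy_mr` and `toy_sge_allowed` are hypothetical names invented for illustration, not part of RedN or libibverbs, and the real check happens in hardware.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for a registered memory region (hypothetical, not a verbs struct). */
struct toy_mr {
    uint32_t  lkey;   /* local key handed out when the buffer was registered */
    uintptr_t base;   /* start address of the registered buffer */
    size_t    length; /* size of the registered buffer in bytes */
};

/*
 * Return 1 if the range [addr, addr + len) lies entirely inside the region
 * identified by lkey, and 0 otherwise (unknown key or out-of-range access).
 */
static int toy_sge_allowed(const struct toy_mr *mrs, size_t n_mrs,
                           uint32_t lkey, uintptr_t addr, size_t len)
{
    assert(mrs != NULL || n_mrs == 0);
    for (size_t i = 0; i < n_mrs; i++) {
        const struct toy_mr *mr = &mrs[i];
        if (mr->lkey != lkey)
            continue;
        if (addr < mr->base)
            return 0;
        size_t off = (size_t)(addr - mr->base);
        /* Written this way to avoid overflow in addr + len. */
        return off <= mr->length && len <= mr->length - off;
    }
    return 0; /* no region was registered under this key */
}
```

In this model, an address inside the WQ buffer passes only when it is paired with the key of the region that covers that buffer. Pairing it with the DATA region's key is rejected, which corresponds to the "local protection error" status in the server log above.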

In theory, disable_wqe_checks.sh should turn off the protection check, but it fails on my server (judging by the output, the NIC register configuration does not change). I am not sure whether my NIC firmware is at fault or the OFED version is the problem (I have to use the latest 4.9 to match my OS). If you run into similar problems, I recommend using the hash micro-benchmark instead of the others.

authwork commented 1 year ago

0000:af:00.0 'MT27710 Family [ConnectX-4 Lx] 1015' if=eno5 drv=mlx5_core unused= *Active*
0000:af:00.1 'MT27710 Family [ConnectX-4 Lx] 1015' if=eno6 drv=mlx5_core unused=

eno5: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        inet 192.168.10.3  netmask 255.255.255.0  broadcast 192.168.10.255
        ether 70:fd:45:af:ed:30  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

The following error appears. How can I address it, please?

-i eno5
Starting program: RedN/bench/micro/hash_bench -i eno5
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Mapping dram memory: size 268265456 bytes
Mapping dram memory: size 268265456 bytes
DEBUG[tid:2287139][src/rdma/connection.c:993]: initializing RC module
[New Thread 0x7fffd7b3d700 (LWP 2287144)]
DEBUG[tid:2287139][src/rdma/agent.c:123]: attempting to add connection to 192.168.10.3:12345

Thread 2 "hash_bench" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffd7b3d700 (LWP 2287144)]
0x00007ffff7c1f4f6 in rdma_bind_addr (id=0x0, addr=0x7fffd7b3cec0) at /usr/include/x86_64-linux-gnu/bits/string_fortified.h:71
71        return __builtin___memset_chk (__dest, __ch, __len, __bos0 (__dest));
(gdb) bt
#0  0x00007ffff7c1f4f6 in rdma_bind_addr (id=0x0, addr=0x7fffd7b3cec0) at /usr/include/x86_64-linux-gnu/bits/string_fortified.h:71
#1  0x00007ffff7fbd5d6 in server_loop (port=0x7ffff7fc7450 <port>) at src/rdma/agent.c:104
#2  0x00007ffff7e2d609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#3  0x00007ffff7d52133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

authwork commented 1 year ago

Here is the result from another machine. Please take a look, @wreda @ChengjunJia. Thanks a lot.

r -i enp175s0f0
Starting program: ./RedN/bench/micro/hash_bench -i enp175s0f0
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Mapping dram memory: size 268265456 bytes
Mapping dram memory: size 268265456 bytes
DEBUG[tid:20204][src/rdma/connection.c:993]: initializing RC module
[New Thread 0x7fffd7b25700 (LWP 20208)]
[RDMA-Server] Listening on port 12345 for connections. interrupt (^C) to exit.
DEBUG[tid:20204][src/rdma/agent.c:128]: attempting to add connection to 192.168.10.35:12345
DEBUG[tid:20204][src/rdma/connection.c:56]: adding connection on socket #0
[RDMA-Client] Creating connection (status:pending) to 192.168.10.35:12345 on sockfd 0
---- Initializing hashmap ----
bucket addr 140737080455168
bucket[0] key=232 addr=140737080455179
bucket[1] key=233 addr=140737080717334
bucket[2] key=234 addr=140737080979489
bucket[3] key=235 addr=140737081241644
bucket[4] key=236 addr=140737081503799
bucket[5] key=237 addr=140737081765954
bucket[6] key=238 addr=140737082028109
bucket[7] key=239 addr=140737082290264
bucket[8] key=240 addr=140737082552419
bucket[9] key=241 addr=140737082814574
DEBUG[tid:20208][src/rdma/connection.c:521]: received event[0]: RDMA_CM_EVENT_ADDR_RESOLVED
DEBUG[tid:20208][src/rdma/connection.c:521]: received event[2]: RDMA_CM_EVENT_ROUTE_RESOLVED
DEBUG[tid:20208][src/rdma/connection.c:324]: initializing rdma device-0
creating background thread to poll completions (blocking)
[New Thread 0x7fffd72e2700 (LWP 20210)]
DEBUG[tid:20208][src/rdma/connection.c:458]: Creating QP for sock #0 [SendQ - size: 1024] [RecvQ - size: 1024] flags 0
queue pair creation failed [error code: 12]
Couldn't read debug register: No such process.
(gdb) [Thread 0x7fffd72e2700 (LWP 20210) exited]
[Thread 0x7fffd7b25700 (LWP 20208) exited]
[Inferior 1 (process 20204) exited with code 01]
(gdb) bt
No stack.

ChengjunJia commented 1 year ago

@authwork Can you run perftest correctly on your server? According to the error output, the QP creation did not succeed. This looks like a problem with the RDMA driver or your configuration.
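As a starting point for checking the configuration, the small sketch below decodes the reported code and reads the locked-memory limit. It assumes that the "error code: 12" in the log is an errno value (12 is ENOMEM on Linux) and that the locked-memory limit is one setting worth inspecting; neither is confirmed by the RedN source. The helper names `explain_qp_error` and `memlock_limit` are hypothetical.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/resource.h>

/*
 * Print the textual meaning of an errno-style code.
 * Returns 1 if the code is ENOMEM, 0 otherwise.
 * Assumption: the code printed by the benchmark is an errno value.
 */
static int explain_qp_error(int code)
{
    fprintf(stderr, "QP creation error %d: %s\n", code, strerror(code));
    return code == ENOMEM;
}

/*
 * Read the locked-memory limit of the current process (what `ulimit -l` shows).
 * Returns 0 on success and -1 on failure, like getrlimit itself.
 */
static int memlock_limit(struct rlimit *out)
{
    return getrlimit(RLIMIT_MEMLOCK, out);
}

/* Print the soft limit in bytes, or "unlimited". */
static void report_memlock(void)
{
    struct rlimit lim;
    if (memlock_limit(&lim) != 0) {
        perror("getrlimit(RLIMIT_MEMLOCK)");
        return;
    }
    if (lim.rlim_cur == RLIM_INFINITY)
        printf("locked-memory soft limit: unlimited\n");
    else
        printf("locked-memory soft limit: %llu bytes\n",
               (unsigned long long)lim.rlim_cur);
}
```

Calling `explain_qp_error(12)` followed by `report_memlock()` from a small `main` on the affected host shows how the code decodes and what the limit is. If the limit looks fine and perftest also fails, a driver or firmware problem becomes the more likely explanation.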

authwork commented 1 year ago

Yes. I reinstalled the OS and the driver, which solved the issue above. Thanks, @ChengjunJia.