ruihong123 / dLSM

dLSM: An LSM-Based Index for RDMA-Enabled Memory Disaggregation
BSD 3-Clause "New" or "Revised" License

Got bad completion with status 0xc. #7

Closed xiangpingzhang closed 3 months ago

xiangpingzhang commented 3 months ago

The compute node received a bad completion with status 0xc after running the following command:

Memory Node: ./Server
Compute Node: ./db_bench --benchmarks=fillrandom,readrandom,readrandom,readrandomwriterandom --threads=1 --value_size=400 --num=100000000 --bloom_bits=10 --readwritepercent=5 --compute_node_id=0 --fixed_compute_shards_num=0

The complete log is as follows:

Mark: valgrind socket info1
searching for IB devices in host
found 4 device(s)
New MR was registered with addr=0x7f0aee555010, lkey=0x954f, rkey=0x954f, flags=0x7, size=10240000, total registered size is 0.009537, chunk type is 1
Max utilization is 4.028360
New MR was registered with addr=0x7f0aedb90010, lkey=0x2b8b, rkey=0x2b8b, flags=0x7, size=10240000, total registered size is 0.019073, chunk type is 1
SST buffer, send&receive buffer were registered with a
maximum outstanding wr number is 32768
maximum query pair number is 262144
maximum completion queue number is 16777216
maximum memory region number is 16777216
maximum memory region size is 18446744073709551615
Success to connect to 192.168.6.2
TCP connection was established
connect to node id 0
QP was created, QP number=0x92

Local LID = 0x0
target node id 0
total bytes: 23
read byte: 23Remote QP number = 0xf0
Remote LID = 0x0
Remote GID =fe:80:00:00:00:00:00:00:9a:f2:b3:ff:fe:c8:b8:d4
QP 0x7f0ae8002238 state was change to RTS
total bytes: 1
Max utilization is 6.042539
Max utilization is 6.042539
Max utilization is 6.042539
Max utilization is 6.042539
Max utilization is 6.042539
Max utilization is 6.042539
Max utilization is 6.042539
......
Max utilization is 6.042539
Max utilization is 6.042539
Max utilization is 6.042539
Max utilization is 6.042539
read byte: 1Finish the connection with node 0
Max utilization is 6.042539
Max utilization is 6.042539
Max utilization is 6.042539
Max utilization is 6.042539
Max utilization is 6.042539
New MR was registered with addr=0x7f0aa7fff010, lkey=0x235b, rkey=0x235b, flags=0x7, size=1073741824, total registered size is 1.019073, chunk type is 7
TimberSaw: version 1.22
Start to sync options
client handling thread
Date: Thu Jul 4 06:13:09 2024
Max utilization is 6.042539
CPU: 20 * Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40GHz
CPUCache:
Keys: 20 bytes each
Values: 400 bytes each (200 bytes after compression)
Entries: 100000000
RawSize: 40054.3 MB (estimated)
FileSize: 20980.8 MB (estimated)

DBImpl start
New MR was registered with addr=0x7f0aa5ffe010, lkey=0x8590, rkey=0x8590, flags=0x7, size=33554432, total registered size is 1.050323, chunk type is 2
Memory used up, Initially, allocate new one, memory pool is Version_edit, total memory this pool is 1
Max utilization is 10.070899
communication thread created
DBImpl finished
Refresher start
DBImpl deallocated
Cache entried used is 0.000000
Version level 0 contain 0 files
Version level 1 contain 0 files
Version level 2 contain 0 files
Version level 3 contain 0 files
Version level 4 contain 0 files
Version level 5 contain 0 files
Total file size is 0.000000
Total number of entries within the cache is 0
DBImpl start
communication thread created
DBImpl finished
Refresher start
validation write finished
start front-end threads
Wait for thread start
node id 1
total bytes: 1
read byte: 1sync wait time is 142220
Threads start to run
Max utilization is 16.113439
New MR was registered with addr=0x7f0a8bfff010, lkey=0x9b5e, rkey=0x9b5e, flags=0x7, size=134217728, total registered size is 1.175323, chunk type is 6
Memory used up, Initially, allocate new one, memory pool is FlushBuffer, total memory this pool is 1
New MR was registered with addr=0x7f0a84ffe010, lkey=0x8f4a, rkey=0x8f4a, flags=0x7, size=117440512, total registered size is 1.284698, chunk type is 3
Memory used up, Initially, allocate new one, memory pool is IndexChunk, total memory this pool is 1
New MR was registered with addr=0x7f0a9d7fe010, lkey=0xb595, rkey=0xb595, flags=0x7, size=33554432, total registered size is 1.315948, chunk type is 5
Memory used up, Initially, allocate new one, memory pool is FilterChunk, total memory this pool is 1
Remote memory registeration, size: 1073741824
polled reply bufferrops
QP was created, QP number=0x93

QP num to be sent = 0x93
Local LID = 0x0
QP was created, QP number=0x94
QP num to be sent = 0x94
Local LID = 0x0
Remote QP number=0xf1
Remote LID = 0x0ffer
Remote GID =fe:80:00:00:00:00:00:00:9a:f2:b3:ff:fe:c8:b8:d4
QP 0x7f0a94004db8 state was change to RTS
Remote QP number=0xf2ps
Remote LID = 0x0
Remote GID =fe:80:00:00:00:00:00:00:9a:f2:b3:ff:fe:c8:b8:d4
QP 0x7f0a780414e8 state was change to RTS
Max utilization is 20.141798
Remote memory registeration, size: 1073741824
polled reply bufferr
New MR was registered with addr=0x7f0a6ffff010, lkey=0x8698, rkey=0x8698, flags=0x7, size=134217728, total registered size is 1.440948, chunk type is 6
Memory used up, allocate new one, memory pool is FlushBuffer, total memory is 25436160
Max utilization is 22.155978
QP was created, QP number=0x95

QP num to be sent = 0x95
Local LID = 0x0
Remote QP number=0xf3
Remote LID = 0x0
Remote GID =fe:80:00:00:00:00:00:00:9a:f2:b3:ff:fe:c8:b8:d4
QP 0x7f0a680414e8 state was change to RTS
Max utilization is 24.170158
Max utilization is 26.184338
QP was created, QP number=0x96

QP num to be sent = 0x96
Local LID = 0x0
Remote QP number=0xf4
Remote LID = 0x0
Remote GID =fe:80:00:00:00:00:00:00:9a:f2:b3:ff:fe:c8:b8:d4
QP 0x7f0a640414e8 state was change to RTS
Max utilization is 28.198518
New MR was registered with addr=0x7f0a57fff010, lkey=0xc3c4, rkey=0xc3c4, flags=0x7, size=134217728, total registered size is 1.565948, chunk type is 6
Memory used up, allocate new one, memory pool is FlushBuffer, total memory is 33824768
Max utilization is 30.212697
Max utilization is 30.212697
Max utilization is 30.212697
Max utilization is 30.212697
Max utilization is 30.212697
Max utilization is 30.212697
Max utilization is 30.212697
Max utilization is 34.241057
Max utilization is 34.241057
number 0 got bad completion with status: 0xc, vendor syndrome: 0x81
number 1 got bad completion with status: 0x5, vendor syndrome: 0xf9
number 2 got bad completion with status: 0x5, vendor syndrome: 0xf9
number 3 got bad completion with status: 0x5, vendor syndrome: 0xf9
number 4 got bad completion with status: 0x5, vendor syndrome: 0xf9
number 5 got bad completion with status: 0x5, vendor syndrome: 0xf9
number 6 got bad completion with status: 0x5, vendor syndrome: 0xf9
number 7 got bad completion with status: 0x5, vendor syndrome: 0xf9
number 8 got bad completion with status: 0x5, vendor syndrome: 0xf9
number 9 got bad completion with status: 0x5, vendor syndrome: 0xf9
number 10 got bad completion with status: 0x5, vendor syndrome: 0xf9
number 0 got bad completion with status: 0xc, vendor syndrome: 0x81
number 1 got bad completion with status: 0x5, vendor syndrome: 0xf9
number 2 got bad completion with status: 0x5, vendor syndrome: 0xf9
number 3 got bad completion with status: 0x5, vendor syndrome: 0xf9
number 4 got bad completion with status: 0x5, vendor syndrome: 0xf9
number 5 got bad completion with status: 0x5, vendor syndrome: 0xf9
number 6 got bad completion with status: 0x5, vendor syndrome: 0xf9
number 7 got bad completion with status: 0x5, vendor syndrome: 0xf9
number 8 got bad completion with status: 0x5, vendor syndrome: 0xf9
number 9 got bad completion with status: 0x5, vendor syndrome: 0xf9
New MR was registered with addr=0x7f0a6cffe010, lkey=0xbea1, rkey=0xbea1, flags=0x7, size=50331648, total registered size is 1.612823, chunk type is 4
Memory used up, Initially, allocate new one, memory pool is IndexChunk_Small, total memory this pool is 1
QP was created, QP number=0x97

QP num to be sent = 0x97
Local LID = 0x0
QP was created, QP number=0x98
QP num to be sent = 0x98
Local LID = 0x0
Remote QP number=0xf5
Remote LID = 0x0ffer
Remote GID =fe:80:00:00:00:00:00:00:9a:f2:b3:ff:fe:c8:b8:d4
Remote QP number=0xf6
Remote LID = 0x0
Remote GID =fe:80:00:00:00:00:00:00:9a:f2:b3:ff:fe:c8:b8:d4
QP 0x7f0a940e4c78 state was change to RTS
QP 0x7f0a784e3088 state was change to RTS
number 0 got bad completion with status: 0xc, vendor syndrome: 0x81
number 1 got bad completion with status: 0x5, vendor syndrome: 0xf9
number 2 got bad completion with status: 0x5, vendor syndrome: 0xf9
number 3 got bad completion with status: 0x5, vendor syndrome: 0xf9
number 4 got bad completion with status: 0x5, vendor syndrome: 0xf9
number 5 got bad completion with status: 0x5, vendor syndrome: 0xf9
number 6 got bad completion with status: 0x5, vendor syndrome: 0xf9
number 7 got bad completion with status: 0x5, vendor syndrome: 0xf9
number 8 got bad completion with status: 0x5, vendor syndrome: 0xf9
number 9 got bad completion with status: 0x5, vendor syndrome: 0xf9
number 10 got bad completion with status: 0x5, vendor syndrome: 0xf9
QP was created, QP number=0x99

QP num to be sent = 0x99
Local LID = 0x0
Remote QP number=0xf7
Remote LID = 0x0
Remote GID =fe:80:00:00:00:00:00:00:9a:f2:b3:ff:fe:c8:b8:d4
QP 0x7f0a680e2528 state was change to RTS
number 0 got bad completion with status: 0xc, vendor syndrome: 0x81
number 1 got bad completion with status: 0x5, vendor syndrome: 0xf9
number 2 got bad completion with status: 0x5, vendor syndrome: 0xf9
number 3 got bad completion with status: 0x5, vendor syndrome: 0xf9
number 4 got bad completion with status: 0x5, vendor syndrome: 0xf9
number 5 got bad completion with status: 0x5, vendor syndrome: 0xf9
number 6 got bad completion with status: 0x5, vendor syndrome: 0xf9
number 7 got bad completion with status: 0x5, vendor syndrome: 0xf9
number 8 got bad completion with status: 0x5, vendor syndrome: 0xf9
number 9 got bad completion with status: 0x5, vendor syndrome: 0xf9
number 10 got bad completion with status: 0x5, vendor syndrome: 0xf9
QP was created, QP number=0x9a

QP num to be sent = 0x9a
Local LID = 0x0
Remote QP number=0xf8
Remote LID = 0x0
Remote GID =fe:80:00:00:00:00:00:00:9a:f2:b3:ff:fe:c8:b8:d4
QP 0x7f0a640e27b8 state was change to RTS
number 0 got bad completion with status: 0xc, vendor syndrome: 0x81
RDMA Read Failed q id isread_local QP number=0x92
number 0 got bad completion with status: 0x5, vendor syndrome: 0xf9
RDMA Read Failed q id isread_local QP number=0x92
corrupt bloom filter
thread local qp destroy successfully!
thread local cq destroy successfully!
thread local qp destroy successfully!
thread local cq destroy successfully!

Before running the code, I used the perftest tool to test the RDMA network, and it worked well. Can you give me some suggestions? I appreciate it!
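
For reference when reading the log above: the `status` values can be mapped to the names in the libibverbs `ibv_wc_status` enum. The sketch below is only a lookup aid, assuming the numeric values defined in `<infiniband/verbs.h>` (entries 0 through 21). The helper names `decode_wc_status` and `summarize_bad_completions` are hypothetical and are not part of dLSM. Under that mapping, 0xc is `IBV_WC_RETRY_EXC_ERR` (transport retry counter exceeded) and 0x5 is `IBV_WC_WR_FLUSH_ERR`, which is usually reported for work requests that were still outstanding when the QP entered the error state.

```python
import re
from collections import Counter

# Names in the order of enum ibv_wc_status in <infiniband/verbs.h>;
# the list index is the numeric status code (assumed, entries 0..21 only).
IBV_WC_STATUS = [
    "IBV_WC_SUCCESS", "IBV_WC_LOC_LEN_ERR", "IBV_WC_LOC_QP_OP_ERR",
    "IBV_WC_LOC_EEC_OP_ERR", "IBV_WC_LOC_PROT_ERR", "IBV_WC_WR_FLUSH_ERR",
    "IBV_WC_MW_BIND_ERR", "IBV_WC_BAD_RESP_ERR", "IBV_WC_LOC_ACCESS_ERR",
    "IBV_WC_REM_INV_REQ_ERR", "IBV_WC_REM_ACCESS_ERR", "IBV_WC_REM_OP_ERR",
    "IBV_WC_RETRY_EXC_ERR", "IBV_WC_RNR_RETRY_EXC_ERR",
    "IBV_WC_LOC_RDD_VIOL_ERR", "IBV_WC_REM_INV_RD_REQ_ERR",
    "IBV_WC_REM_ABORT_ERR", "IBV_WC_INV_EECN_ERR",
    "IBV_WC_INV_EEC_STATE_ERR", "IBV_WC_FATAL_ERR",
    "IBV_WC_RESP_TIMEOUT_ERR", "IBV_WC_GENERAL_ERR",
]

# Matches the "status: 0x.." field of the "got bad completion" log lines.
STATUS_RE = re.compile(r"got bad completion with status: (0x[0-9a-fA-F]+)")


def decode_wc_status(code: int) -> str:
    """Return the enum name for a status code, or a marker if out of range."""
    if 0 <= code < len(IBV_WC_STATUS):
        return IBV_WC_STATUS[code]
    return f"UNKNOWN(0x{code:x})"


def summarize_bad_completions(log_text: str) -> Counter:
    """Count bad completions in a log, keyed by decoded status name."""
    return Counter(
        decode_wc_status(int(m.group(1), 16))
        for m in STATUS_RE.finditer(log_text)
    )


if __name__ == "__main__":
    sample = (
        "number 0 got bad completion with status: 0xc, vendor syndrome: 0x81\n"
        "number 1 got bad completion with status: 0x5, vendor syndrome: 0xf9\n"
    )
    for name, count in summarize_bad_completions(sample).items():
        print(name, count)
```

The vendor syndrome values (0x81, 0xf9) are NIC-specific and are deliberately not decoded here.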