The compute node received a bad completion with status 0xc after I ran the following commands on the memory node and the compute node:
Memory Node: ./Server
Compute Node: ./db_bench --benchmarks=fillrandom,readrandom,readrandom,readrandomwriterandom --threads=1 --value_size=400 --num=100000000 --bloom_bits=10 --readwritepercent=5 --compute_node_id=0 --fixed_compute_shards_num=0
The complete log is as follows:
Mark: valgrind socket info1
searching for IB devices in host
found 4 device(s)
New MR was registered with addr=0x7f0aee555010, lkey=0x954f, rkey=0x954f, flags=0x7, size=10240000, total registered size is 0.009537, chunk type is 1
Max utilization is 4.028360
New MR was registered with addr=0x7f0aedb90010, lkey=0x2b8b, rkey=0x2b8b, flags=0x7, size=10240000, total registered size is 0.019073, chunk type is 1
SST buffer, send&receive buffer were registered with a
maximum outstanding wr number is 32768
maximum query pair number is 262144
maximum completion queue number is 16777216
maximum memory region number is 16777216
maximum memory region size is 18446744073709551615
Success to connect to 192.168.6.2
TCP connection was established
connect to node id 0
QP was created, QP number=0x92
Local LID = 0x0 target node id 0
total bytes: 23
read byte: 23Remote QP number = 0xf0
Remote LID = 0x0
Remote GID =fe:80:00:00:00:00:00:00:9a:f2:b3:ff:fe:c8:b8:d4
QP 0x7f0ae8002238 state was change to RTS
total bytes: 1
Max utilization is 6.042539
Max utilization is 6.042539
Max utilization is 6.042539
Max utilization is 6.042539
Max utilization is 6.042539
Max utilization is 6.042539
Max utilization is 6.042539
......
Max utilization is 6.042539
Max utilization is 6.042539
Max utilization is 6.042539
Max utilization is 6.042539
read byte: 1Finish the connection with node 0
Max utilization is 6.042539
Max utilization is 6.042539
Max utilization is 6.042539
Max utilization is 6.042539
Max utilization is 6.042539
New MR was registered with addr=0x7f0aa7fff010, lkey=0x235b, rkey=0x235b, flags=0x7, size=1073741824, total registered size is 1.019073, chunk type is 7
TimberSaw: version 1.22
Start to sync options
client handling thread
Date: Thu Jul 4 06:13:09 2024
Max utilization is 6.042539
CPU: 20 * Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40GHz
CPUCache:
Keys: 20 bytes each
Values: 400 bytes each (200 bytes after compression)
Entries: 100000000
RawSize: 40054.3 MB (estimated)
FileSize: 20980.8 MB (estimated)
DBImpl start
New MR was registered with addr=0x7f0aa5ffe010, lkey=0x8590, rkey=0x8590, flags=0x7, size=33554432, total registered size is 1.050323, chunk type is 2
Memory used up, Initially, allocate new one, memory pool is Version_edit, total memory this pool is 1
Max utilization is 10.070899
communication thread created
DBImpl finished
Refresher start
DBImpl deallocated
Cache entried used is 0.000000
Version level 0 contain 0 files
Version level 1 contain 0 files
Version level 2 contain 0 files
Version level 3 contain 0 files
Version level 4 contain 0 files
Version level 5 contain 0 files
Total file size is 0.000000
Total number of entries within the cache is 0
DBImpl start
communication thread created
DBImpl finished
Refresher start
validation write finished
start front-end threads
Wait for thread start
node id 1
total bytes: 1
read byte: 1sync wait time is 142220
Threads start to run
Max utilization is 16.113439
New MR was registered with addr=0x7f0a8bfff010, lkey=0x9b5e, rkey=0x9b5e, flags=0x7, size=134217728, total registered size is 1.175323, chunk type is 6
Memory used up, Initially, allocate new one, memory pool is FlushBuffer, total memory this pool is 1
New MR was registered with addr=0x7f0a84ffe010, lkey=0x8f4a, rkey=0x8f4a, flags=0x7, size=117440512, total registered size is 1.284698, chunk type is 3
Memory used up, Initially, allocate new one, memory pool is IndexChunk, total memory this pool is 1
New MR was registered with addr=0x7f0a9d7fe010, lkey=0xb595, rkey=0xb595, flags=0x7, size=33554432, total registered size is 1.315948, chunk type is 5
Memory used up, Initially, allocate new one, memory pool is FilterChunk, total memory this pool is 1
Remote memory registeration, size: 1073741824
polled reply bufferrops
QP was created, QP number=0x93
QP num to be sent = 0x93
Local LID = 0x0
QP was created, QP number=0x94
QP num to be sent = 0x94
Local LID = 0x0
Remote QP number=0xf1
Remote LID = 0x0ffer
Remote GID =fe:80:00:00:00:00:00:00:9a:f2:b3:ff:fe:c8:b8:d4
QP 0x7f0a94004db8 state was change to RTS
Remote QP number=0xf2ps
Remote LID = 0x0
Remote GID =fe:80:00:00:00:00:00:00:9a:f2:b3:ff:fe:c8:b8:d4
QP 0x7f0a780414e8 state was change to RTS
Max utilization is 20.141798
Remote memory registeration, size: 1073741824
polled reply bufferr
New MR was registered with addr=0x7f0a6ffff010, lkey=0x8698, rkey=0x8698, flags=0x7, size=134217728, total registered size is 1.440948, chunk type is 6
Memory used up, allocate new one, memory pool is FlushBuffer, total memory is 25436160
Max utilization is 22.155978
QP was created, QP number=0x95
QP num to be sent = 0x95
Local LID = 0x0
Remote QP number=0xf3
Remote LID = 0x0
Remote GID =fe:80:00:00:00:00:00:00:9a:f2:b3:ff:fe:c8:b8:d4
QP 0x7f0a680414e8 state was change to RTS
Max utilization is 24.170158
Max utilization is 26.184338
QP was created, QP number=0x96
QP num to be sent = 0x96
Local LID = 0x0
Remote QP number=0xf4
Remote LID = 0x0
Remote GID =fe:80:00:00:00:00:00:00:9a:f2:b3:ff:fe:c8:b8:d4
QP 0x7f0a640414e8 state was change to RTS
Max utilization is 28.198518
New MR was registered with addr=0x7f0a57fff010, lkey=0xc3c4, rkey=0xc3c4, flags=0x7, size=134217728, total registered size is 1.565948, chunk type is 6
Memory used up, allocate new one, memory pool is FlushBuffer, total memory is 33824768
Max utilization is 30.212697
Max utilization is 30.212697
Max utilization is 30.212697
Max utilization is 30.212697
Max utilization is 30.212697
Max utilization is 30.212697
Max utilization is 30.212697
Max utilization is 34.241057
Max utilization is 34.241057
number 0 got bad completion with status: 0xc, vendor syndrome: 0x81
number 1 got bad completion with status: 0x5, vendor syndrome: 0xf9
number 2 got bad completion with status: 0x5, vendor syndrome: 0xf9
number 3 got bad completion with status: 0x5, vendor syndrome: 0xf9
number 4 got bad completion with status: 0x5, vendor syndrome: 0xf9
number 5 got bad completion with status: 0x5, vendor syndrome: 0xf9
number 6 got bad completion with status: 0x5, vendor syndrome: 0xf9
number 7 got bad completion with status: 0x5, vendor syndrome: 0xf9
number 8 got bad completion with status: 0x5, vendor syndrome: 0xf9
number 9 got bad completion with status: 0x5, vendor syndrome: 0xf9
number 10 got bad completion with status: 0x5, vendor syndrome: 0xf9
number 0 got bad completion with status: 0xc, vendor syndrome: 0x81
number 1 got bad completion with status: 0x5, vendor syndrome: 0xf9
number 2 got bad completion with status: 0x5, vendor syndrome: 0xf9
number 3 got bad completion with status: 0x5, vendor syndrome: 0xf9
number 4 got bad completion with status: 0x5, vendor syndrome: 0xf9
number 5 got bad completion with status: 0x5, vendor syndrome: 0xf9
number 6 got bad completion with status: 0x5, vendor syndrome: 0xf9
number 7 got bad completion with status: 0x5, vendor syndrome: 0xf9
number 8 got bad completion with status: 0x5, vendor syndrome: 0xf9
number 9 got bad completion with status: 0x5, vendor syndrome: 0xf9
New MR was registered with addr=0x7f0a6cffe010, lkey=0xbea1, rkey=0xbea1, flags=0x7, size=50331648, total registered size is 1.612823, chunk type is 4
Memory used up, Initially, allocate new one, memory pool is IndexChunk_Small, total memory this pool is 1
QP was created, QP number=0x97
QP num to be sent = 0x97
Local LID = 0x0
QP was created, QP number=0x98
QP num to be sent = 0x98
Local LID = 0x0
Remote QP number=0xf5
Remote LID = 0x0ffer
Remote GID =fe:80:00:00:00:00:00:00:9a:f2:b3:ff:fe:c8:b8:d4
Remote QP number=0xf6
Remote LID = 0x0
Remote GID =fe:80:00:00:00:00:00:00:9a:f2:b3:ff:fe:c8:b8:d4
QP 0x7f0a940e4c78 state was change to RTS
QP 0x7f0a784e3088 state was change to RTS
number 0 got bad completion with status: 0xc, vendor syndrome: 0x81
number 1 got bad completion with status: 0x5, vendor syndrome: 0xf9
number 2 got bad completion with status: 0x5, vendor syndrome: 0xf9
number 3 got bad completion with status: 0x5, vendor syndrome: 0xf9
number 4 got bad completion with status: 0x5, vendor syndrome: 0xf9
number 5 got bad completion with status: 0x5, vendor syndrome: 0xf9
number 6 got bad completion with status: 0x5, vendor syndrome: 0xf9
number 7 got bad completion with status: 0x5, vendor syndrome: 0xf9
number 8 got bad completion with status: 0x5, vendor syndrome: 0xf9
number 9 got bad completion with status: 0x5, vendor syndrome: 0xf9
number 10 got bad completion with status: 0x5, vendor syndrome: 0xf9
QP was created, QP number=0x99
QP num to be sent = 0x99
Local LID = 0x0
Remote QP number=0xf7
Remote LID = 0x0
Remote GID =fe:80:00:00:00:00:00:00:9a:f2:b3:ff:fe:c8:b8:d4
QP 0x7f0a680e2528 state was change to RTS
number 0 got bad completion with status: 0xc, vendor syndrome: 0x81
number 1 got bad completion with status: 0x5, vendor syndrome: 0xf9
number 2 got bad completion with status: 0x5, vendor syndrome: 0xf9
number 3 got bad completion with status: 0x5, vendor syndrome: 0xf9
number 4 got bad completion with status: 0x5, vendor syndrome: 0xf9
number 5 got bad completion with status: 0x5, vendor syndrome: 0xf9
number 6 got bad completion with status: 0x5, vendor syndrome: 0xf9
number 7 got bad completion with status: 0x5, vendor syndrome: 0xf9
number 8 got bad completion with status: 0x5, vendor syndrome: 0xf9
number 9 got bad completion with status: 0x5, vendor syndrome: 0xf9
number 10 got bad completion with status: 0x5, vendor syndrome: 0xf9
QP was created, QP number=0x9a
QP num to be sent = 0x9a
Local LID = 0x0
Remote QP number=0xf8
Remote LID = 0x0
Remote GID =fe:80:00:00:00:00:00:00:9a:f2:b3:ff:fe:c8:b8:d4
QP 0x7f0a640e27b8 state was change to RTS
number 0 got bad completion with status: 0xc, vendor syndrome: 0x81
RDMA Read Failed
q id isread_local
QP number=0x92
number 0 got bad completion with status: 0x5, vendor syndrome: 0xf9
RDMA Read Failed
q id isread_local
QP number=0x92
corrupt bloom filter
thread local qp destroy successfully!
thread local cq destroy successfully!
thread local qp destroy successfully!
thread local cq destroy successfully!
Before running the code, I used the perftest tool to test the RDMA network, and it worked well.
Can you give me some suggestions? I appreciate it!
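For reference, as far as I can tell from `enum ibv_wc_status` in libibverbs, status 0xc is `IBV_WC_RETRY_EXC_ERR` (transport retry counter exceeded) and 0x5 is `IBV_WC_WR_FLUSH_ERR`. So the first entry of each batch looks like the real failure, and the 0x5 entries that follow look like work requests flushed after the QP went into the error state. Below is a minimal Python sketch for decoding these lines from a saved log. The helper name `decode_bad_completions` is hypothetical and not part of TimberSaw, and the table is only a subset of the enum.

```python
import re

# Subset of libibverbs' `enum ibv_wc_status` (numeric values as in <infiniband/verbs.h>).
IBV_WC_STATUS = {
    0x0: "IBV_WC_SUCCESS",
    0x4: "IBV_WC_LOC_PROT_ERR",
    0x5: "IBV_WC_WR_FLUSH_ERR",
    0x9: "IBV_WC_REM_INV_REQ_ERR",
    0xA: "IBV_WC_REM_ACCESS_ERR",
    0xC: "IBV_WC_RETRY_EXC_ERR",
    0xD: "IBV_WC_RNR_RETRY_EXC_ERR",
}

# Matches the "bad completion" lines exactly as they appear in the log above.
BAD_COMPLETION = re.compile(
    r"number (\d+) got bad completion with status: (0x[0-9a-fA-F]+), "
    r"vendor syndrome: (0x[0-9a-fA-F]+)"
)


def decode_bad_completions(log_text):
    """Return (index, status_name, vendor_syndrome) for each bad completion line."""
    results = []
    for match in BAD_COMPLETION.finditer(log_text):
        index = int(match.group(1))
        status = int(match.group(2), 16)
        # Fall back to the raw value for codes missing from the subset above.
        name = IBV_WC_STATUS.get(status, "unknown (0x%x)" % status)
        results.append((index, name, match.group(3)))
    return results


if __name__ == "__main__":
    sample = (
        "number 0 got bad completion with status: 0xc, vendor syndrome: 0x81\n"
        "number 1 got bad completion with status: 0x5, vendor syndrome: 0xf9\n"
    )
    for index, name, syndrome in decode_bad_completions(sample):
        print(index, name, syndrome)
```

The vendor syndrome values (0x81, 0xf9) are NIC-specific, so the sketch passes them through without trying to interpret them.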