Open SmallCoal2001 opened 1 year ago
Please show me the full error log; maybe I can figure out what is happening.
Mark: valgrind socket info1
searching for IB devices in host
found 2 device(s)
device not specified, using first one found: mlx5_0
New MR was registered with addr=0x7faa0b0e1010, lkey=0x1825e4, rkey=0x1825e4, flags=0xf, size=10240000, total registered size is 0
New MR was registered with addr=0x7faa0a71c010, lkey=0x17fcbc, rkey=0x17fcbc, flags=0xf, size=10240000, total registered size is 10240000
SST buffer, send&receive buffer were registered with a
maximum outstanding wr number is32768
maximum query pair number is131072
maximum completion queue number is16777216
maximum memory region number is16777216
maximum memory region size is18446744073709551615
connect to node id 0
QP was created, QP number=0x25d7
DBImpl start
New MR was registered with addr=0x7fa9c1ffe010, lkey=0x33fff, rkey=0x33fff, flags=0xf, size=33554432, total registered size is 1094221824
Memory used up, Initially, allocate new one, memory pool is Version_edit, total memory this pool is 1
RDMA write successfully
communication thread created
DBImpl finished
level 0 file equals 0 marker
Version get garbage collected
version garbage collected.
level 0 file equals 0 marker
Version get garbage collected
version garbage collected.
May be schedule a background task!
DBImpl deallocated
May be schedule a background task!
May be schedule a background task!
Version get garbage collected
version garbage collected.
remained versuins number is 199344864
version garbage collected.
Memtable 0x55d8be288600 deallocated
Total number of entries within the cahce is 0
DBImpl start
RDMA write successfully
communication thread created
DBImpl finished
level 0 file equals 0 marker
Version get garbage collected
version garbage collected.
level 0 file equals 0 marker
Version get garbage collected
version garbage collected.
May be schedule a background task!
The second open finished.
The benchmark start.
validation write finished
start front-end threads
Wait for thread start
total bytes: 1
read byte: 1
sync wait time is 227873
Threads start to run
Add a new file, current immtable number is 1
mark in the ref
May be schedule a background task!
flushing thread pool task queue length 0
Schedule a flushing !
table picked is 1
picked metable number is 1
new file number for flushing is 4
New MR was registered with addr=0x7fa973fff010, lkey=0xcb18, rkey=0xcb18, flags=0xf, size=1073741824, total registered size is 1127776256
Memory used up, Initially, allocate new one, memory pool is FlushBuffer, total memory this pool is 1
New MR was registered with addr=0x7fa933ffe010, lkey=0x1c919, rkey=0x1c919, flags=0xf, size=1073741824, total registered size is 2201518080
Memory used up, Initially, allocate new one, memory pool is IndexChunk, total memory this pool is 1
Add a new file, current immtable number is 2
mark in the ref
May be schedule a background task!
flushing thread pool task queue length 0
Schedule a flushing !
table picked is 1
picked metable number is 1
new file number for flushing is 5
New MR was registered with addr=0x7fa8f3ffd010, lkey=0x20a1a, rkey=0x20a1a, flags=0xf, size=1073741824, total registered size is 3275259904
Memory used up, Initially, allocate new one, memory pool is FilterChunk, total memory this pool is 1
Remote memory registeration, size: 1073741824
polled reply bufferr
QP was created, QP number=0x25d8
QP num to be sent = 0x25d8
Local LID = 0x0
QP was created, QP number=0x25d9
Polling reply buffer
QP num to be sent = 0x25d9
Local LID = 0x0uffer
Remote QP number=0x6a9
Remote LID = 0x0ffer
Remote GID =fe:80:00:00:00:00:00:00:12:70:fd:ff:fe:2f:8f:b4
QP 0x7fa9b8005bd8 state was change to RTS
Remote QP number=0x6aa
Remote LID = 0x0
Remote GID =fe:80:00:00:00:00:00:00:12:70:fd:ff:fe:2f:8f:b4
QP 0x7fa9b40088f8 state was change to RTS
For flush, Total number of key touched is 153846, KV left is 152656
One more local write buffer is added, now 3 total sst offset is 9627984
For flush, Total number of key touched is 153846, KV left is 152722
One more local write buffer is added, now 3 total sst offset is 9618133
BloomFilter block size is 190922
index block size: 36543
start of the this block is0, 20, 3, 0, 0, 0, 0, 0, 0, 43, 210, 1, 377, 377, 377, 377, 377, 377, 377, 0, 303, 77, 0, 20, 4, 0, 0, 0, 0, 0,
BloomFilter block size is 190922
index block size: 36444
start of the this block is0, 31, 3, 0, 0, 0, 0, 0, 0, 0, 241, 60, 60, 60, 60, 60, 60, 60, 60, 166, 1, 241, 0, 0, 0, 0, 0, 0, 0, 303,
Add a new file, current immtable number is 3
mark in the ref
May be schedule a background task!
flushing thread pool task queue length 0
Schedule a flushing !
table picked is 1
picked metable number is 1
new file number for flushing is 6
QP was created, QP number=0x25da
QP num to be sent = 0x25da
Local LID = 0x0
Remote QP number=0x6ab
Remote LID = 0x0
Remote GID =fe:80:00:00:00:00:00:00:12:70:fd:ff:fe:2f:8f:b4
QP 0x7fa8ec005bb8 state was change to RTS
For flush, Total number of key touched is 153846, KV left is 152630
One more local write buffer is added, now 3 total sst offset is 9626346
BloomFilter block size is 190922
index block size: 36559
start of the this block is0, 30, 3, 0, 0, 0, 0, 0, 0, 41, 371, 60, 60, 60, 60, 60, 60, 60, 60, 1, 337, 350, 5, 0, 0, 0, 0, 0, 303, 77,
Add a new file, current immtable number is 4
mark in the ref
May be schedule a background task!
flushing thread pool task queue length 0
Schedule a flushing !
table picked is 1
picked metable number is 1
new file number for flushing is 7
QP was created, QP number=0x25db
QP num to be sent = 0x25db
Local LID = 0x0
Remote QP number=0x6ac
Remote LID = 0x0
Remote GID =fe:80:00:00:00:00:00:00:12:70:fd:ff:fe:2f:8f:b4
QP 0x7fa8e4005bb8 state was change to RTS
For flush, Total number of key touched is 153846, KV left is 152749
One more local write buffer is added, now 3 total sst offset is 9633852
BloomFilter block size is 191050
index block size: 36624
start of the this block is0, 20, 3, 0, 0, 0, 0, 0, 0, 40, 135, 1, 377, 377, 377, 377, 377, 377, 377, 0, 303, 77, 0, 20, 4, 0, 0, 0, 0, 0,
Add a new file, current immtable number is 5
mark in the ref
May be schedule a background task!
flushing thread pool task queue length 0
Schedule a flushing !
number 0 got bad completion with status: 0xc, vendor syndrome: 0x81
db_bench: /home/zqy2023/dLSM/util/rdma.cc:2599: int dLSM::RDMA_Manager::poll_completion(ibv_wc*, int, std::string, bool, uint8_t): Assertion `false' failed.
Aborted (core dumped)
gdb db_bench core
assertion=0x55d8bd531ad1 "false", file=0x55d8bd531c68 "/home/zqy2023/dLSM/util/rdma.cc", line=2599,
function=<optimized out>) at assert.c:92
file=0x55d8bd531c68 "/home/zqy2023/dLSM/util/rdma.cc", line=2599,
function=0x55d8bd532b48 "int dLSM::RDMA_Manager::poll_completion(ibv_wc*, int, std::string, bool, uint8_t)")
at assert.c:101
qp_type="write_local_flush", send_cq=true, target_node_id=0 '\000') at /home/zqy2023/dLSM/util/rdma.cc:2599
at /home/zqy2023/dLSM/table/table_builder_computeside.cc:618
0x55d8bd582ce0 <dLSM::Env::Default()::env_container>, options=..., table_cache=0x55d8be2890e0, iter=0x7fa9b8000cc0,
meta=std::shared_ptr<dLSM::RemoteMemTableMetaData> (use count 2, weak count 0) = {...}, type=dLSM::Flush,
target_node_id=0 '\000') at /home/zqy2023/dLSM/db/memtable_list.cc:892
at /home/zqy2023/dLSM/db/db_impl.cc:791
__functor=..., __args#0=@0x7faa08a4ed20: 0x55d8be289c40) at /usr/include/c++/9/bits/std_function.h:300
at /usr/include/c++/9/bits/std_function.h:688
at /home/zqy2023/dLSM/./util/ThreadPool.h:74
Hello, have you solved the problem?
We have successfully run your code in a standalone setup, but when we run it across two machines, the compute node fails. In poll_completion(), the compute node repeatedly prints "number 0 got bad completion with status: 0xc, vendor syndrome: 0x81", and then the memory node reports "RDMA write failed". We know that the call chain is "dLSM::DBImpl::BackgroundFlush()->dLSM::DBImpl::CompactMemTable()->dLSM::DBImpl::WriteLevel0Table()->dLSM::FlushJob::BuildTable()->dLSM::TableBuilder_ComputeSide::Finish()->dLSM::RDMA_Manager::poll_completion". How can we fix this bug?
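For anyone reading the log above, it may help to turn the numeric status into its symbolic name. Below is a minimal sketch, assuming the printed status is the raw `ibv_wc_status` value from the work completion. The helpers `WcStatusName`, `DescribeBadCompletion` and `CheckIssueStatus` are hypothetical and not part of dLSM, and the name table is a local copy of the enum order in libibverbs' `<infiniband/verbs.h>`; code that links against libibverbs would normally call `ibv_wc_status_str()` instead.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

// Local copy of the ibv_wc_status names, in the order the enum is declared in
// <infiniband/verbs.h> (index == numeric value). Newer libibverbs releases
// append further values after IBV_WC_GENERAL_ERR; they are omitted here.
constexpr std::array<const char*, 22> kWcStatusNames = {
    "IBV_WC_SUCCESS",            "IBV_WC_LOC_LEN_ERR",
    "IBV_WC_LOC_QP_OP_ERR",      "IBV_WC_LOC_EEC_OP_ERR",
    "IBV_WC_LOC_PROT_ERR",       "IBV_WC_WR_FLUSH_ERR",
    "IBV_WC_MW_BIND_ERR",        "IBV_WC_BAD_RESP_ERR",
    "IBV_WC_LOC_ACCESS_ERR",     "IBV_WC_REM_INV_REQ_ERR",
    "IBV_WC_REM_ACCESS_ERR",     "IBV_WC_REM_OP_ERR",
    "IBV_WC_RETRY_EXC_ERR",      "IBV_WC_RNR_RETRY_EXC_ERR",
    "IBV_WC_LOC_RDD_VIOL_ERR",   "IBV_WC_REM_INV_RD_REQ_ERR",
    "IBV_WC_REM_ABORT_ERR",      "IBV_WC_INV_EECN_ERR",
    "IBV_WC_INV_EEC_STATE_ERR",  "IBV_WC_FATAL_ERR",
    "IBV_WC_RESP_TIMEOUT_ERR",   "IBV_WC_GENERAL_ERR",
};

// Hypothetical helper: maps a numeric completion status to its enum name.
inline const char* WcStatusName(std::uint32_t status) {
  return status < kWcStatusNames.size() ? kWcStatusNames[status]
                                        : "unknown status";
}

// Hypothetical helper: builds a message like the one in the log, with the
// symbolic status name added. The vendor syndrome is device specific and is
// printed as-is, without interpretation.
inline std::string DescribeBadCompletion(std::uint32_t status,
                                         std::uint32_t vendor_err) {
  std::ostringstream out;
  out << "status 0x" << std::hex << status << " (" << WcStatusName(status)
      << "), vendor syndrome 0x" << vendor_err;
  return out.str();
}

// Usage example: the status reported in this issue is 0xc (decimal 12).
inline void CheckIssueStatus() {
  assert(std::string(WcStatusName(0xc)) == "IBV_WC_RETRY_EXC_ERR");
}
```

Under that assumption, status 0xc is `IBV_WC_RETRY_EXC_ERR` (transport retry counter exceeded), meaning the requester received no acknowledgement from the remote QP. Since the failure only appears between two machines, and the log shows `Local LID = 0x0` with a link-local `fe:80:...` remote GID, it may be worth checking the exchanged QP numbers, the GID index and the network path between the nodes. This is a guess from the log, not a confirmed diagnosis.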