ruihong123 / dLSM

dLSM: An LSM-Based Index for RDMA-Enabled Memory Disaggregation
BSD 3-Clause "New" or "Revised" License

How to enable the code on multiple servers? #3

Open SmallCoal2001 opened 1 year ago

SmallCoal2001 commented 1 year ago

We have successfully run your code in a stand-alone setup, but when we run it across two machines, the compute node fails. In the function poll_completion(), the compute node repeatedly reports "number 0 got bad completion with status: 0xc, vendor syndrome: 0x81", and then the memory node reports "RDMA write failed". The call chain is "dLSM::DBImpl::BackgroundFlush()->dLSM::DBImpl::CompactMemTable()->dLSM::DBImpl::WriteLevel0Table()->dLSM::FlushJob::BuildTable()->dLSM::TableBuilder_ComputeSide::Finish()->dLSM::RDMA_Manager::poll_completion()". How can we fix this?
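For reference, the status value 0xc (decimal 12) in the message above is the twelfth entry of the `ibv_wc_status` enum in `<infiniband/verbs.h>`, `IBV_WC_RETRY_EXC_ERR` (transport retry counter exceeded). A minimal sketch of decoding such a code follows. `WcStatusName` is a hypothetical helper, not part of dLSM, and the name table is copied from the enum order so the sketch compiles without libibverbs. Code that already links libibverbs can call `ibv_wc_status_str()` instead.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper (not part of dLSM): maps a work-completion status code
// to its enumerator name. The order mirrors `enum ibv_wc_status` in
// <infiniband/verbs.h>, copied here so the sketch builds without libibverbs.
inline std::string WcStatusName(uint32_t status) {
  static const char* const kNames[] = {
      "IBV_WC_SUCCESS",            // 0x0
      "IBV_WC_LOC_LEN_ERR",        // 0x1
      "IBV_WC_LOC_QP_OP_ERR",      // 0x2
      "IBV_WC_LOC_EEC_OP_ERR",     // 0x3
      "IBV_WC_LOC_PROT_ERR",       // 0x4
      "IBV_WC_WR_FLUSH_ERR",       // 0x5
      "IBV_WC_MW_BIND_ERR",        // 0x6
      "IBV_WC_BAD_RESP_ERR",       // 0x7
      "IBV_WC_LOC_ACCESS_ERR",     // 0x8
      "IBV_WC_REM_INV_REQ_ERR",    // 0x9
      "IBV_WC_REM_ACCESS_ERR",     // 0xa
      "IBV_WC_REM_OP_ERR",         // 0xb
      "IBV_WC_RETRY_EXC_ERR",      // 0xc
      "IBV_WC_RNR_RETRY_EXC_ERR",  // 0xd
  };
  constexpr uint32_t kCount = sizeof(kNames) / sizeof(kNames[0]);
  // Codes beyond the copied prefix of the enum are reported as unknown.
  return status < kCount ? kNames[status] : "unknown";
}
```

Calling `WcStatusName(0xc)` yields `IBV_WC_RETRY_EXC_ERR`, which generally means the requester received no acknowledgement from the remote queue pair within its retry budget.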

ruihong123 commented 1 year ago

Please show me the whole error log; maybe I can figure out what is happening.

SmallCoal2001 commented 1 year ago

Mark: valgrind socket info1
searching for IB devices in host
found 2 device(s)
device not specified, using first one found: mlx5_0
New MR was registered with addr=0x7faa0b0e1010, lkey=0x1825e4, rkey=0x1825e4, flags=0xf, size=10240000, total registered size is 0
New MR was registered with addr=0x7faa0a71c010, lkey=0x17fcbc, rkey=0x17fcbc, flags=0xf, size=10240000, total registered size is 10240000
SST buffer, send&receive buffer were registered with a maximum outstanding wr number is32768
maximum query pair number is131072
maximum completion queue number is16777216
maximum memory region number is16777216
maximum memory region size is18446744073709551615
connect to node id 0QP was created, QP number=0x25d7

Local LID = 0x0 total bytes: 23read byte: 23Remote QP number = 0x6a8 Remote LID = 0x0 Remote GID =fe:80:00:00:00:00:00:00:12:70:fd:ff:fe:2f:8f:b4 QP 0x7faa040022b8 state was change to RTS total bytes: 1read byte: 1Finish the connection with node 0 New MR was registered with addr=0x7fa9c3fff010, lkey=0xac17, rkey=0xac17, flags=0xf, size=1073741824, total registered size is 20480000 dLSM: version 1.22 Date: Fri Aug 18 03:19:54 2023 Start to sync options client handling thread CPU: 80 * Intel(R) Xeon(R) Gold 5218R CPU @ 2.10GHz CPUCache: Keys: 16 bytes each Values: 32 bytes each (16 bytes after compression) Entries: 10000000 RawSize: 457.8 MB (estimated) FileSize: 305.2 MB (estimated) WARNING: Optimization is disabled: benchmarks unnecessarily slow WARNING: Assertions are enabled; benchmarks unnecessarily slow WARNING: Snappy compression is not enabled

DBImpl start New MR was registered with addr=0x7fa9c1ffe010, lkey=0x33fff, rkey=0x33fff, flags=0xf, size=33554432, total registered size is 1094221824 Memory used up, Initially, allocate new one, memory pool is Version_edit, total memory this pool is 1 RDMA write successfully communication thread created DBImpl finished level 0 file equals 0 marker Version get garbage collected version garbage collected. level 0 file equals 0 marker Version get garbage collected version garbage collected. May be schedule a background task! DBImpl deallocated May be schedule a background task! May be schedule a background task! Version get garbage collected version garbage collected. remained versuins number is 199344864version garbage collected. Memtable 0x55d8be288600 deallocated Total number of entries within the cahce is 0DBImpl start RDMA write successfully communication thread created DBImpl finished level 0 file equals 0 marker Version get garbage collected version garbage collected. level 0 file equals 0 marker Version get garbage collected version garbage collected. May be schedule a background task! The second open finished. The benchmark start. validation write finished start front-end threads Wait for thread start total bytes: 1read byte: 1sync wait time is 227873Threads start to run Add a new file, current immtable number is 1mark in the ref May be schedule a background task! flushing thread pool task queue length 0 Schedule a flushing ! 
table picked is 1picked metable number is 1new file number for flushing is 4 New MR was registered with addr=0x7fa973fff010, lkey=0xcb18, rkey=0xcb18, flags=0xf, size=1073741824, total registered size is 1127776256 Memory used up, Initially, allocate new one, memory pool is FlushBuffer, total memory this pool is 1 New MR was registered with addr=0x7fa933ffe010, lkey=0x1c919, rkey=0x1c919, flags=0xf, size=1073741824, total registered size is 2201518080 Memory used up, Initially, allocate new one, memory pool is IndexChunk, total memory this pool is 1 Add a new file, current immtable number is 2mark in the ref May be schedule a background task! flushing thread pool task queue length 0 Schedule a flushing ! table picked is 1picked metable number is 1new file number for flushing is 5 New MR was registered with addr=0x7fa8f3ffd010, lkey=0x20a1a, rkey=0x20a1a, flags=0xf, size=1073741824, total registered size is 3275259904 Memory used up, Initially, allocate new one, memory pool is FilterChunk, total memory this pool is 1 Remote memory registeration, size: 1073741824 polled reply bufferr QP was created, QP number=0x25d8

QP num to be sent = 0x25d8 Local LID = 0x0 QP was created, QP number=0x25d9 Polling reply buffer QP num to be sent = 0x25d9 Local LID = 0x0uffer Remote QP number=0x6a9 Remote LID = 0x0ffer Remote GID =fe:80:00:00:00:00:00:00:12:70:fd:ff:fe:2f:8f:b4 QP 0x7fa9b8005bd8 state was change to RTS Remote QP number=0x6aa Remote LID = 0x0 Remote GID =fe:80:00:00:00:00:00:00:12:70:fd:ff:fe:2f:8f:b4 QP 0x7fa9b40088f8 state was change to RTS For flush, Total number of key touched is 153846, KV left is 152656 One more local write buffer is added, now 3 total sst offset is 9627984 For flush, Total number of key touched is 153846, KV left is 152722 One more local write buffer is added, now 3 total sst offset is 9618133 BloomFilter block size is 190922index block size: 36543 start of the this block is0, 20, 3, 0, 0, 0, 0, 0, 0, 43, 210, 1, 377, 377, 377, 377, 377, 377, 377, 0, 303, 77, 0, 20, 4, 0, 0, 0, 0, 0, BloomFilter block size is 190922index block size: 36444 start of the this block is0, 31, 3, 0, 0, 0, 0, 0, 0, 0, 241, 60, 60, 60, 60, 60, 60, 60, 60, 166, 1, 241, 0, 0, 0, 0, 0, 0, 0, 303, Add a new file, current immtable number is 3mark in the ref May be schedule a background task! flushing thread pool task queue length 0 Schedule a flushing ! table picked is 1picked metable number is 1new file number for flushing is 6 QP was created, QP number=0x25da

QP num to be sent = 0x25da Local LID = 0x0 Remote QP number=0x6ab Remote LID = 0x0 Remote GID =fe:80:00:00:00:00:00:00:12:70:fd:ff:fe:2f:8f:b4 QP 0x7fa8ec005bb8 state was change to RTS For flush, Total number of key touched is 153846, KV left is 152630 One more local write buffer is added, now 3 total sst offset is 9626346 BloomFilter block size is 190922index block size: 36559 start of the this block is0, 30, 3, 0, 0, 0, 0, 0, 0, 41, 371, 60, 60, 60, 60, 60, 60, 60, 60, 1, 337, 350, 5, 0, 0, 0, 0, 0, 303, 77, Add a new file, current immtable number is 4mark in the ref May be schedule a background task! flushing thread pool task queue length 0 Schedule a flushing ! table picked is 1picked metable number is 1new file number for flushing is 7 QP was created, QP number=0x25db

QP num to be sent = 0x25db
Local LID = 0x0
Remote QP number=0x6ac
Remote LID = 0x0
Remote GID =fe:80:00:00:00:00:00:00:12:70:fd:ff:fe:2f:8f:b4
QP 0x7fa8e4005bb8 state was change to RTS
For flush, Total number of key touched is 153846, KV left is 152749
One more local write buffer is added, now 3 total sst offset is 9633852
BloomFilter block size is 191050index block size: 36624
start of the this block is0, 20, 3, 0, 0, 0, 0, 0, 0, 40, 135, 1, 377, 377, 377, 377, 377, 377, 377, 0, 303, 77, 0, 20, 4, 0, 0, 0, 0, 0,
Add a new file, current immtable number is 5mark in the ref
May be schedule a background task!
flushing thread pool task queue length 0
Schedule a flushing !
number 0 got bad completion with status: 0xc, vendor syndrome: 0x81
db_bench: /home/zqy2023/dLSM/util/rdma.cc:2599: int dLSM::RDMA_Manager::poll_completion(ibv_wc*, int, std::string, bool, uint8_t): Assertion `false' failed.
Aborted (core dumped)

gdb db_bench core

#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50

#1 0x00007faa0bc51859 in __GI_abort () at abort.c:79

#2 0x00007faa0bc51729 in __assert_fail_base (fmt=0x7faa0bde7588 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n",
    assertion=0x55d8bd531ad1 "false", file=0x55d8bd531c68 "/home/zqy2023/dLSM/util/rdma.cc", line=2599,
    function=<optimized out>) at assert.c:92

#3 0x00007faa0bc62fd6 in __GI___assert_fail (assertion=0x55d8bd531ad1 "false",
    file=0x55d8bd531c68 "/home/zqy2023/dLSM/util/rdma.cc", line=2599,
    function=0x55d8bd532b48 "int dLSM::RDMA_Manager::poll_completion(ibv_wc*, int, std::string, bool, uint8_t)")
    at assert.c:101

#4 0x000055d8bd4f9b34 in dLSM::RDMA_Manager::poll_completion (this=0x55d8bd9f5170, wc_p=0x7faa08a4e750, num_entries=4,
    qp_type="write_local_flush", send_cq=true, target_node_id=0 '\000') at /home/zqy2023/dLSM/util/rdma.cc:2599

#5 0x000055d8bd4e1c9d in dLSM::TableBuilder_ComputeSide::Finish (this=0x7fa9b8000ca0)
    at /home/zqy2023/dLSM/table/table_builder_computeside.cc:618

#6 0x000055d8bd49f3cc in dLSM::FlushJob::BuildTable (this=0x7faa08a4eb30, dbname="/tmp/dLSMtest-1010/dbbench", env=
    0x55d8bd582ce0 <dLSM::Env::Default()::env_container>, options=..., table_cache=0x55d8be2890e0, iter=0x7fa9b8000cc0,
    meta=std::shared_ptr<dLSM::RemoteMemTableMetaData> (use count 2, weak count 0) = {...}, type=dLSM::Flush,
    target_node_id=0 '\000') at /home/zqy2023/dLSM/db/memtable_list.cc:892

#7 0x000055d8bd46cd2e in dLSM::DBImpl::WriteLevel0Table (this=0x55d8be288600, job=0x7faa08a4eb30, edit=0x7faa08a4ebc0)
    at /home/zqy2023/dLSM/db/db_impl.cc:791

#8 0x000055d8bd46cffd in dLSM::DBImpl::CompactMemTable (this=0x55d8be288600) at /home/zqy2023/dLSM/db/db_impl.cc:997

#9 0x000055d8bd46da4c in dLSM::DBImpl::BackgroundFlush (this=0x55d8be288600, p=0x0) at /home/zqy2023/dLSM/db/db_impl.cc:1220

#10 0x000055d8bd46d902 in dLSM::DBImpl::BGWork_Flush (thread_arg=0x55d8be289c40) at /home/zqy2023/dLSM/db/db_impl.cc:1182

#11 0x000055d8bd4ce2c6 in std::_Function_handler<void (void), void ()(void)>::_M_invoke(std::_Any_data const&, void&&) (
    __functor=..., __args#0=@0x7faa08a4ed20: 0x55d8be289c40) at /usr/include/c++/9/bits/std_function.h:300

#12 0x000055d8bd4cb9d9 in std::function<void (void)>::operator()(void) const (this=0x7faa08a4ed90, __args#0=0x55d8be289c40)
    at /usr/include/c++/9/bits/std_function.h:688

#13 0x000055d8bd4c9fa4 in dLSM::ThreadPool::BGThread (this=0x55d8bd582dc0 <dLSM::Env::Default()::env_container+224>)
    at /home/zqy2023/dLSM/./util/ThreadPool.h:74

#14 0x000055d8bd4d5282 in std::__invoke_impl<void, void (dLSM::ThreadPool::)(), dLSM::ThreadPool>
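The log above reports `Local LID = 0x0` together with a remote GID of `fe:80:00:00:00:00:00:00:12:70:fd:ff:fe:2f:8f:b4`. The `ff:fe` bytes in the middle of the interface identifier look like a modified EUI-64 value derived from an Ethernet MAC address, which would suggest a RoCE link rather than native InfiniBand. That reading is an assumption, not something the log states. The sketch below shows the arithmetic under that assumption; `MacFromLinkLocalGid` is a hypothetical helper and is not part of dLSM or libibverbs.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <sstream>
#include <string>

// Converts one hex digit to its value, or -1 if it is not a hex digit.
inline int HexVal(char c) {
  if (c >= '0' && c <= '9') return c - '0';
  if (c >= 'a' && c <= 'f') return c - 'a' + 10;
  if (c >= 'A' && c <= 'F') return c - 'A' + 10;
  return -1;
}

// Hypothetical helper: given a GID printed as 16 colon-separated hex bytes,
// returns the MAC address it would encode if it is a link-local (fe80::/64)
// GID built with modified EUI-64. Returns an empty string otherwise.
inline std::string MacFromLinkLocalGid(const std::string& text) {
  uint8_t gid[16];
  std::size_t n = 0;
  std::istringstream in(text);
  std::string part;
  while (std::getline(in, part, ':')) {
    if (n == 16 || part.size() != 2) return "";
    const int hi = HexVal(part[0]);
    const int lo = HexVal(part[1]);
    if (hi < 0 || lo < 0) return "";
    gid[n++] = static_cast<uint8_t>(hi * 16 + lo);
  }
  if (n != 16) return "";
  // Require the fe80::/64 link-local prefix.
  if (gid[0] != 0xfe || gid[1] != 0x80) return "";
  for (int i = 2; i < 8; ++i) {
    if (gid[i] != 0) return "";
  }
  // Modified EUI-64 inserts ff:fe between the two halves of the MAC.
  if (gid[11] != 0xff || gid[12] != 0xfe) return "";
  char mac[18];
  // The universal/local bit (0x02) of the first byte is flipped back.
  std::snprintf(mac, sizeof(mac), "%02x:%02x:%02x:%02x:%02x:%02x",
                static_cast<unsigned>(gid[8] ^ 0x02),
                static_cast<unsigned>(gid[9]),
                static_cast<unsigned>(gid[10]),
                static_cast<unsigned>(gid[13]),
                static_cast<unsigned>(gid[14]),
                static_cast<unsigned>(gid[15]));
  return mac;
}
```

Applied to the remote GID from the log, the helper returns `10:70:fd:2f:8f:b4`, which can be compared against the MAC address of the memory node's network interface to check the assumption.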

dongzhangqi7 commented 10 months ago

Hello, did you solve the problem?