gapisback opened 1 year ago
Have narrowed down the repro as follows: re-applied relevant commits from `agurajada/560-rf-add-bug` to commits off of `/main`:

- `agurajada/rf-add-f0af570-baseline`: branched off of `/main` @ SHA f0af570 - repro succeeds
- `agurajada/rf-add-fc897e4c-fail`: branched off of `/main` @ SHA fc897e4c - repro fails

Debugging ...
Updated (4/1/2023):
I pushed branch `agurajada/rf-add-fc897e4c-fail`, which adds a couple more variations of the basic test case. You have to run these with a large number of inserts, e.g. 20M:

```shell
$ ./build/release/bin/unit/large_inserts_stress_test --verbose-progress --num-inserts 20000000 test_560_seq_host_endian32_key_random_5byte_values_inserts
```
Here's a really easy diff against `main` to reproduce this:
```diff
$ git diff
diff --git a/tests/unit/splinterdb_stress_test.c b/tests/unit/splinterdb_stress_test.c
index 87cf659..32d7470 100644
--- a/tests/unit/splinterdb_stress_test.c
+++ b/tests/unit/splinterdb_stress_test.c
@@ -18,8 +18,8 @@
 #include "../functional/random.h"
 #include "ctest.h" // This is required for all test-case files.
 
-#define TEST_KEY_SIZE   20
-#define TEST_VALUE_SIZE 116
+#define TEST_KEY_SIZE   4
+#define TEST_VALUE_SIZE 6
 
 // Function Prototypes
 static void *
@@ -73,7 +73,7 @@ CTEST2(splinterdb_stress, test_random_inserts_concurrent)
    ASSERT_TRUE(random_data >= 0);
 
    worker_config wcfg = {
-      .num_inserts = 1000 * 1000,
+      .num_inserts = 20 * 1000 * 1000,
       .random_data = random_data,
       .kvsb        = data->kvsb,
    };
```
We did some more experimentation today, and it appears that tuples with a 4-byte key and a 20-byte value will pass, but anything smaller will not.
Here is some more evidence of the behaviours we see for this existing stress-test when run off of `/main` @ SHA 89f09b3:
For a single-thread execution:

```c
#define TEST_KEY_SIZE   4
#define TEST_VALUE_SIZE 16
#define num_threads     1

data->cfg = (splinterdb_config){.filename   = TEST_DB_NAME,
                                .cache_size = 1000 * Mega,
                                .disk_size  = 30000 * Mega,
                                .data_cfg   = &data->default_data_config,
                                .queue_scale_percent = 100};
```
The test progresses to about `Thread 140002820638272 has completed 10200000 inserts` and then hangs. The `perf` profile looks like:
```
Samples: 1M of event 'cpu-clock:pppH', 4000 Hz, Event count (approx.): 16012963972 lost: 0/0 drop: 0/0
Overhead  Shared Object         Symbol
  46.09%  libsplinterdb.so      [.] clockcache_try_get_read.constprop.0
  43.10%  libsplinterdb.so      [.] clockcache_get
   5.85%  ld-linux-x86-64.so.2  [.] __tls_get_addr
   1.99%  libsplinterdb.so      [.] 0x000000000000d150
   1.02%  libsplinterdb.so      [.] 0x000000000000cf60
   0.99%  libsplinterdb.so      [.] rc_allocator_get_config
   0.96%  libsplinterdb.so      [.] rc_allocator_get_config_virtual
   0.00%  [kernel]              [k] exit_to_user_mode_loop
```
The same test with the same config as above will pass for this combo (again, single thread): `TEST_KEY_SIZE 4; TEST_VALUE_SIZE 20`.
--- Re-run with 4 threads, key = 4 bytes, value = 16 bytes:

This will run for a while, up to this point:
```
sdb-fdb-build:[43] $ VERBOSE=6 ./build/release/bin/unit/splinterdb_stress_test test_random_inserts_concurrent
Running 1 CTests, suite name 'splinterdb_stress', test case 'test_random_inserts_concurrent'.
TEST 1/1 splinterdb_stress:test_random_inserts_concurrent Waiting for 4 worker threads ...
  Thread[0] ID=140504240404032
  Thread[1] ID=140504232011328
  Thread[2] ID=140504223618624
  Thread[3] ID=140504215225920
Writing lots of data from thread 140504232011328
[...]
Thread 140504240404032 has completed 3300000 inserts
Thread 140504215225920 has completed 3300000 inserts
Thread 140504223618624 has completed 3600000 inserts
```
And then it will hang, with this profile:
```
Samples: 93K of event 'cpu-clock:pppH', 4000 Hz, Event count (approx.): 20538871086 lost: 0/0 drop: 0/0
Overhead  Shared Object         Symbol
  73.76%  libsplinterdb.so      [.] memtable_maybe_rotate_and_get_insert_lock
  11.95%  libsplinterdb.so      [.] clockcache_try_get_read.constprop.0
  10.88%  libsplinterdb.so      [.] clockcache_get
   1.59%  ld-linux-x86-64.so.2  [.] __tls_get_addr
   0.53%  libsplinterdb.so      [.] 0x000000000000d150
   0.38%  libsplinterdb.so      [.] clockcache_unget
   0.29%  libsplinterdb.so      [.] rc_allocator_get_config_virtual
   0.27%  libsplinterdb.so      [.] 0x000000000000cf60
   0.26%  libsplinterdb.so      [.] rc_allocator_get_config
```
This specific problem was encountered while benchmarking a single client inserting 20M rows (in the PG-SplinterDB integration code base).
Update (agurajada; 4/2023): After deeper investigation of variations of this repro, the basic issue seems to be that there is some instability in the library on `/main` when using small key-value pairs. After some experimentation, it appears that we can stably insert 20+M rows, using single or multiple threads, when key = 4 bytes and value >= 20 bytes. For smaller k/v pair sizes, different forms of instability are seen. We really need a comprehensive test-suite that can exercise these basic insert workloads across different numbers of inserted rows, numbers of clients, and varying combinations of K/V pair sizes.

This has been reproduced using a standalone test script off of `/main` @ SHA 89f09b3. Branch: `agurajada/560-rf-add-bug` (has another commit, which fixes another issue, partially tracked by the failures in the repros attempted for issue #458; you need that fix in order to get far enough to repro this failure).
The failure is seen in this test from that branch:

```shell
large_inserts_stress_test --num-inserts 20000000 test_560_seq_htobe32_key_random_6byte_values_inserts
```

With a release build, you will get a seg-fault a few lines later.
The test case `test_560_seq_htobe32_key_random_6byte_values_inserts()` has the exact conditions required to trigger this bug. There are other test cases in this test file that invoke different combinations of sequential / random key inserts, and they all seem to succeed.