Open gapisback opened 2 years ago
Updates: Curiously, this does not seem to be an out-of-disk-space error. See this:
sdb-oss-test-vm:[12] $ df -kh .
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 128G 44G 85G 34% /
sdb-oss-test-vm:[13] $ du -sh *.db
128K pll-perf.async.db
And the database file specified for the test, --db-location pll-perf.async.db,
is only 128K, so not much allocation has happened to grow this file.
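For what it's worth, stat can confirm whether the file is sparse by comparing its apparent size against the blocks actually allocated (a hypothetical follow-up command, not output from the original run):

stat --format='apparent: %s bytes, allocated: %b blocks of %B bytes' pll-perf.async.db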
I suspect this is some sort of deadlock. We haven't exercised the async lookup code that much recently, and I bet there's something there.
I think bugs like this one, involving async lookups and bg threads, are non-critical. We should document these features as experimental until we can work out the bugs.
I will add it to our limitations doc.
Update: Attempted to repro this on /main @ SHA 4b6e0b1. Some of the cmdline args used in the initial test-case invocation are no longer supported: --num-bg-threads 20, --max-async-inflight 10, --num-pthreads 20. These seem to have been reworked or removed in past change sets.
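A git pickaxe search should locate the change sets that dropped those flags (a sketch; the exact search string depends on how each flag is spelled in the test code):

git log --oneline -S 'num-bg-threads'
git log --oneline -S 'max-async-inflight'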
There still seems to be a problem. I tried re-running with an increased --cache-capacity-gib, but all 10 cores on this VM get pegged at 100%. Here are the runs I tried:
545 ./build/release/bin/driver_test splinter_test --parallel-perf --num-normal-bg-threads 10 --num-memtable-bg-threads 10 --max-async-inflight 10 --num-pthreads 20 --db-capacity-gib 60 --db-location pll-perf.async.db
546 ./build/release/bin/driver_test splinter_test --parallel-perf --num-normal-bg-threads 10 --num-memtable-bg-threads 10 --num-pthreads 20 --db-capacity-gib 60 --db-location pll-perf.async.db
547 ./build/release/bin/driver_test splinter_test --parallel-perf --num-normal-bg-threads 10 --num-memtable-bg-threads 10 --db-capacity-gib 60 --db-location pll-perf.async.db
548 ./build/release/bin/driver_test splinter_test --parallel-perf --num-normal-bg-threads 10 --num-memtable-bg-threads 10 --db-capacity-gib 60 --db-location pll-perf.async.db --cache-capacity-gib 2
549 ./build/release/bin/driver_test splinter_test --parallel-perf --num-normal-bg-threads 10 --num-memtable-bg-threads 10 --db-capacity-gib 60 --db-location pll-perf.async.db --cache-capacity-gib 4
The issue may simply be incorrect command-line args configuring threads and so on. Still, it's odd that the cores are pegged and the test does not seem to make any progress. (I only tried for a few minutes.)
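One quick way to tell a livelock (threads spinning) from a true deadlock (threads blocked) would be to inspect the stuck process directly; a sketch, assuming gdb is installed and <pid> is the driver_test process:

top -H -p <pid>                                # which threads are burning CPU
gdb -p <pid> -batch -ex 'thread apply all bt'  # per-thread stacks: spinning vs. blocked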
Here's a perf top profile captured while this run was happening:
Samples: 5M of event 'cpu-clock:pppH', 4000 Hz, Event count (approx.): 1845764265 lost: 0/0 drop: 0/0
Overhead Shared Object Symbol
90.38% libsplinterdb.so [.] clockcache_get
9.08% libsplinterdb.so [.] clockcache_get_write
0.32% libsplinterdb.so [.] clockcache_try_get_read.constprop.0
0.15% ld-linux-x86-64.so.2 [.] __tls_get_addr
0.04% libsplinterdb.so [.] 0x000000000000d140
0.01% [kernel] [k] zap_pte_range
0.00% [kernel] [k] free_unref_page_list
0.00% [kernel] [k] __lock_text_start
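With clockcache_get dominating the samples, this looks consistent with threads spinning in the cache-get path rather than making progress. Capturing call graphs would show whether these are many distinct gets or a few gets retrying forever; a sketch, assuming perf with DWARF unwinding (useful on an OPT build) and <pid> being the stuck process:

perf record --call-graph dwarf -p <pid> -- sleep 10
perf report --stdio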
This is on /main. An invocation of the following multi-threaded performance test seems to run endlessly and made no progress when left to run overnight. Nearly 20 cores are running at 100% CPU on this 32-core VM.
Some interesting stacks from a few pids are shown below. (This is an OPT build.)
A few seconds later it changed to:
Stack of another process: