vmware / splinterdb

High Performance Embedded Key-Value Store
https://splinterdb.org
Apache License 2.0

driver_test cache_test --async --cache-capacity-gib 20 --db-capacity-gib 100 fails silently. Debugger reports SIGKILL. #306

Open gapisback opened 2 years ago

gapisback commented 2 years ago

On /main, this fails silently, and it is hard to tell where it is failing:

Fusion-LocalVM:[63] $ ./bin/driver_test cache_test --async --cache-capacity-gib 20 --db-capacity-gib 100
./bin/driver_test: splinterdb_build_version a2768f5d-dirty
Dispatch test cache_test

Started cache_test!!
fingerprint_size: 27
cache_test: async test started with 1+0 threads (ws=10%)
cache_test: 10485760 pages allocated
test  36% completeKilled

The same silent failure occurs in a DEBUG build. Under the debugger, gdb reports that the program terminated with SIGKILL. I am not sure where this is happening, and I could not get a backtrace.

(gdb) run cache_test --async --cache-capacity-gib 20 --db-capacity-gib 100
Starting program: /home/agurajada/Code/splinterdb/bin/driver_test cache_test --async --cache-capacity-gib 20 --db-capacity-gib 100
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
/home/agurajada/Code/splinterdb/bin/driver_test: splinterdb_build_version a2768f5d-dirty
Dispatch test cache_test

Started cache_test!!
fingerprint_size: 27
cache_test: async test started with 1+0 threads (ws=10%)
cache_test: 10485760 pages allocated
[New Thread 0x7ffade44a700 (LWP 13303)]
test  36% complete[Thread 0x7ffff7b94740 (LWP 13223) exited]

Program terminated with signal SIGKILL, Killed.
The program no longer exists.

rosenhouse commented 2 years ago

On my Nimbus VM with 130GB RAM, I cannot reproduce this error.

I suggest you try re-running with a smaller cache -- if your VM has less than 20GB RAM, then this is likely an out of memory error.

gapisback commented 2 years ago

@rosenhouse - My Nimbus VM has 64 GB RAM, and the test should be running with default cache size, which is --cache-capacity-gib (1).

The test itself is not fully reporting what cache it was using, and I'm assuming it's using the default.

My issue is not just that the test ran into an OOM (if that). I am trying to see if we can chase down these silent errors, and better report things like SIGKILLs.
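Since SIGKILL cannot be caught by the process that receives it, one way to report it is from a parent process. The following is a minimal sketch of that idea, not something that exists in SplinterDB: `run_and_report()` is a hypothetical helper that runs a command as a child and prints a message when the child is terminated by a signal. It only uses standard POSIX calls (`fork`, `execvp`, `waitpid`, `strsignal`).

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of SplinterDB): run argv[0] as a child
 * process and report how it ended.
 *
 * Returns the child's exit status, 128 + signal number if the child was
 * terminated by a signal, or -1 if fork/waitpid failed.
 */
int
run_and_report(char *const argv[])
{
   pid_t pid = fork();
   if (pid < 0) {
      perror("fork");
      return -1;
   }
   if (pid == 0) {
      /* Child: replace ourselves with the requested program. */
      execvp(argv[0], argv);
      perror("execvp");
      _exit(127);
   }

   int status = 0;
   if (waitpid(pid, &status, 0) < 0) {
      perror("waitpid");
      return -1;
   }
   if (WIFSIGNALED(status)) {
      int sig = WTERMSIG(status);
      fprintf(stderr,
              "%s: terminated by signal %d (%s)%s\n",
              argv[0],
              sig,
              strsignal(sig),
              sig == SIGKILL
                 ? "; possibly the kernel OOM killer, check dmesg"
                 : "");
      return 128 + sig;
   }
   return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A wrapper `main()` could pass its own `argv + 1` to this helper, so the test command line stays unchanged apart from the wrapper name in front. The hint about the OOM killer is only a guess printed for the reader; the parent cannot tell who sent the SIGKILL.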

rosenhouse commented 2 years ago

the test should be running with default cache size, which is --cache-capacity-gib (1).

But in the command lines you showed above, you have

 --cache-capacity-gib 20

I agree it would be good to improve the error message around this, and perhaps to check for available RAM before trying to set up a 20GB cache.
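A pre-flight check along these lines could look like the sketch below. Both function names are hypothetical and are not SplinterDB APIs. It compares the requested cache size against total physical RAM obtained through `sysconf(_SC_PHYS_PAGES)`, which is a common extension available on Linux rather than strict POSIX. Total RAM is only an upper bound; a stricter check might use currently available memory instead.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Total physical RAM in bytes, or 0 if it cannot be determined. */
uint64_t
physical_ram_bytes(void)
{
   long pages     = sysconf(_SC_PHYS_PAGES); /* extension, present on Linux */
   long page_size = sysconf(_SC_PAGESIZE);
   if (pages <= 0 || page_size <= 0) {
      return 0;
   }
   return (uint64_t)pages * (uint64_t)page_size;
}

/*
 * Hypothetical pre-flight check: returns false and prints a message if the
 * requested cache cannot fit in ram_bytes. If ram_bytes is 0 (unknown), the
 * check is skipped and true is returned.
 */
bool
cache_fits_in_ram(uint64_t cache_bytes, uint64_t ram_bytes)
{
   const double gib = 1024.0 * 1024.0 * 1024.0;
   if (ram_bytes != 0 && cache_bytes > ram_bytes) {
      fprintf(stderr,
              "Requested cache (%.1f GiB) exceeds physical RAM (%.1f GiB); "
              "the process is likely to be killed by the OOM killer.\n",
              (double)cache_bytes / gib,
              (double)ram_bytes / gib);
      return false;
   }
   return true;
}
```

A test driver could call `cache_fits_in_ram(requested_bytes, physical_ram_bytes())` right after parsing `--cache-capacity-gib` and exit with a clear error instead of starting the run.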

gapisback commented 2 years ago

Check OS-level 'free' and dmesg output to see if processes are getting killed. The kernel OOM killer is probably kicking in when the system runs out of memory. This is probably a test-config issue.

carlosgarciaalvarado commented 2 years ago

See https://github.com/vmware/splinterdb/issues/313 as another repro.

gapisback commented 2 years ago

Update: came back to repro this.

On Fusion-VM, which has 15 GiB configured, the following execution fails after a while with SIGKILL. This supports the theory that the OS OOM killer kicked in and sent the SIGKILL: a 20 GiB cache cannot fit in 15 GiB of RAM.

Fusion-LocalVM:[301] $ ./bin/driver_test cache_test --async --cache-capacity-gib 20 --db-capacity-gib 100
./bin/driver_test: splinterdb_build_version 4a9d096f
Dispatch test cache_test

Started cache_test async performance.
fingerprint_size: 27
cache_test: async test started with 1 reader, 0 writer threads (working set=10%)
cache_test: 10485760 pages allocated
test  28% completeKilled

In parallel, the OS-memory usage monitoring script was run, and it reported:

Fusion-LocalVM:[564] $ scripts/monOSmem.sh driver_test 0 "cache_test --async"
---- OS-Memory usage report ----------------------------------------------------
monOSmem.sh: Mon Mar  7 10:00:04 PST 2022 Monitor OS-memory usage for process(es) 'driver_test', args 'cache_test --async' ...
158186 ./bin/driver_test cache_test --async --cache-capacity-gib 20 --db-capacity-gib 100
[...]
--- HWM of OS-Memory usage (GiB) for OS-pid=151706
158186 14 [at loopCtr = 26 of 27]
              total        used        free      shared  buff/cache   available
Mem:             15          14           0           0           1           1
---- ---------------------------------------------------------------------------

The workaround is to run this case with a smaller memory configuration, and enable it in nightly test runs.
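The external monitoring shown above could be complemented by having the test log its own memory high-water mark. This is a small sketch, assuming Linux, where `getrusage()` reports `ru_maxrss` in kilobytes; `peak_rss_kib()` is a hypothetical name and not an existing SplinterDB function.

```c
#include <sys/resource.h>

/*
 * Peak resident set size of the calling process in KiB, or -1 on error.
 * Assumes Linux, where ru_maxrss is expressed in kilobytes.
 */
long
peak_rss_kib(void)
{
   struct rusage ru;
   if (getrusage(RUSAGE_SELF, &ru) != 0) {
      return -1;
   }
   return ru.ru_maxrss;
}
```

Because a process killed by SIGKILL gets no chance to log at exit, this value would need to be printed periodically, for example alongside the "% complete" progress line, to be useful in a case like this one.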