Open gapisback opened 2 years ago
On my Nimbus VM with 130GB RAM, I cannot reproduce this error.
I suggest you try re-running with a smaller cache -- if your VM has less than 20GB RAM, then this is likely an out of memory error.
@rosenhouse - My Nimbus VM has 64 GB RAM, and the test should be running with the default cache size, which is --cache-capacity-gib (1).
The test itself is not fully reporting what cache it was using, and I'm assuming it's using the default.
My issue is not just that the test ran into an OOM (if that is what happened). I am trying to see if we can chase down these silent errors and better report things like SIGKILLs.
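One way to make such a kill visible is to have a small wrapper fork the test and inspect its wait status. The following is only a minimal sketch under that assumption; `describe_wait_status()` and `run_and_report()` are hypothetical names, not existing SplinterDB functions.

```c
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: turn a wait status into a readable message.
 * Returns a shell-style exit code (128 + signal number if signaled). */
static int
describe_wait_status(int status, char *buf, size_t buflen)
{
   if (WIFEXITED(status)) {
      snprintf(buf, buflen, "exited with status %d", WEXITSTATUS(status));
      return WEXITSTATUS(status);
   }
   if (WIFSIGNALED(status)) {
      int sig = WTERMSIG(status);
      snprintf(buf, buflen, "terminated by signal %d%s", sig,
               (sig == SIGKILL)
                  ? " (SIGKILL; possibly the kernel OOM killer, check dmesg)"
                  : "");
      return 128 + sig;
   }
   snprintf(buf, buflen, "unexpected wait status 0x%x", (unsigned)status);
   return 1;
}

/* Hypothetical helper: run argv[0] in a child process and report on
 * stderr how it ended, so a SIGKILL is no longer silent. */
static int
run_and_report(char *const argv[])
{
   pid_t pid = fork();
   if (pid < 0) {
      perror("fork");
      return 1;
   }
   if (pid == 0) {
      execvp(argv[0], argv);
      perror("execvp"); /* Reached only if exec failed */
      _exit(127);
   }
   int status = 0;
   if (waitpid(pid, &status, 0) < 0) {
      perror("waitpid");
      return 1;
   }
   char msg[128];
   int rc = describe_wait_status(status, msg, sizeof(msg));
   fprintf(stderr, "%s: %s\n", argv[0], msg);
   return rc;
}
```

A process cannot catch its own SIGKILL, so the report has to come from a parent like this (or from the shell or test harness that launched the test).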
the test should be running with default cache size, which is --cache-capacity-gib (1).
But in the command lines you showed above, you have --cache-capacity-gib 20.
I agree it would be good to improve the error message around this, and especially to perhaps check for available RAM before trying to set up a 20GB cache.
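A pre-flight check along these lines could look like the sketch below. It is only an illustration: `available_ram_bytes()` and `cache_fits_in_ram()` are hypothetical helpers, not part of SplinterDB, and `_SC_AVPHYS_PAGES` is a glibc/Linux extension rather than POSIX.

```c
#include <stdbool.h>
#include <stdint.h>
#include <unistd.h>

#define GiB (1024ULL * 1024ULL * 1024ULL)

/* Hypothetical helper: physical memory currently free, in bytes.
 * Returns 0 if the value cannot be determined. This counts free pages
 * only (not reclaimable page cache), so it is a conservative estimate. */
static uint64_t
available_ram_bytes(void)
{
   long pages     = sysconf(_SC_AVPHYS_PAGES);
   long page_size = sysconf(_SC_PAGESIZE);
   if (pages < 0 || page_size < 0) {
      return 0;
   }
   return (uint64_t)pages * (uint64_t)page_size;
}

/* Hypothetical helper: does a cache of cache_gib GiB fit in avail_bytes?
 * If the available amount is unknown (0), do not block the run. */
static bool
cache_fits_in_ram(uint64_t cache_gib, uint64_t avail_bytes)
{
   if (avail_bytes == 0) {
      return true;
   }
   /* Compare in whole GiB to avoid overflow in cache_gib * GiB */
   return cache_gib <= avail_bytes / GiB;
}
```

A test driver could call `cache_fits_in_ram(20, available_ram_bytes())` before allocating and print a warning or fail early when it returns false. Because of memory overcommit this is only a heuristic: the allocation itself can succeed and the process can still be killed later, once the pages are touched.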
Check OS-level 'free' and dmesg output to see if processes are getting killed. The kernel's OOM killer probably kicks in when the system runs out of memory. This is probably a test-config issue.
See https://github.com/vmware/splinterdb/issues/313 as another repro.
Update: came back to repro this.
On a Fusion VM, which has 15 GiB configured, the following execution fails after a while with SIGKILL. This supports the theory that the OS OOM killer kicks in and sends the SIGKILL, which is why this fails.
Fusion-LocalVM:[301] $ ./bin/driver_test cache_test --async --cache-capacity-gib 20 --db-capacity-gib 100
./bin/driver_test: splinterdb_build_version 4a9d096f
Dispatch test cache_test
Started cache_test async performance.
fingerprint_size: 27
cache_test: async test started with 1 reader, 0 writer threads (working set=10%)
cache_test: 10485760 pages allocated
test 28% completeKilled
In parallel, the OS-memory usage monitoring script was run, and it reported:
Fusion-LocalVM:[564] $ scripts/monOSmem.sh driver_test 0 "cache_test --async"
---- OS-Memory usage report ----------------------------------------------------
monOSmem.sh: Mon Mar 7 10:00:04 PST 2022 Monitor OS-memory usage for process(es) 'driver_test', args 'cache_test --async' ...
158186 ./bin/driver_test cache_test --async --cache-capacity-gib 20 --db-capacity-gib 100
[...]
--- HWM of OS-Memory usage (GiB) for OS-pid=151706
158186 14 [at loopCtr = 26 of 27]
total used free shared buff/cache available
Mem: 15 14 0 0 1 1
---- ---------------------------------------------------------------------------
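A similar high-water mark can be read directly on Linux from the `VmHWM` field of `/proc/<pid>/status`, which records the peak resident set size of a process. The sketch below assumes that approach; it is not how `monOSmem.sh` is known to work, and `parse_vmhwm_kib()` and `read_vmhwm_kib()` are hypothetical names.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: parse one line of /proc/<pid>/status.
 * Returns the VmHWM value in KiB, or -1 if this is not a VmHWM line. */
static long
parse_vmhwm_kib(const char *line)
{
   long kib = -1;
   if (strncmp(line, "VmHWM:", 6) == 0
       && sscanf(line + 6, "%ld", &kib) == 1) {
      return kib;
   }
   return -1;
}

/* Hypothetical helper: peak resident set size of a process, in KiB.
 * Returns -1 if /proc is unavailable or the field is missing. */
static long
read_vmhwm_kib(pid_t pid)
{
   char path[64];
   snprintf(path, sizeof(path), "/proc/%ld/status", (long)pid);
   FILE *fp = fopen(path, "r");
   if (fp == NULL) {
      return -1;
   }
   char line[256];
   long kib = -1;
   while (kib < 0 && fgets(line, sizeof(line), fp) != NULL) {
      kib = parse_vmhwm_kib(line);
   }
   fclose(fp);
   return kib;
}
```

Calling `read_vmhwm_kib(getpid())` near the end of a test, or polling it for the test's pid from a monitor process, gives a per-process peak that can be compared against the machine's total RAM.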
The workaround is to run this case with a smaller memory configuration and enable it in nightly test runs.
On /main, this fails silently: Hard to know where it's failing:
Same silent failure in DEBUG build. Under the debugger, it reports that program terminated with SIGKILL ... Not sure where this is happening. Could not get a backtrace.