neondatabase / neon

Neon: Serverless Postgres. We separated storage and compute to offer autoscaling, code-like database branching, and scale to zero.
https://neon.tech
Apache License 2.0
14.27k stars 408 forks source link

Epic: safekeeper fuzzer #547

Closed kelvich closed 6 months ago

kelvich commented 3 years ago

Standalone binary that checks all possible postgres/safekeeper histories up to a certain depth, by using actual safekeeper code and checking invariant along the way.

kelvich commented 2 years ago

Related experiments: https://github.com/neondatabase/neon/compare/main...cargo-fuzz-experiment and https://github.com/neondatabase/neon/compare/main...kelvich/struct-fuzzing and https://github.com/neondatabase/neon/compare/main...fuzz-test-safekeeper

cc @arssher @petuhovskiy @yeputons

bojanserafimov commented 2 years ago

To summarize the current experiments: cargo-fuzz (interface for libfuzzer, coverage-guided genetic algo) connected to a single inmem safekeeper is smart enough to quickly find some panics. These bugs don't cause data loss, but they allow one bad compute node to DOS the multi-tenant safekeeper, so worth solving at some point.

A more interesting fuzzer would be to connect 3 safekeepers as an inmem state machine with message delivery controlled by the fuzzer, and try to break some correctness assertions.

arssher commented 2 years ago

Perhaps I missed that, what is an example of panic?

bojanserafimov commented 2 years ago

The examples are not checked into the branch. Within a few minutes running on my laptop it found:

  1. "can't allocate that much memory" in TermHistory::from_bytes. I guess the fuzzer just passed a large value for n_entries
  2. Overflow on LSN addition. Not sure how, but I can minify the generated input if you're curious
  3. Stas found '<unnamed>' panicked at 'assertion failed: commit_lsn >= self.inmem.commit_lsn', /Users/stas/code/zenith/safekeeper/src/safekeeper.rs:745:9. Could be some unchecked overflow somewhere as well, not sure.
arssher commented 2 years ago
  1. Just backtrace would be already interesting.
  2. Hmm, not sure. Overflow should panic before failing asserts...
bojanserafimov commented 2 years ago

Backtrace for (2):

thread '<unnamed>' panicked at 'called `Option::unwrap()` on a `None` value', /home/bojan/src/neondatabase/neon/libs/utils/src/lsn.rs:165:39
stack backtrace:
   0:     0x5611736ce04c - std::backtrace_rs::backtrace::libunwind::trace::hf7449935ead7573e
                               at /rustc/e6b883c74f49f32cb5d1cbad3457f2b8805a4a38/library/std/src/../../backtrace/src/backtrace/libunwind.rs:93:5
   1:     0x5611736ce04c - std::backtrace_rs::backtrace::trace_unsynchronized::h221aa2d88d72372a
                               at /rustc/e6b883c74f49f32cb5d1cbad3457f2b8805a4a38/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
   2:     0x5611736ce04c - std::sys_common::backtrace::_print_fmt::h1c77e8983e1df895
                               at /rustc/e6b883c74f49f32cb5d1cbad3457f2b8805a4a38/library/std/src/sys_common/backtrace.rs:67:5
   3:     0x5611736ce04c - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::hd4ec41a9a6b0d22c
                               at /rustc/e6b883c74f49f32cb5d1cbad3457f2b8805a4a38/library/std/src/sys_common/backtrace.rs:46:22
   4:     0x561173729d9c - core::fmt::write::h72801a82c94e6ff1
                               at /rustc/e6b883c74f49f32cb5d1cbad3457f2b8805a4a38/library/core/src/fmt/mod.rs:1149:17
   5:     0x5611736be7f5 - std::io::Write::write_fmt::haf74340a8cbeaa88
                               at /rustc/e6b883c74f49f32cb5d1cbad3457f2b8805a4a38/library/std/src/io/mod.rs:1697:15
   6:     0x5611736d1210 - std::sys_common::backtrace::_print::h2d15cd157796a64a
                               at /rustc/e6b883c74f49f32cb5d1cbad3457f2b8805a4a38/library/std/src/sys_common/backtrace.rs:49:5
   7:     0x5611736d1210 - std::sys_common::backtrace::print::h52d286d22e2398eb
                               at /rustc/e6b883c74f49f32cb5d1cbad3457f2b8805a4a38/library/std/src/sys_common/backtrace.rs:36:9
   8:     0x5611736d1210 - std::panicking::default_hook::{{closure}}::h6da08fba6306daf2
                               at /rustc/e6b883c74f49f32cb5d1cbad3457f2b8805a4a38/library/std/src/panicking.rs:211:50
   9:     0x5611736d0dbb - std::panicking::default_hook::h266f67a22e76b11a
                               at /rustc/e6b883c74f49f32cb5d1cbad3457f2b8805a4a38/library/std/src/panicking.rs:228:9
  10:     0x56117364b9be - libfuzzer_sys::initialize::{{closure}}::h73ccfb35526c0e6e
  11:     0x5611736d1a29 - std::panicking::rust_panic_with_hook::he55698a957f4fb6d
                               at /rustc/e6b883c74f49f32cb5d1cbad3457f2b8805a4a38/library/std/src/panicking.rs:610:17
  12:     0x5611736d14b2 - std::panicking::begin_panic_handler::{{closure}}::h01f453c3ac181895
                               at /rustc/e6b883c74f49f32cb5d1cbad3457f2b8805a4a38/library/std/src/panicking.rs:500:13
  13:     0x5611736ce4f4 - std::sys_common::backtrace::__rust_end_short_backtrace::h675d77c6e5a3926d
                               at /rustc/e6b883c74f49f32cb5d1cbad3457f2b8805a4a38/library/std/src/sys_common/backtrace.rs:139:18
  14:     0x5611736d1449 - rust_begin_unwind
                               at /rustc/e6b883c74f49f32cb5d1cbad3457f2b8805a4a38/library/std/src/panicking.rs:498:5
  15:     0x5611712df941 - core::panicking::panic_fmt::h7b8580d81fcbbacd
                               at /rustc/e6b883c74f49f32cb5d1cbad3457f2b8805a4a38/library/core/src/panicking.rs:107:14
  16:     0x5611712df88d - core::panicking::panic::h50b51d19800453c0
                               at /rustc/e6b883c74f49f32cb5d1cbad3457f2b8805a4a38/library/core/src/panicking.rs:48:5
  17:     0x561171a8d01d - <utils::lsn::Lsn as core::ops::arith::Add<u64>>::add::h788cc3fd8a06afb8
  18:     0x56117165c445 - safekeeper::safekeeper::SafeKeeper<CTRL,WAL>::handle_append_request::hd750ea1d9c393dc2
  19:     0x561171653198 - safekeeper::safekeeper::SafeKeeper<CTRL,WAL>::process_msg::h620208424a17d469
  20:     0x56117165dac5 - safekeeper::safekeeper::try_fuzz::h84ab0a9402590566
  21:     0x56117165e0f5 - safekeeper::safekeeper::fuzz::hb5eaa8cae2c576f2
  22:     0x56117364b759 - LLVMFuzzerTestOneInput
  23:     0x561173667ee6 - _ZN6fuzzer6Fuzzer15ExecuteCallbackEPKhm
  24:     0x56117365b1ff - _ZN6fuzzer10RunOneTestEPNS_6FuzzerEPKcm
  25:     0x56117365f05b - _ZN6fuzzer12FuzzerDriverEPiPPPcPFiPKhmE
  26:     0x5611712e00d7 - main
  27:     0x7f063e01afd0 - <unknown>
  28:     0x7f063e01b07d - __libc_start_main
  29:     0x5611712e0275 - _start
  30:                0x0 - <unknown>
==1097335== ERROR: libFuzzer: deadly signal
NOTE: libFuzzer has rudimentary signal handlers.
      Combine libFuzzer with AddressSanitizer or similar for better crash reports.
SUMMARY: libFuzzer: deadly signal
kelvich commented 2 years ago

Overflow on LSN addition.

I think fuzzer just takes lsn values around UINT_MAX. I've found one place of form lsn1 + delta < lsn2 and changed it to lsn2 - lsn1 > delta, but that was not enough to fix it -- may there are other places like that (for some reason I can't see line numbers in backtrace even in a debug mode).

kelvich commented 2 years ago

But generally, with this struct-based fuzzing we can set up several safekeeper instances in test and check Paxos invariants over the committed history.

Downside is that this test takes 3-4 minutes to compile :)

arssher commented 1 year ago

Previous attempts worked only on safekeeper (rust part) code, but we want both compute and safekeepers parts involved, so hijacking this issue for the next iteration, cc @petuhovskiy