When I set up multiple sample file dirs, I started getting errors like the one below in which:
- two flushes happen around the same time,
- there's an error about a recording having already been deleted,
- then the flush fails with "no garbage row" for the same recording, and it goes into a retry loop that can never end. The in-memory state starts diverging from the database; when the process is eventually killed, nothing has been committed to the database since the failure, although more has been written to disk and must be thrown away on the next startup.
This happens when there's no flush between one `Writer::collect_garbage` call completing and the next one starting. Its call to `delete_garbage` doesn't immediately remove the recording from `garbage` (that happens only at the next flush). So the next call tries to unlink the file again (which fails) and then calls `delete_garbage` on it again, which eventually causes the transaction to fail because the second delete reports that the row doesn't exist.
`collect_garbage` runs in the post-flush notification, so there usually is a flush between calls. But the notification is delivered asynchronously via a channel, so this isn't actually guaranteed when other syncers can also trigger flushes at times it's not expecting.
The solution is pretty simple: make `delete_garbage` immediately remove the recording from `garbage`. Currently there's a `garbage` set (everything in the garbage table) and a `to_gc` list (rows to remove from it in the next transaction). Instead, keep a `garbage_needs_unlink` and a `garbage_unlinked`, which are mutually exclusive.
```
Nov 28 14:57:26 ...: sync-/var/lib/moonfire-nvr/sample-6tb moonfire_db::writer] 6: deleting 30070712 bytes in 1 recordings (18927767 bytes needed)
Nov 28 14:57:50 ...: sync-/var/lib/moonfire-nvr/sample-6tb moonfire_db::db] Flush (why: 90 sec after start of 1 minute 5 seconds garage-main recording): added 5 recordings, deleted 1, marked 0 files GCed.
Nov 28 14:57:50 ...: sync-/var/lib/moonfire-nvr/sample-8tb moonfire_db::db] Flush (why: 90 sec after start of 1 minute 20 seconds courtyard-sub recording): added 0 recordings, deleted 0, marked 0 files GCed.
Nov 28 14:57:50 ...: sync-/var/lib/moonfire-nvr/sample-6tb moonfire_db::writer] dir: recording 6/622053 already deleted!
Nov 28 14:57:52 ...: sync-/var/lib/moonfire-nvr/sample-8tb moonfire_db::writer] 11: deleting 5348864 bytes in 1 recordings (4447102 bytes needed)
Nov 28 14:57:53 ...: sync-/var/lib/moonfire-nvr/sample-8tb moonfire_db::writer] flush failure on save for reason 90 sec after start of 1 minute 32 seconds driveway-sub recording; will retry after PT60S: ErrorMessage { msg: "no garbage row for 6/622053" }
Nov 28 14:57:53 ...: stack backtrace:
Nov 28 14:57:53 ...: 0: 0x81398f - backtrace::backtrace::libunwind::trace::h3efbb541b1fd4a89
Nov 28 14:57:53 ...: at /home/slamb/.cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.9/src/backtrace/libunwind.rs:53
Nov 28 14:57:53 ...: - backtrace::backtrace::trace::he86533e63137b32d
Nov 28 14:57:53 ...: at /home/slamb/.cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.9/src/backtrace/mod.rs:42
Nov 28 14:57:53 ...: 1: 0x812f1f - backtrace::capture::Backtrace::new_unresolved::h1e0ea6bd8272aa64
Nov 28 14:57:53 ...: at /home/slamb/.cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.9/src/capture.rs:88
Nov 28 14:57:53 ...: 2: 0x812747 - failure::backtrace::internal::InternalBacktrace::new::hf30e7136ac39a801
Nov 28 14:57:53 ...: at /home/slamb/.cargo/registry/src/github.com-1ecc6299db9ec823/failure-0.1.3/src/backtrace/internal.rs:44
Nov 28 14:57:53 ...: 3: 0x812a2b - failure::backtrace::Backtrace::new::hbe74d198a344a1b6
Nov 28 14:57:53 ...: at /home/slamb/.cargo/registry/src/github.com-1ecc6299db9ec823/failure-0.1.3/src/backtrace/mod.rs:111
Nov 28 14:57:53 ...: - <failure::backtrace::Backtrace as core::default::Default>::default::h74085fe416c7ca1f
Nov 28 14:57:53 ...: at /home/slamb/.cargo/registry/src/github.com-1ecc6299db9ec823/failure-0.1.3/src/backtrace/mod.rs:125
Nov 28 14:57:53 ...: 4: 0x812a43 - failure::backtrace::Backtrace::new::hbe74d198a344a1b6
Nov 28 14:57:53 ...: 5: 0x5b918b - <failure::error::error_impl::ErrorImpl as core::convert::From<F>>::from::hf5461434cf9cbad6
Nov 28 14:57:53 ...: at /home/slamb/.cargo/registry/src/github.com-1ecc6299db9ec823/failure-0.1.3/src/error/error_impl.rs:19
Nov 28 14:57:53 ...: - <failure::error::Error as core::convert::From<F>>::from::h9373489268132011
Nov 28 14:57:53 ...: at /home/slamb/.cargo/registry/src/github.com-1ecc6299db9ec823/failure-0.1.3/src/error/mod.rs:36
Nov 28 14:57:53 ...: - failure::error_message::err_msg::ha9648b9e3acef653
Nov 28 14:57:53 ...: at /home/slamb/.cargo/registry/src/github.com-1ecc6299db9ec823/failure-0.1.3/src/error_message.rs:12
Nov 28 14:57:53 ...: - moonfire_db::raw::mark_sample_files_deleted::hd912400c9861c3a1
Nov 28 14:57:53 ...: at db/raw.rs:297
Nov 28 14:57:53 ...: 6: 0x4b829b - moonfire_db::db::LockedDatabase::flush::h3dc2c842b26129c2
Nov 28 14:57:53 ...: at /home/slamb/git/moonfire-nvr/db/db.rs:850
Nov 28 14:57:53 ...: 7: 0x509c67 - <moonfire_db::db::DatabaseGuard<'db, C>>::flush::h7b1c01d01aa74d02
Nov 28 14:57:53 ...: at /home/slamb/git/moonfire-nvr/db/db.rs:1883
Nov 28 14:57:53 ...: - <moonfire_db::writer::Syncer<C, D>>::flush::h3b5933ac754e19e8
Nov 28 14:57:53 ...: at /home/slamb/git/moonfire-nvr/db/writer.rs:454
Nov 28 14:57:53 ...: - <moonfire_db::writer::Syncer<C, D>>::run::h65b427485c2a355b
Nov 28 14:57:53 ...: at /home/slamb/git/moonfire-nvr/db/writer.rs:365
Nov 28 14:57:53 ...: 8: 0x4a32bf - moonfire_db::writer::start_syncer::{{closure}}::h8d8d9817454357db
Nov 28 14:57:53 ...: at /home/slamb/git/moonfire-nvr/db/writer.rs:147
Nov 28 14:57:53 ...: - std::sys_common::backtrace::__rust_begin_short_backtrace::h478cfad5bec097cb
Nov 28 14:57:53 ...: at /rustc/400c2bc5ed292f77c49693320f4eda37bb375e90/src/libstd/sys_common/backtrace.rs:136
Nov 28 14:57:53 ...: 9: 0x47a023 - std::thread::Builder::spawn_unchecked::{{closure}}::{{closure}}::h0bcfd3639b5817f7
Nov 28 14:57:53 ...: at /rustc/400c2bc5ed292f77c49693320f4eda37bb375e90/src/libstd/thread/mod.rs:477
Nov 28 14:57:53 ...: - <std::panic::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once::hf8f01a225dda73a6
Nov 28 14:57:53 ...: at /rustc/400c2bc5ed292f77c49693320f4eda37bb375e90/src/libstd/panic.rs:319
Nov 28 14:57:53 ...: - std::panicking::try::do_call::h6661548dfb451858
Nov 28 14:57:53 ...: at /rustc/400c2bc5ed292f77c49693320f4eda37bb375e90/src/libstd/panicking.rs:310
Nov 28 14:57:53 ...: 10: 0x8e220b - __rust_maybe_catch_panic
Nov 28 14:57:53 ...: at src/libpanic_unwind/lib.rs:102
Nov 28 14:57:53 ...: 11: 0x44fd3b - std::panicking::try::h5b49cbb90fd9a1f2
Nov 28 14:57:53 ...: at /rustc/400c2bc5ed292f77c49693320f4eda37bb375e90/src/libstd/panicking.rs:289
Nov 28 14:57:53 ...: - std::panic::catch_unwind::hdb03e368bff69de4
Nov 28 14:57:53 ...: at /rustc/400c2bc5ed292f77c49693320f4eda37bb375e90/src/libstd/panic.rs:398
Nov 28 14:57:53 ...: - std::thread::Builder::spawn_unchecked::{{closure}}::h5c6ba7cb08b38b14
Nov 28 14:57:53 ...: at /rustc/400c2bc5ed292f77c49693320f4eda37bb375e90/src/libstd/thread/mod.rs:476
Nov 28 14:57:53 ...: - <F as alloc::boxed::FnBox<A>>::call_box::h97a0e387b9befaf7
Nov 28 14:57:53 ...: at /rustc/400c2bc5ed292f77c49693320f4eda37bb375e90/src/liballoc/boxed.rs:673
Nov 28 14:57:53 ...: 12: 0x8dbe6f - <alloc::boxed::Box<(dyn alloc::boxed::FnBox<A, Output=R> + 'a)> as core::ops::function::FnOnce<A>>::call_once::hd1734190ba15b47e
Nov 28 14:57:53 ...: at /rustc/400c2bc5ed292f77c49693320f4eda37bb375e90/src/liballoc/boxed.rs:683
Nov 28 14:57:53 ...: - std::sys_common::thread::start_thread::h30eac6e77e888779
Nov 28 14:57:53 ...: at src/libstd/sys_common/thread.rs:24
Nov 28 14:57:53 ...: - std::sys::unix::thread::Thread::new::thread_start::h6b3c4b2ce6075f65
Nov 28 14:57:53 ...: at src/libstd/sys/unix/thread.rs:90
```