mimblewimble / grin

Minimal implementation of the Mimblewimble protocol.
https://grin.mw/
Apache License 2.0
5.04k stars 992 forks

gracefully handle out of disk space failures #3425

Open jlopp opened 3 years ago

jlopp commented 3 years ago

Describe the bug

grin fails to recover gracefully from a crash caused by running out of disk space.

To Reproduce

Steps to reproduce the behavior:

  1. Run grin
  2. Run out of disk space; grin panics
  3. Increase disk space and restart grin

Relevant Information

20200817 16:51:27.347 INFO grin - Using configuration file at /home/jameson/.grin/main/grin-server.toml
20200817 16:51:27.347 INFO grin - This is Grin version 4.0.2 (git v4.0.2), built for x86_64-unknown-linux-gnu by rustc 1.45.2 (d3fb005a3 2020-07-31).
20200817 16:51:27.347 DEBUG grin - Built with profile "release", features "".
20200817 16:51:27.347 INFO grin - Chain: Mainnet
20200817 16:51:27.347 INFO grin - Feature: NRD kernel enabled: false
20200817 16:51:27.347 WARN grin::cmd::server - Starting GRIN in UI mode...
20200817 16:51:27.354 INFO grin_servers::grin::server - Starting server, genesis block: 40adad0aec27
20200817 16:51:27.358 DEBUG grin_store::lmdb - DB Mapsize for /home/jameson/.grin/main/chain_data/lmdb is 549755813888
20200817 16:51:27.431 DEBUG grin_store::leaf_set - bitmap 162820 pos (315706 bytes)
20200817 16:51:29.779 DEBUG grin_store::prune_list - bitmap 478437 pos (718704 bytes), pruned_cache 6843301 pos (772299 bytes), shift_cache 478437, leaf_shift_cache 478437
20200817 16:51:29.920 DEBUG grin_store::leaf_set - bitmap 162820 pos (315706 bytes)
20200817 16:51:32.365 DEBUG grin_store::prune_list - bitmap 478437 pos (718704 bytes), pruned_cache 6843301 pos (772299 bytes), shift_cache 478437, leaf_shift_cache 478437
20200817 16:51:32.407 DEBUG grin_chain::txhashset::bitmap_accumulator - applied 3777 chunks from idx 0 to idx 3776 (41ms)
20200817 16:51:34.074 DEBUG grin_chain::txhashset::txhashset - attempting to open kernel PMMR using ProtocolVersion(2) - FAIL (verify failed)
20200817 16:51:34.117 DEBUG grin_chain::txhashset::txhashset - attempting to open kernel PMMR using ProtocolVersion(1) - SUCCESS
20200817 16:51:34.325 ERROR grin_util::logger - 
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Chain(Error { inner: Other Error: failed to find head hash })': src/bin/cmd/server.rs:48
   0: grin_util::logger::send_panic_to_log::{{closure}}
   1: std::panicking::rust_panic_with_hook
             at /rustc/d3fb005a39e62501b8b0b356166e515ae24e2e54/src/libstd/panicking.rs:490
   2: rust_begin_unwind
             at /rustc/d3fb005a39e62501b8b0b356166e515ae24e2e54/src/libstd/panicking.rs:388
   3: core::panicking::panic_fmt
             at /rustc/d3fb005a39e62501b8b0b356166e515ae24e2e54/src/libcore/panicking.rs:101
   4: core::option::expect_none_failed
             at /rustc/d3fb005a39e62501b8b0b356166e515ae24e2e54/src/libcore/option.rs:1272
   5: grin::cmd::server::start_server_tui
   6: grin::cmd::server::server_command
   7: grin::real_main
   8: grin::main
   9: std::rt::lang_start::{{closure}}
  10: std::rt::lang_start_internal::{{closure}}
             at /rustc/d3fb005a39e62501b8b0b356166e515ae24e2e54/src/libstd/rt.rs:52
      std::panicking::try::do_call
             at /rustc/d3fb005a39e62501b8b0b356166e515ae24e2e54/src/libstd/panicking.rs:297
      std::panicking::try
             at /rustc/d3fb005a39e62501b8b0b356166e515ae24e2e54/src/libstd/panicking.rs:274
      std::panic::catch_unwind
             at /rustc/d3fb005a39e62501b8b0b356166e515ae24e2e54/src/libstd/panic.rs:394
      std::rt::lang_start_internal
             at /rustc/d3fb005a39e62501b8b0b356166e515ae24e2e54/src/libstd/rt.rs:51
  11: main
  12: __libc_start_main
  13: _start
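
The backtrace shows the panic coming from a `Result::unwrap()` on the server start path. As a minimal sketch of the alternative (not grin's actual code), the error could be matched and reported with a recovery hint instead of unwinding. `StartupError`, `start_server` and `run` below are hypothetical stand-ins invented for illustration; only the "failed to find head hash" message and the `grin --clean` command come from this report.

```rust
use std::fmt;

// Hypothetical stand-in for the chain error seen in the log above;
// this is NOT grin's real error type.
#[derive(Debug)]
enum StartupError {
    MissingHead,
}

impl fmt::Display for StartupError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            StartupError::MissingHead => write!(f, "failed to find head hash"),
        }
    }
}

// Hypothetical stand-in for server startup: it succeeds only if the
// store still knows its chain head.
fn start_server(head_hash: Option<&str>) -> Result<(), StartupError> {
    match head_hash {
        Some(_) => Ok(()),
        None => Err(StartupError::MissingHead),
    }
}

// Returns a process exit code instead of panicking on a startup error.
fn run(head_hash: Option<&str>) -> i32 {
    match start_server(head_hash) {
        Ok(()) => 0,
        Err(e) => {
            eprintln!("could not start server: {}", e);
            eprintln!("chain data may be corrupt; `grin --clean` forces a full resync");
            1
        }
    }
}
```

With this shape, a caller would pass the result of `run` to `std::process::exit`, so a corrupt store produces a readable message rather than a backtrace.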

I ran grin --clean and it appears to have wiped all 6 GB of chain data; the node is now resyncing from genesis.

antiochp commented 3 years ago

Hey @jlopp. Thanks for reporting this.

We have a couple of "known" edge cases where file corruption can occur on non-clean shutdown. Running out of disk space is likely to exercise at least one of those.

I'd like to take another look at how we handle writing files to disk (these are the global MMR files) if we can get some time to do so. Hopefully we can get this into a more robust state prior to the final scheduled hard fork early next year.
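
For illustration only, one common way to make a file write survive a crash or a full disk is to write to a temporary file, sync it, and then rename it over the target. This is a generic sketch under that assumption, not a description of how grin writes its files today or of what the fix will look like; `write_atomically` is a hypothetical helper name.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

// Hypothetical helper: replace `path` with `data` so that a reader sees
// either the complete old contents or the complete new contents.
fn write_atomically(path: &Path, data: &[u8]) -> io::Result<()> {
    // Temp file lives next to the target (same filesystem), with the
    // extension swapped for ".tmp".
    let tmp_path = path.with_extension("tmp");

    // Write and flush the new contents to the temp file first.
    let written = File::create(&tmp_path).and_then(|mut tmp| {
        tmp.write_all(data)?;
        tmp.sync_all()
    });

    if let Err(e) = written {
        // For example, the disk is full: drop the partial temp file and
        // leave the original file untouched.
        let _ = fs::remove_file(&tmp_path);
        return Err(e);
    }

    // Swap the temp file into place over the target.
    fs::rename(&tmp_path, path)
}
```

If the write fails partway through, only the temporary file is affected. A fuller version would also sync the parent directory after the rename, which this sketch omits.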

antiochp commented 3 years ago

Related - https://github.com/mimblewimble/grin/pull/3266

Also related - https://github.com/mimblewimble/grin/issues/3352