mimblewimble / grin

Minimal implementation of the Mimblewimble protocol.
https://grin.mw/
Apache License 2.0
5.04k stars 990 forks source link

[Windows] Ungraceful shutdown of node corrupts the database #3608

Open phyro opened 3 years ago

phyro commented 3 years ago

Describe the bug Running the node on Windows and closing by clicking on "X" in the top right corner of the windows app has a high probability of corrupting the database.

To Reproduce

  1. Download the Window grin-node release
  2. Remove the .grin folder if you have it
  3. Open grin node and let it start sync
  4. Close it after 20 seconds or so by clicking 'X'

It doesn't always corrupt the data, but every 3rd try it usually ends in a bad state from which it can't recover.

Relevant Information Logs when trying to run the node after a corruption occured.

20210109 18:59:05.601 INFO grin_util::logger - log4rs is initialized, file level: Info, stdout level: Warn, min. level: Info
20210109 18:59:05.603 INFO grin - Using configuration file at C:\Users\phyro\.grin\main\grin-server.toml
20210109 18:59:05.603 INFO grin - This is Grin version 5.0.1 (git v5.0.1), built for x86_64-pc-windows-msvc by rustc 1.48.0 (7eac88abb 2020-11-16).
20210109 18:59:05.603 INFO grin - Chain: Mainnet
20210109 18:59:05.603 INFO grin - Accept Fee Base: 500000
20210109 18:59:05.603 INFO grin - Future Time Limit: 300
20210109 18:59:05.603 INFO grin - Feature: NRD kernel enabled: false
20210109 18:59:05.603 WARN grin::cmd::server - Starting GRIN in UI mode...
20210109 18:59:05.603 INFO grin_servers::grin::server - Starting server, genesis block: 40adad0aec27
20210109 18:59:05.779 ERROR grin_util::logger - 
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Chain(Error { inner: NotFoundErr("BLOCK HEADER: 000007f722ca")

Store Error: NotFoundErr("BLOCK HEADER: 000007f722ca"), reason: DB Not Found Error: BLOCK HEADER: 000007f722ca })': src/bin\cmd\server.rs:58   0: <unknown>
   1: <unknown>
   2: <unknown>
   3: <unknown>
   4: <unknown>
   5: <unknown>
   6: <unknown>
   7: <unknown>
   8: <unknown>
   9: <unknown>
  10: <unknown>
  11: <unknown>
  12: <unknown>
  13: <unknown>
  14: <unknown>
  15: <unknown>
  16: <unknown>
  17: <unknown>
  18: BaseThreadInitThunk
  19: RtlUserThreadStart

Screenshots /

Desktop (please complete the following information): I've noticed this on Windows.

Additional context

It works if you quit the "Grin way" by pressing 'q' and waiting for the cleanup prior to shutdown.

Here's probably some context around the problem.

https://stackoverflow.com/questions/26658707/windows-console-application-signal-for-closing-event

I believe the ctrlc package we use may not support the SIGBREAK signal. Perhaps using a different library to catch these signals and then reacting on it also in this case would solve the issue, but I didn't dive too deep.

trevyn commented 3 years ago

This could definitely use a CI test that exposes the problem; that might be a good place to start.

antiochp commented 3 years ago

I think we may have actually tracked down the NotFoundErr("BLOCK HEADER: error when we were frantically implementing the "rewind bad block" logic for #3605.

Or at least one situation where it can surface. Its related to the "sync MMR" and I believe fixed here -

https://github.com/mimblewimble/grin/pull/3603/files#diff-29a4dbf8cadda7c8ed1b4e35a237dfebe170b3703d7af2c78c065ae65ab04100

tl;dr the sync MMR gets "out of sync" relative to the header MMR and ends up in a state where it continues to refer to a header that no longer exists.

I'm not sure this actually resolves the issue identified in this PR. Its also not technically data corruption as we can recover from it and its more a case of us just handling it badly on startup.

To be clear we definitely do still have situations where it is possible to legitimately corrupt the data on disk, but this missing header is (I believe) a less severe problem that can be handled more robustly during node startup (and I think we do that now with the fix linked above).

antiochp commented 3 years ago
  • Remove the .grin folder if you have it
  • Open grin node and let it start sync
  • Close it after 20 seconds or so by clicking 'X'

So yes this is happening during "header sync" so in this specific case we are not deleting a header for it go go missing, but we are shutting the node down during the period where it is writing the sync MMR files to disk, but has not yet committed the batch of headers to the db. So on next startup the sync MMR is out beyond the headers in the db and the "sync head" points to a non-existent header. We want to reset the sync MMR to an earlier state which I believe we now correctly do in this situation, we just need to get past the initial startup where it makes an assumption that the header exists.