Open phyro opened 3 years ago
This could definitely use a CI test that exposes the problem; that might be a good place to start.
I think we may have actually tracked down the NotFoundErr("BLOCK HEADER:
error when we were frantically implementing the "rewind bad block" logic for #3605.
Or at least one situation where it can surface. It's related to the "sync MMR" and I believe fixed here -
tl;dr the sync MMR gets "out of sync" relative to the header MMR and ends up in a state where it continues to refer to a header that no longer exists.
I'm not sure this actually resolves the issue identified in this PR. It's also not technically data corruption, as we can recover from it; it's more a case of us just handling it badly on startup.
To be clear we definitely do still have situations where it is possible to legitimately corrupt the data on disk, but this missing header is (I believe) a less severe problem that can be handled more robustly during node startup (and I think we do that now with the fix linked above).
- Remove the .grin folder if you have it
- Open grin node and let it start sync
- Close it after 20 seconds or so by clicking 'X'
So yes, this is happening during "header sync". In this specific case we are not deleting a header for it to go missing; instead we are shutting the node down during the period where it is writing the sync MMR files to disk but has not yet committed the batch of headers to the db. On next startup the sync MMR is out beyond the headers in the db and the "sync head" points to a non-existent header. We want to reset the sync MMR to an earlier state, which I believe we now correctly do in this situation; we just need to get past the initial startup, where it makes an assumption that the header exists.
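To illustrate the recovery described above, here is a minimal sketch of rewinding a sync MMR until its head refers to a header that actually exists in the db. Everything in it is an assumption for illustration: `Stores`, `Hash`, and `rewind_sync_mmr` are hypothetical names, and the MMR is modelled as a plain list of leaf hashes rather than grin's real data structures.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a header hash (not grin's real type).
type Hash = u64;

/// Hypothetical in-memory model of the two stores that can disagree.
struct Stores {
    /// Hashes of headers that were actually committed to the db.
    db_headers: HashSet<Hash>,
    /// Header hashes written to the sync MMR, oldest first.
    sync_mmr: Vec<Hash>,
}

impl Stores {
    /// Drop trailing sync MMR entries until the last one refers to a
    /// header present in the db. Returns the resulting sync head, or
    /// `None` if no entry survives the rewind.
    fn rewind_sync_mmr(&mut self) -> Option<Hash> {
        while let Some(&last) = self.sync_mmr.last() {
            if self.db_headers.contains(&last) {
                return Some(last);
            }
            // The MMR got ahead of the db (e.g. shutdown between the MMR
            // write and the db commit), so discard the dangling entry.
            self.sync_mmr.pop();
        }
        None
    }
}
```

Running this once at startup, before anything dereferences the sync head, would avoid assuming that the header exists.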
Describe the bug
Running the node on Windows and closing by clicking on "X" in the top right corner of the windows app has a high probability of corrupting the database.
To Reproduce
It doesn't always corrupt the data, but roughly every third try ends in a bad state from which the node can't recover.
Relevant Information
Logs when trying to run the node after a corruption occurred.
Screenshots /
Desktop (please complete the following information): I've noticed this on Windows.
Additional context
It works if you quit the "Grin way" by pressing 'q' and waiting for the cleanup prior to shutdown.
This probably gives some context around the problem:
https://stackoverflow.com/questions/26658707/windows-console-application-signal-for-closing-event
I believe the ctrlc package we use may not support the SIGBREAK signal. Perhaps using a different library to catch these signals and reacting to them in this case as well would solve the issue, but I didn't dive too deep.
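For what it's worth, here is a rough sketch of the cooperative shutdown pattern this would feed into, using only the standard library. It is an assumption about how such a handler could be wired up, not how grin does it: `run_sync` and its arguments are hypothetical, and the handler that catches the close event (not shown) would simply set the flag.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical sync loop. The stop flag is only checked between batches,
/// so the MMR write and the db commit for a batch are never split apart.
/// Returns the number of batches that were fully committed.
fn run_sync(
    stop: &AtomicBool,
    batches: &[Vec<u64>],
    mmr: &mut Vec<u64>,
    db: &mut Vec<u64>,
) -> usize {
    let mut committed = 0;
    for batch in batches {
        // A close-event handler (not shown) would call
        // `stop.store(true, Ordering::SeqCst)` to request shutdown.
        if stop.load(Ordering::SeqCst) {
            break;
        }
        mmr.extend_from_slice(batch); // stand-in for writing the sync MMR files
        db.extend_from_slice(batch); // stand-in for committing the header batch
        committed += 1;
    }
    committed
}
```

This only helps if the process is given enough time to reach the next batch boundary after the close event arrives, so the startup recovery is still needed as a fallback.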