spacemeshos / post

Spacemesh POST protocol implementation
MIT License

Check the error handling next to postdata_metadata.json #182

Closed pigmej closed 1 year ago

pigmej commented 1 year ago

Rationale

Community members report that postdata_metadata.json was corrupted or empty after initialization. Some attribute this to running out of disk space, but other Users hit it without any disk space problems. So it seems that if PoS initialization fails for any reason, it breaks everything.

That means we need to:

  1. Properly handle failures during PoS initialization: do not leave files in an inconsistent state if possible.
  2. Since I/O problems can still occur, the Node should handle that case correctly. If postdata_metadata.json is corrupted or empty and there is no postdata_N.bin, it can just recreate everything. If some PoS data has already been generated, the case is more complicated and most likely needs the User's attention to decide what to do with it. For example, if we cannot write valid JSON or remove an inconsistent file because the User unplugged their external hard drive, the Node should not crash when no PoS data has been generated yet; it should just recreate everything.
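
The rule in point 2 can be sketched in Go roughly as follows. This is only an illustration of the decision described above, not the node's actual code: the `Action` type, `decide`, and `inspect` are hypothetical names, and the check treats any unreadable, empty, or non-JSON metadata file as invalid.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// Action is a hypothetical outcome of inspecting the PoS data directory.
type Action int

const (
	UseExisting Action = iota // metadata is valid, carry on as usual
	Recreate                  // nothing of value on disk, start from scratch
	AskUser                   // PoS data exists but its metadata is unusable
)

// decide encodes the rule from the issue: broken metadata with no
// postdata_N.bin files is safe to recreate; broken metadata next to
// existing data needs the User's decision.
func decide(metadataValid bool, dataFiles int) Action {
	switch {
	case metadataValid:
		return UseExisting
	case dataFiles == 0:
		return Recreate
	default:
		return AskUser
	}
}

// inspect looks at dir and applies decide. A missing, unreadable, empty,
// or unparsable postdata_metadata.json all count as "not valid" here.
func inspect(dir string) (Action, error) {
	raw, err := os.ReadFile(filepath.Join(dir, "postdata_metadata.json"))
	valid := err == nil && len(raw) > 0 && json.Valid(raw)

	files, err := filepath.Glob(filepath.Join(dir, "postdata_*.bin"))
	if err != nil {
		return AskUser, err
	}
	return decide(valid, len(files)), nil
}

func main() {
	action, err := inspect(".")
	fmt.Println(action, err)
}
```

A real implementation would also need to validate the metadata fields themselves, not just check that the file parses as JSON.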

brusherru commented 1 year ago

Went through one more case reported in Discord and updated the issue.

fasmat commented 1 year ago

  • the user lost the nonce found during initialization
  • they need to find a new nonce, which is a "not that easy" case.

Both of these have been addressed already with

A corrupted postdata_metadata.json file will be prevented in the future by making updates to the file atomic (part of https://github.com/spacemeshos/post/pull/211).
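
For readers unfamiliar with the technique, an atomic file update is commonly done by writing to a temporary file in the same directory, syncing it, and renaming it over the target. The sketch below shows that general pattern in Go; it is not taken from the linked PR, `writeFileAtomic` and `demo` are hypothetical names, and the JSON content is a placeholder.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// writeFileAtomic replaces path with data via a temp file and a rename.
// On POSIX filesystems the rename is atomic, so a crash leaves either the
// old file or the new one in place, not a half-written mix.
func writeFileAtomic(path string, data []byte) error {
	// The temp file must live in the same directory (same filesystem)
	// for the rename to be atomic.
	tmp, err := os.CreateTemp(filepath.Dir(path), filepath.Base(path)+".tmp-*")
	if err != nil {
		return err
	}
	// Cleans up after a failure; fails harmlessly once the rename succeeded.
	defer os.Remove(tmp.Name())

	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	// Flush the content to disk before it becomes visible under path.
	if err := tmp.Sync(); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), path)
}

// demo overwrites a metadata file twice and checks the final content.
func demo() error {
	dir, err := os.MkdirTemp("", "atomic-demo")
	if err != nil {
		return err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "postdata_metadata.json")
	for _, content := range []string{`{"v":1}`, `{"v":2}`} { // placeholder JSON
		if err := writeFileAtomic(path, []byte(content)); err != nil {
			return err
		}
	}
	got, err := os.ReadFile(path)
	if err != nil {
		return err
	}
	if string(got) != `{"v":2}` {
		return fmt.Errorf("unexpected content %q", got)
	}
	return nil
}

func main() {
	if err := demo(); err != nil {
		fmt.Println("demo failed:", err)
		return
	}
	fmt.Println("ok")
}
```

For full durability after power loss, the parent directory usually needs to be synced as well; that step is left out here to keep the sketch short.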

poszu commented 1 year ago

@fasmat, perhaps the change to atomic updates could be extracted from #211 into a separate PR. It is a trivial change and there is no point in letting other unrelated changes hold it back.

lrettig commented 1 year ago

This happened to me once; for the record, #193 is not a satisfactory workaround because it takes a very, very, very, very long time for large data sizes.

fasmat commented 1 year ago

@lrettig we just recently merged https://github.com/spacemeshos/post/pull/231 which will land in the node with the next version. This will prevent postdata_metadata.json from being deleted / corrupted if the node crashes at the wrong moment.

If however it is already missing, I don't see a better / faster way to regenerate the file. #193 should already be significantly faster than a re-init because it only needs to do one pass over the data to find the nonce again. This will take at most as long as generating a proof (so at most 12 hours if your node is set up to be able to generate a proof within the cycle gap).

poszu commented 1 year ago

@lrettig, it shouldn't take that long to find the lost VRF nonce. It's basically limited by disk read speed only.
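
If the search really is bound by disk read speed, a rough estimate of one pass is just data size divided by sequential read throughput. The Go sketch below works that out; the size and throughput in `main` are made-up example figures, not measurements, and `scanDuration` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"time"
)

// scanDuration estimates the time for one sequential pass over sizeBytes
// of data at a sustained read speed of bytesPerSec.
func scanDuration(sizeBytes, bytesPerSec float64) time.Duration {
	return time.Duration(sizeBytes / bytesPerSec * float64(time.Second))
}

func main() {
	const (
		mib = 1 << 20
		gib = 1 << 30
	)
	// Hypothetical example: 1 TiB of PoS data read at 100 MiB/s.
	d := scanDuration(1024*gib, 100*mib)
	fmt.Println("estimated single pass:", d.Round(time.Minute))
}
```

Actual times depend on the drive, on other load on the machine, and on any per-label work the search does.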

fasmat commented 1 year ago

With the atomic update of the postdata_metadata.json file now integrated in v1.1.6 of the node, I will close this issue.