
Don't let state reconstruction starve pruning #3026

Open michaelsproul opened 2 years ago

michaelsproul commented 2 years ago

Description

Our Prater nodes running with --reconstruct-historic-states sometimes run out of disk space and die because of this caveat of state reconstruction:

> While reconstruction is running the node will temporarily pause migrating new data to the freezer database. This will lead to the database increasing in size temporarily (by a few GB per day) until state reconstruction completes.

Rather than requiring state reconstruction to run in one go, I think we should allow the background migrator to alternate between state reconstruction and pruning tasks. This will require a bit of a refactor of the `reconstruct_historic_states` function, perhaps passing in a maximum number of slots to reconstruct in one batch before returning. We might also have to track the reconstruction status in the `BackgroundMigrator`.
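
A minimal sketch of one way this could look. All names here (`Store`, `reconstruct_batch`, `prune_and_migrate`, `MAX_SLOTS_PER_BATCH`) are illustrative stand-ins, not Lighthouse's actual API: the point is just that the migrator runs one bounded reconstruction batch, then a pruning pass, and repeats until reconstruction reports completion.

```rust
// Hypothetical sketch only: alternate bounded reconstruction batches with
// pruning in the background migrator loop. Not Lighthouse's real types.

/// Maximum number of slots to reconstruct before yielding back to pruning.
const MAX_SLOTS_PER_BATCH: u64 = 1024;

/// Reconstruction progress tracked across batches (the "status" the
/// migrator would need to remember between runs).
#[derive(Debug)]
enum ReconstructionStatus {
    /// Next slot that still needs its state reconstructed.
    InProgress { next_slot: u64 },
    Complete,
}

struct Store {
    /// Lowest slot whose state has not yet been reconstructed.
    next_slot: u64,
    /// Slot at which reconstruction is finished.
    target_slot: u64,
}

impl Store {
    /// Reconstruct at most `max_slots` states, then return so the caller
    /// can run other maintenance (e.g. pruning/migration).
    fn reconstruct_batch(&mut self, max_slots: u64) -> ReconstructionStatus {
        let end = (self.next_slot + max_slots).min(self.target_slot);
        for slot in self.next_slot..end {
            // Replay blocks onto the previous state and write the result
            // to the freezer database (elided in this sketch).
            let _ = slot;
        }
        self.next_slot = end;
        if self.next_slot >= self.target_slot {
            ReconstructionStatus::Complete
        } else {
            ReconstructionStatus::InProgress { next_slot: self.next_slot }
        }
    }

    /// Stand-in for the migrator's normal pruning/migration work.
    fn prune_and_migrate(&mut self) {
        // Move finalized data to the freezer DB so disk use stays bounded.
    }
}

fn main() {
    let mut store = Store { next_slot: 0, target_slot: 4096 };
    // Alternate: one bounded reconstruction batch, then a pruning pass,
    // until reconstruction completes.
    loop {
        match store.reconstruct_batch(MAX_SLOTS_PER_BATCH) {
            ReconstructionStatus::Complete => break,
            ReconstructionStatus::InProgress { next_slot } => {
                println!("reconstructed up to slot {next_slot}");
            }
        }
        store.prune_and_migrate();
    }
    store.prune_and_migrate();
}
```

Having `reconstruct_batch` return a status enum, rather than looping internally until done, is what lets the caller interleave pruning, so freezer migration is never starved for longer than one batch.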

michaelsproul commented 2 years ago

Thanks @tthebst for starting to look into this.

Here are a few pointers that might be useful:

There are likely many approaches that would work here, and this is just an idea. You're welcome to implement any design that you think would be appropriate 😊

int88 commented 1 year ago

@michaelsproul Has this issue been fixed? If not, I'd like to try 😃

michaelsproul commented 1 year ago

@int88 It was implemented by @tthebst, but he abandoned the impl because he was running into strange database corruption issues similar to https://github.com/sigp/lighthouse/issues/3433 and https://github.com/sigp/lighthouse/issues/3455. So far I've never managed to reproduce those issues.

The commit is here if you're interested: https://github.com/sigp/lighthouse/pull/3206/commits/481e79289880b75e53cbfea1be07564b1b437323

That commit is adapted to work with tree-states (my mega-optimisation project that is currently broken), but we could backport it to unstable. I suspect the change doesn't meaningfully increase the chance of database corruption, as it was running fine on several tree-states nodes for several weeks.