
Don't let state reconstruction starve pruning #3026

Open michaelsproul opened 2 years ago

michaelsproul commented 2 years ago

Description

Our Prater nodes running with --reconstruct-historic-states sometimes run out of disk space and die because of this caveat of state reconstruction:

> While reconstruction is running the node will temporarily pause migrating new data to the freezer database. This will lead to the database increasing in size temporarily (by a few GB per day) until state reconstruction completes.

Rather than requiring state reconstruction to run in one go, I think we should allow the background migrator to alternate between state reconstruction and pruning tasks. This will require a bit of a refactor of the `reconstruct_historic_states` function, perhaps passing in a maximum number of slots to reconstruct in one batch before returning. We might also have to track the reconstruction status in the `BackgroundMigrator`.
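
A minimal sketch of one way this could look. All names here (`Store`, `reconstruct_batch`, `prune_and_migrate`, `MAX_SLOTS_PER_BATCH`) are illustrative stand-ins, not Lighthouse's actual API: the point is just that the migrator runs one bounded reconstruction batch, then a pruning pass, and repeats until reconstruction reports completion.

```rust
// Hypothetical sketch only: alternate bounded reconstruction batches with
// pruning in the background migrator loop. Not Lighthouse's real types.

/// Maximum number of slots to reconstruct before yielding back to pruning.
const MAX_SLOTS_PER_BATCH: u64 = 1024;

/// Reconstruction progress tracked across batches (the "status" the
/// migrator would need to remember between runs).
#[derive(Debug)]
enum ReconstructionStatus {
    /// Next slot that still needs its state reconstructed.
    InProgress { next_slot: u64 },
    Complete,
}

struct Store {
    /// Lowest slot whose state has not yet been reconstructed.
    next_slot: u64,
    /// Slot at which reconstruction is finished.
    target_slot: u64,
}

impl Store {
    /// Reconstruct at most `max_slots` states, then return so the caller
    /// can run other maintenance (e.g. pruning/migration).
    fn reconstruct_batch(&mut self, max_slots: u64) -> ReconstructionStatus {
        let end = (self.next_slot + max_slots).min(self.target_slot);
        for slot in self.next_slot..end {
            // Replay blocks onto the previous state and write the result
            // to the freezer database (elided in this sketch).
            let _ = slot;
        }
        self.next_slot = end;
        if self.next_slot >= self.target_slot {
            ReconstructionStatus::Complete
        } else {
            ReconstructionStatus::InProgress { next_slot: self.next_slot }
        }
    }

    /// Stand-in for the migrator's normal pruning/migration work.
    fn prune_and_migrate(&mut self) {
        // Move finalized data to the freezer DB so disk use stays bounded.
    }
}

fn main() {
    let mut store = Store { next_slot: 0, target_slot: 4096 };
    // Alternate: one bounded reconstruction batch, then a pruning pass,
    // until reconstruction completes.
    loop {
        match store.reconstruct_batch(MAX_SLOTS_PER_BATCH) {
            ReconstructionStatus::Complete => break,
            ReconstructionStatus::InProgress { next_slot } => {
                println!("reconstructed up to slot {next_slot}");
            }
        }
        store.prune_and_migrate();
    }
    store.prune_and_migrate();
}
```

Having `reconstruct_batch` return a status enum, rather than looping internally until done, is what lets the caller interleave pruning, so freezer migration is never starved for longer than one batch.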

michaelsproul commented 2 years ago

Thanks @tthebst for starting to look into this.

Here are a few pointers that might be useful:

There are likely many approaches that would work here, and this is just an idea. You're welcome to implement any design that you think would be appropriate 😊

int88 commented 1 year ago

@michaelsproul Has this issue been fixed? If not, I'd like to try 😃

michaelsproul commented 1 year ago

@int88 It was implemented by @tthebst, but he abandoned the impl because he was running into strange database corruption issues similar to https://github.com/sigp/lighthouse/issues/3433 and https://github.com/sigp/lighthouse/issues/3455. So far I've never managed to reproduce those issues.

The commit is here if you're interested: https://github.com/sigp/lighthouse/pull/3206/commits/481e79289880b75e53cbfea1be07564b1b437323

That commit is adapted to work with tree-states (my mega-optimisation project that is currently broken), but we could backport it to unstable. I suspect the change doesn't meaningfully increase the chance of database corruption, as it was running fine on several tree-states nodes for several weeks.