Open · michaelsproul opened this issue 2 years ago
Thanks @tthebst for starting to look into this.
Here are a few pointers that might be useful:
One approach would be to push a new `Notification::Reconstruction` message at the end of `run_reconstruction` whenever the reconstruction function (`reconstruct_historic_states`) indicates that it has more work remaining. This would allow the background thread to interleave finalization processing and reconstruction. For example, if the message queue starts with `[Reconstruction, Finalization]`, we would process one reconstruction batch, push `Reconstruction` to the queue, then process the finalization. Now the queue would be `[Reconstruction]` and we would sit there running reconstruction batches until a new `Finalization` message arrived.

To make this work we'd have to remove the preferential treatment for `Reconstruction` in the message de-queue logic here, and maybe re-consider the behaviour where we drain the whole channel on every iteration. That was added as an optimisation for the case where we have `Finalization` notifications backed up and want to just process the most recent one. I think it would be possible to keep that optimisation, but I haven't thought through the mechanics in depth.

There are likely many approaches that would work here, and this is just an idea. You're welcome to implement any design that you think would be appropriate :blush:
@michaelsproul Has this issue been fixed? If not, I'd like to try 😃
@int88 It was implemented by @tthebst, but he abandoned the impl because he was running into strange database corruption issues similar to https://github.com/sigp/lighthouse/issues/3433 and https://github.com/sigp/lighthouse/issues/3455. So far I've never managed to reproduce those issues.
The commit is here if you're interested: https://github.com/sigp/lighthouse/pull/3206/commits/481e79289880b75e53cbfea1be07564b1b437323
That commit is adapted to work with `tree-states` (my mega-optimisation project that is currently broken), but we could backport it to `unstable`. I suspect the change doesn't meaningfully increase the chance of database corruption, as it was running fine on several `tree-states` nodes for several weeks.
Description
Our Prater nodes running with `--reconstruct-historic-states` sometimes run out of disk space and die because of this caveat of state reconstruction: pruning is blocked until reconstruction runs to completion, so disk usage keeps growing in the meantime.

Rather than requiring state reconstruction to run in one go, I think we should allow the background migrator to alternate between state reconstruction and pruning tasks. This will require a bit of a refactor of the `reconstruct_historic_states` function, perhaps passing in a maximum number of slots to reconstruct in one batch before returning. We might also have to track the reconstruction status in the `BackgroundMigrator`.
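As a rough, hedged sketch of what that refactor could look like (the parameter and field names such as `max_slots`, `resume_from_slot`, and `reconstruction_status` are assumptions, and the real Lighthouse signatures and error handling differ):

```rust
/// Result of one bounded reconstruction pass (assumed type, for illustration).
enum ReconstructionStatus {
    Complete,
    /// More slots remain; remember where to resume on the next batch.
    Incomplete { resume_from_slot: u64 },
}

/// Stub for a refactored `reconstruct_historic_states`: reconstruct at most
/// `max_slots` slots starting from `start_slot`, then return instead of
/// running to completion in one go.
fn reconstruct_historic_states(start_slot: u64, max_slots: u64) -> ReconstructionStatus {
    let end_slot = start_slot + max_slots;
    // ... a real implementation would rebuild states for `start_slot..end_slot`
    //     and report `Complete` once it reaches the hot/cold split ...
    ReconstructionStatus::Incomplete {
        resume_from_slot: end_slot,
    }
}

/// The migrator tracks reconstruction progress between batches so it can
/// interleave pruning (assumed field name).
struct BackgroundMigrator {
    reconstruction_status: Option<ReconstructionStatus>,
}

impl BackgroundMigrator {
    fn run_reconstruction_batch(&mut self) {
        let start_slot = match self.reconstruction_status {
            Some(ReconstructionStatus::Incomplete { resume_from_slot }) => resume_from_slot,
            _ => 0,
        };
        self.reconstruction_status = Some(reconstruct_historic_states(start_slot, 1024));
    }
}

fn main() {
    let mut migrator = BackgroundMigrator {
        reconstruction_status: None,
    };
    // Alternate: one reconstruction batch now, then any pending pruning work,
    // then the next batch, and so on.
    migrator.run_reconstruction_batch();
}
```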