Create a new PlatformStatusManager, which accepts notifications of events that may affect platform status and pushes them onto a queue to be processed
The PlatformStatusManager will have a method addStatusEvent, which pushes a StatusEvent enum value onto the queue to be consumed by the manager
The StatusEvent enum will have the following values (may be expanded in the future):
STARTED_REPLAYING_EVENTS: when the node begins replaying events
DONE_REPLAYING_EVENTS: when the node is done replaying events
OWN_EVENT_REACHED_CONSENSUS: when a node observes an own event reaching consensus
FREEZE_PERIOD_ENTERED: when a freeze timestamp is passed
FALLEN_BEHIND: when a node determines that it has fallen behind
RECONNECT_COMPLETE: when a reconnect is complete (node may or may not actually be caught up at this point)
STATE_WRITTEN_TO_DISK: when a state is written to disk
CATASTROPHIC_FAILURE: if something happens that the platform can't recover from
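The enum and the addStatusEvent method described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the queue type (LinkedBlockingQueue) and the pollNextEvent helper are hypothetical and are not part of the proposal.

```java
import java.util.Objects;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Events that may affect platform status (values taken from the list above). */
enum StatusEvent {
    STARTED_REPLAYING_EVENTS,
    DONE_REPLAYING_EVENTS,
    OWN_EVENT_REACHED_CONSENSUS,
    FREEZE_PERIOD_ENTERED,
    FALLEN_BEHIND,
    RECONNECT_COMPLETE,
    STATE_WRITTEN_TO_DISK,
    CATASTROPHIC_FAILURE
}

/** Sketch only: the queue type and the pollNextEvent helper are assumptions. */
class PlatformStatusManager {
    // Unbounded, thread-safe queue, so callers never block when reporting an event.
    private final BlockingQueue<StatusEvent> queue = new LinkedBlockingQueue<>();

    /** Called by any component that observes a status-relevant event. */
    public void addStatusEvent(StatusEvent event) {
        queue.add(Objects.requireNonNull(event));
    }

    /** Hypothetical helper: returns the next queued event, or null if none is pending. */
    StatusEvent pollNextEvent() {
        return queue.poll();
    }
}
```

Events are consumed in the order in which they were added, so a caller that reports STARTED_REPLAYING_EVENTS and then DONE_REPLAYING_EVENTS will see them processed in that order.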
It is important that the PlatformStatusManager can continue progressing through states even if consensus isn’t advancing
To make this possible, the manager will use the idleCallback feature that is being added to the AbstractThreadConfiguration (see 05367-tipset-metric for a preview of how this will function)
The idleCallback will allow the PlatformStatusManager to periodically observe the amount of time that has elapsed since specific occurrences, and to progress through the state machine accordingly
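To illustrate the idea of an idle callback, here is a minimal sketch of a queue worker that runs a callback whenever no item arrives within a timeout. The class IdleAwareQueueWorker and all of its members are hypothetical stand-ins; this is not the AbstractThreadConfiguration API, whose actual shape is not shown in this description.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical stand-in for a queue thread with an idle callback (illustration only). */
class IdleAwareQueueWorker<T> {
    private final BlockingQueue<T> queue = new LinkedBlockingQueue<>();
    private final Consumer<T> handler;
    private final Runnable idleCallback;
    private final long idleTimeoutMillis;

    IdleAwareQueueWorker(Consumer<T> handler, Runnable idleCallback, long idleTimeoutMillis) {
        this.handler = handler;
        this.idleCallback = idleCallback;
        this.idleTimeoutMillis = idleTimeoutMillis;
    }

    void add(T item) {
        queue.add(item);
    }

    /**
     * Performs one loop iteration: handles one item if it arrives within the timeout,
     * otherwise runs the idle callback. Returns true if an item was handled.
     */
    boolean runOnce() {
        final T item;
        try {
            item = queue.poll(idleTimeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            // Preserve the interrupt flag so the owning thread can shut down.
            Thread.currentThread().interrupt();
            return false;
        }
        if (item == null) {
            // Nothing arrived: give the owner a chance to check elapsed time.
            idleCallback.run();
            return false;
        }
        handler.accept(item);
        return true;
    }
}
```

With this shape, the manager's time-based checks still run periodically when the queue is empty, which is what allows status to progress while consensus is stalled.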
Usage of Wall Clock Time
The following state transitions depend on an amount of wall clock time passing
OBSERVING -> CHECKING. Wall clock time is relevant for this transition so that the following edge case is covered:
  node X creates some events and gossips them out
  before these events are written to the preconsensus event stream (PCES), node X crashes
  when node X comes back online, it should wait for an amount of wall clock time to elapse before creating new events, so that it can receive from its neighbors any self events that were not written to the PCES before the crash
  if node X were to begin creating events immediately after booting up, it could create a new event with the same parent as one of the pre-crash events, causing a branch. This should be avoided
OBSERVING -> FREEZING. Wall clock time is relevant in this case for the same reasons as detailed above
ACTIVE -> CHECKING. If a node observes a set amount of wall clock time elapsing without any of its own events reaching consensus, it should stop accepting app transactions until it sees its own events reaching consensus again
  this must use wall clock time as opposed to consensus time, since the node ought to transition out of ACTIVE even if consensus time isn't advancing
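The three wall-clock transitions above can be sketched as a small piece of state logic driven from the idle callback. Everything beyond the status names is an assumption made for illustration: the class name WallClockStatusLogic, the two duration settings, the rule that OBSERVING goes to FREEZING only if a freeze period was entered while observing, and the CHECKING -> ACTIVE transition on an own event reaching consensus. The current time is passed in explicitly so the logic can be exercised without waiting; a real caller would presumably pass Instant.now().

```java
import java.time.Duration;
import java.time.Instant;

/** Subset of statuses named in the transitions above. */
enum PlatformStatus { OBSERVING, CHECKING, FREEZING, ACTIVE }

/** Sketch only: names, settings, and some transition rules are assumptions. */
class WallClockStatusLogic {
    private final Duration observingPeriod;      // hypothetical setting
    private final Duration activeStatusTimeout;  // hypothetical setting

    private PlatformStatus status = PlatformStatus.OBSERVING;
    private final Instant observingStart;
    private Instant lastOwnEventConsensus;
    private boolean freezePeriodEntered;

    WallClockStatusLogic(Duration observingPeriod, Duration activeStatusTimeout, Instant now) {
        this.observingPeriod = observingPeriod;
        this.activeStatusTimeout = activeStatusTimeout;
        this.observingStart = now;
        this.lastOwnEventConsensus = now;
    }

    PlatformStatus getStatus() {
        return status;
    }

    void onFreezePeriodEntered() {
        freezePeriodEntered = true;
    }

    void onOwnEventReachedConsensus(Instant now) {
        lastOwnEventConsensus = now;
        if (status == PlatformStatus.CHECKING) {
            status = PlatformStatus.ACTIVE; // assumed transition
        }
    }

    /** Intended to be called periodically, even when no events arrive. */
    void onIdle(Instant now) {
        switch (status) {
            case OBSERVING:
                if (hasElapsed(observingStart, now, observingPeriod)) {
                    status = freezePeriodEntered ? PlatformStatus.FREEZING : PlatformStatus.CHECKING;
                }
                break;
            case ACTIVE:
                if (hasElapsed(lastOwnEventConsensus, now, activeStatusTimeout)) {
                    status = PlatformStatus.CHECKING;
                }
                break;
            default:
                break; // no wall-clock transitions sketched for other statuses
        }
    }

    private static boolean hasElapsed(Instant start, Instant now, Duration limit) {
        return Duration.between(start, now).compareTo(limit) >= 0;
    }
}
```

Because onIdle depends only on wall clock time, the node leaves ACTIVE after the timeout even if no consensus events are delivered at all.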
Future Work
If any use cases arise for a method that would block until a certain status has been transitioned to, one could be implemented
If any use cases arise for a queue flush method, one could be implemented
Proposal
Proposed status state machine:
Implementation proposal:
Approvers
Required
@cody-littley @lpetrovic05
Optional
@edward-swirldslabs @poulok