Proposal

Proposed status state machine:

[Diagram: PlatformStatus_rework (drawio)]

Implementation proposal:

Create a new PlatformStatusManager that accepts notifications of events which may affect platform status and pushes them onto a queue to be processed
The PlatformStatusManager will have a method addStatusEvent, which pushes a StatusEvent enum value onto the queue to be consumed by the manager
The StatusEvent enum will have the following values (more may be added in the future):
- STARTED_REPLAYING_EVENTS: when the node begins replaying events
- DONE_REPLAYING_EVENTS: when the node is done replaying events
- OWN_EVENT_REACHED_CONSENSUS: when a node observes an own event reaching consensus
- FREEZE_PERIOD_ENTERED: when a freeze timestamp is passed
- FALLEN_BEHIND: when a node determines that it has fallen behind
- RECONNECT_COMPLETE: when a reconnect is complete (node may or may not actually be caught up at this point)
- STATE_WRITTEN_TO_DISK: when a state is written to disk
- CATASTROPHIC_FAILURE: if something happens that the platform can't recover from
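
As a rough illustration of the enum and the addStatusEvent method described above, here is a minimal sketch. The enum values come from the list above; the class name PlatformStatusManagerSketch, the unbounded LinkedBlockingQueue, and the pollNextEvent helper are assumptions made for the example, not part of the proposal.

```java
import java.util.Objects;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Events that may affect platform status (values taken from the list above). */
enum StatusEvent {
    STARTED_REPLAYING_EVENTS,
    DONE_REPLAYING_EVENTS,
    OWN_EVENT_REACHED_CONSENSUS,
    FREEZE_PERIOD_ENTERED,
    FALLEN_BEHIND,
    RECONNECT_COMPLETE,
    STATE_WRITTEN_TO_DISK,
    CATASTROPHIC_FAILURE
}

/** Hypothetical sketch; the real manager's fields and threading are not specified here. */
class PlatformStatusManagerSketch {
    // Assumption: an unbounded FIFO queue, so producers never block.
    private final BlockingQueue<StatusEvent> queue = new LinkedBlockingQueue<>();

    /** May be called from any thread; it only enqueues the event. */
    void addStatusEvent(StatusEvent event) {
        queue.add(Objects.requireNonNull(event));
    }

    /** Hypothetical helper: returns the oldest pending event, or null if none is queued. */
    StatusEvent pollNextEvent() {
        return queue.poll();
    }
}
```

Callers only enqueue events; deciding what each event means for the status is left to whatever thread consumes the queue.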
It is important that the PlatformStatusManager can continue progressing through states even if consensus isn’t advancing
- To make this possible, the manager will use the idleCallback feature that is being added to the AbstractThreadConfiguration (see 05367-tipset-metric for a preview of how this will function)
- The idleCallback will allow the PlatformStatusManager to periodically observe the amount of time that has elapsed since specific occurrences, and to progress through the state machine accordingly
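
The idle-callback idea can be approximated with plain java.util.concurrent, as sketched below. This is not the AbstractThreadConfiguration API, whose exact shape is not given here; IdleAwareQueueLoop, runOnce, and the timeout parameter are hypothetical names used only to show the pattern of running a callback when no event arrives in time.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical stand-in for a queue thread that supports an idle callback. */
class IdleAwareQueueLoop<T> {
    private final BlockingQueue<T> queue = new LinkedBlockingQueue<>();
    private final Consumer<T> handler;
    private final Runnable idleCallback;
    private final long idleTimeoutMillis;

    IdleAwareQueueLoop(Consumer<T> handler, Runnable idleCallback, long idleTimeoutMillis) {
        this.handler = handler;
        this.idleCallback = idleCallback;
        this.idleTimeoutMillis = idleTimeoutMillis;
    }

    void add(T item) {
        queue.add(item);
    }

    /**
     * Runs one iteration: handles the next item if one arrives within the timeout,
     * otherwise invokes the idle callback. Returns true if an item was handled.
     */
    boolean runOnce() {
        T item;
        try {
            item = queue.poll(idleTimeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return false;
        }
        if (item == null) {
            idleCallback.run(); // nothing arrived: give the owner a chance to check elapsed time
            return false;
        }
        handler.accept(item);
        return true;
    }
}
```

With this shape, the status logic gets a chance to run on every timeout, so it can advance even when no events are being queued.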

Usage of Wall Clock Time

The following state transitions depend on an amount of wall clock time passing

OBSERVING -> CHECKING. Wall clock time is relevant for this transition so that the following edge case is covered:
- node X creates some events and gossips them out
- before these events are written to the preconsensus event stream (PCES), node X crashes
- when node X comes back online, it should wait for a set amount of wall clock time before creating new events, so that it can receive from its neighbors any self events that weren't written to the PCES before the crash
- if node X began creating events immediately after booting up, it could create a new event with the same parent as one of the pre-crash events, causing a branch. This should be avoided
OBSERVING -> FREEZING. Wall clock time is relevant in this case for the same reasons as detailed above
ACTIVE -> CHECKING. If a node observes a set amount of wall clock time elapsing without any own events reaching consensus, it should stop accepting app transactions until it sees own events reaching consensus again
- this must use wall clock time as opposed to consensus time, since the node ought to transition out of ACTIVE even if consensus time isn't advancing
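
The three time-based transitions above could be sketched roughly as follows. The class name, the two Duration settings, and the method names are hypothetical, and the CHECKING -> ACTIVE step on an own event reaching consensus is an assumption added so the example is complete. The current time is passed in explicitly so the logic can be exercised without waiting on a real clock.

```java
import java.time.Duration;
import java.time.Instant;

/** Only the statuses named in this section; the full state machine has more. */
enum PlatformStatusSketch { OBSERVING, CHECKING, FREEZING, ACTIVE }

class WallClockStatusSketch {
    private final Duration observingPeriod;   // hypothetical setting
    private final Duration activeStatusDelay; // hypothetical setting
    private PlatformStatusSketch status = PlatformStatusSketch.OBSERVING;
    private Instant statusStart;
    private Instant lastOwnEventConsensus;
    private boolean freezePeriodEntered;

    WallClockStatusSketch(Instant now, Duration observingPeriod, Duration activeStatusDelay) {
        this.statusStart = now;
        this.observingPeriod = observingPeriod;
        this.activeStatusDelay = activeStatusDelay;
    }

    void freezePeriodEntered() {
        freezePeriodEntered = true;
    }

    void ownEventReachedConsensus(Instant now) {
        lastOwnEventConsensus = now;
        if (status == PlatformStatusSketch.CHECKING) {
            setStatus(PlatformStatusSketch.ACTIVE, now); // assumption, not stated in this section
        }
    }

    /** Intended to be called periodically, for example from an idle callback. */
    void onTimeElapsed(Instant now) {
        if (status == PlatformStatusSketch.OBSERVING
                && Duration.between(statusStart, now).compareTo(observingPeriod) >= 0) {
            // The observing wait is over: go to FREEZING if a freeze was entered, else CHECKING.
            setStatus(freezePeriodEntered ? PlatformStatusSketch.FREEZING : PlatformStatusSketch.CHECKING, now);
        } else if (status == PlatformStatusSketch.ACTIVE
                && Duration.between(lastOwnEventConsensus, now).compareTo(activeStatusDelay) >= 0) {
            // Too long without an own event reaching consensus: leave ACTIVE.
            setStatus(PlatformStatusSketch.CHECKING, now);
        }
    }

    PlatformStatusSketch getStatus() {
        return status;
    }

    private void setStatus(PlatformStatusSketch newStatus, Instant now) {
        status = newStatus;
        statusStart = now;
    }
}
```

Because every check compares wall clock instants, the ACTIVE -> CHECKING transition fires even if consensus time never advances.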

Future Work

If a use case arises for a method that blocks until a given status is reached, one could be implemented
If a use case arises for a queue flush method, one could be implemented

hashgraph / hedera-services

Propose design for platform status management #6296

