Design Proposal: remove pub/sub pattern from internal clock

fasmat commented 1 year ago

Description

go-spacemesh has an internal clock that is used for multiple purposes, but most importantly to signal to other parts of the codebase when layers start and end and when a certain layer has been reached.

The problem with the current implementation is the Subscribe / Unsubscribe methods of the clock. These cause avoidable issues if instead the code would only use AwaitLayer and GetCurrentLayer.

Affected code

TimeClock
Ticker
LayerConverter
and possible other classes / interfaces in the timesync package

Issues with the current approach

The current approach has a few drawbacks:

The Pub / Sub design causes the clock to take longer to signal a layer change the more listeners subscribe to this signal. This can cause performance issues if many parts of the code subscribe to the clock or worse: do not unsubscribe if they are not interested in the signal any more.
If a listener subscribes and then isn't immediately ready to read the update via the channel it received from the subscription the signal it will cause the clock to skip sending the signal. This will cause the subscriber to wait reading on a channel that is never written to or closed leading to a deadlock.

Proposal

go-spacemesh needs the following features across its codebase that are time-related:

CurrentLayer this method needs to return the current layer of the network based on the system time. This might in some cases be not monotonic (i.e. when the system time was changed via NTP).
AwaitLayer this method returns a channel that is closed when the layer in question is reached. This again uses wall time (the system time of the computer) to determine which is the current layer. The returned channel is closed when the layer has been reached OR surpassed (e.g. the system was hibernating for multiple layers - AwaitLayer on all those layers will signal at the same time when the system wakes up). Code that should be executed in a specific layer and uses AwaitLayer should check again with CurrentLayer if the layer in question might have passed already.

Advantages and disadvantages with the proposed approach

AwaitLayer scales to an arbitrary number of listeners without causing performance issues: Only one close operation per layer is performed by the clock, independently of how many listeners await it.
AwaitLayer signals exactly once to all listeners, when the layer is reached or has been passed, The signal can be processed by listeners immediately or later without affecting the clock. Subscribe on the other hand tries to signal to multiple listeners at the same time. In the case of time jumps possibly multiple layers are signalled to multiple listeners. The clock has to skip signalling if a listener doesn't react to the signal immediately or it gets stuck, causing "missed tick" errors.
The returned channel from AwaitLayer signals that the layer has been reached OR passed. The caller needs to make sure that they know which layer they are in by having to call CurrentLayer again after receiving the signal.

countvonzero commented 1 year ago

can you describe/document the problems with the current approach?

fasmat commented 1 year ago

can you describe/document the problems with the current approach?

Updated 🙂

countvonzero commented 1 year ago

first of all, is this design proposal aiming to fix existing issue like #3909, or is this an improvement?

the reason i asked is that the original discussion was triggered by a bug report from testing skipping tick since we missed the time of the tick by more than the allowed threshold missed hare window, skipping layer

just by the description of this proposal, i don't see it will fix #3909. it rather spread this detection of "layer signal arriving too late" to all components that listens to layers.

the proposed design, in essence, pushes the responsibility to the clients to check for whether the layer is ticked at the right time. it's not hugely different from current pub/sub model, where the checking is centralized.

The Pub / Sub design causes the clock to take longer to signal a layer change the more listeners subscribe to this signal. This can cause performance issues if many parts of the code subscribe to the clock or worse: do not unsubscribe if they are not interested in the signal any more.

currently there are 3 components (beacon, miner, events) that calls Subscribe(). so i don't think performance is an issue here.

If a listener subscribes and then isn't immediately ready to read the update via the channel it received from the subscription the signal it will cause the clock to skip sending the signal. This will cause the subscriber to wait reading on a channel that is never written to or closed leading to a deadlock.

this is true. but the current code is using a non-blocking notify (select with default) when sending signal to channel. so i don't see this as a problem.

it seems the current implementation provides two ways of listening to a layer. AwaitLayer() and Subscribe(). and the callers of AwaitLayer() already checks for whether if it matches the layer it cares about, if at all. activation pkg mostly don't care. hare pkg cares and does check. block generator/certifier don't really care.

i am not entirely sure what the intention of this proposal.

fasmat commented 1 year ago

first of all, is this design proposal aiming to fix existing issue like https://github.com/spacemeshos/go-spacemesh/issues/3909, or is this an improvement?

Yes this issue will be fixed with this design proposal.

just by the description of this proposal, i don't see it will fix https://github.com/spacemeshos/go-spacemesh/issues/3909. it rather spread this detection of "layer signal arriving too late" to all components that listens to layers.

the proposed design, in essence, pushes the responsibility to the clients to check for whether the layer is ticked at the right time.

this is correct. The listener of the signal is the only one that can decide how to handle a missed layer. Here's an example:

Current layer 99, listener subscribes to clock to be notified when layer changes happen
Listener is busy processing layer 99
Clock wakes up and determines we are in layer 100, tries to send that information to to client
Client isn't ready to receive signal and clock determines a missed tick. Client is not informed again about a layer change until layer 101

The problem is that both the client and the clock must be ready to receive / send the signal or it will be dropped with "missed tick". The proposed change aims to remove this required synchronization between the clock and its listeners and shifts the responsibility of how to deal with a missed layer or late signal to the client of the clock.

Using AwaitLayer has various other small advantages over Subscribe:

No need to Unsubscribe.
The clock sends the signal in O(1) time instead of O(n) time where n is the number of subscribers
The code in the clock can be simplified if we remove Subscribe, while on the listening side only minor changes are necessary.

currently there are 3 components (beacon, miner, events) that calls Subscribe(). so i don't think performance is an issue here.

it seems the current implementation provides two ways of listening to a layer. AwaitLayer() and Subscribe(). and the callers of AwaitLayer() already checks for whether if it matches the layer it cares about, if at all. activation pkg mostly don't care. hare pkg cares and does check. block generator/certifier don't really care.

This is true, but the code keeps evolving and we have 2 ways of doing the same thing (being signalled about a layer change). Subscribe processes signals in O(n) time (and n might become bigger in the future) and AwaitLayer that does it in O(1) time. Additionally using Subscribe forces a certain handling of missed / skipped layers and isn't even guaranteed to signal the client, while AwaitLayer leaves that responsibility to the client and guarantees a signal.

noamnelke commented 1 year ago

I'll add that it must be the responsibility of the listener to handle late notifications, since every listener will need to handle these differently. Different listeners will have vastly different timeouts (e.g. days for ATXs, seconds for Hare). Some places might want to handle an awaited layer regardless of a timeout (e.g. consider a layer empty if no Hare consensus at some point), while publishing messages should not happen after a certain point.

It's also each listener's responsibility to decide how to handle missed events, so one place might need to handle layers one by one even if we suddenly discover that 100 layers have passed, while other places may want to just skip to the latest layer and handle everything in between in bulk.

poszu commented 1 year ago

Regarding the existing Ticker, I don't understand why it has the "500ms threshold" logic (aka "skipping tick since we missed the time of the tick by more than the allowed threshold") in the first place. Why is it needed?

The "non-blocking" notification has an additional drawback in that slow consumers might not be notified about the last layer if their channel is full. What is needed for it to work nicely, which unfortunately I couldn't find for Go, is something like https://docs.rs/tokio/1.25.0/tokio/sync/watch/index.html:

A single-producer, multi-consumer channel that only retains the last sent value.

fasmat commented 1 year ago

Regarding the existing Ticker, I don't understand why it has the "500ms threshold" logic (aka "skipping tick since we missed the time of the tick by more than the allowed threshold") in the first place. Why is it needed?

The specific part with the 500ms threshold will be removed (WiP: https://github.com/spacemeshos/go-spacemesh/pull/4025)

The "non-blocking" notification has an additional drawback in that slow consumers might not be notified about the last layer if their channel is full. What is needed for it to work nicely, which unfortunately I couldn't find for Go, is something like https://docs.rs/tokio/1.25.0/tokio/sync/watch/index.html:

The problem isn't a full channel but rather that the receiver might not currently be in the select block waiting for the signal and then it's dropped with the "missed tick" log message.

We could implement something similar to the watcher; this would then be our own chan type that rather than blocking on a send/receive retains the last value sent and returns that on receive. If AwaitLayer becomes insufficient for our needs in the future we can look at using such a construct instead, but at the moment I think AwaitLayer is the simpler approach that covers all current requirements to our clock.

spacemeshos / go-spacemesh