Closed fasmat closed 1 year ago
can you describe/document the problems with the current approach?
> can you describe/document the problems with the current approach?
Updated 🙂
first of all, is this design proposal aiming to fix an existing issue like #3909, or is this an improvement?
the reason i ask is that the original discussion was triggered by a bug report from testing:
skipping tick since we missed the time of the tick by more than the allowed threshold
missed hare window, skipping layer
just by the description of this proposal, i don't see how it will fix #3909. it rather spreads the detection of "layer signal arriving too late" to all components that listen to layers.
the proposed design, in essence, pushes the responsibility to the clients to check whether the layer is ticked at the right time. it's not hugely different from the current pub/sub model, where the checking is centralized.
> The Pub / Sub design causes the clock to take longer to signal a layer change the more listeners subscribe to this signal. This can cause performance issues if many parts of the code subscribe to the clock or worse: do not unsubscribe if they are not interested in the signal any more.
currently there are 3 components (beacon, miner, events) that call Subscribe(), so i don't think performance is an issue here.
> If a listener subscribes and then isn't immediately ready to read the update via the channel it received from the subscription, it will cause the clock to skip sending the signal. This will cause the subscriber to wait reading on a channel that is never written to or closed, leading to a deadlock.
this is true. but the current code uses a non-blocking notify (select with default) when sending the signal to the channel, so i don't see this as a problem.
it seems the current implementation provides two ways of listening to a layer: AwaitLayer() and Subscribe(). the callers of AwaitLayer() already check whether it matches the layer they care about, if at all. the activation pkg mostly doesn't care. the hare pkg cares and does check. the block generator/certifier don't really care.
i am not entirely sure what the intention of this proposal is.
> first of all, is this design proposal aiming to fix an existing issue like https://github.com/spacemeshos/go-spacemesh/issues/3909, or is this an improvement?
Yes, this issue will be fixed with this design proposal.
> just by the description of this proposal, i don't see how it will fix https://github.com/spacemeshos/go-spacemesh/issues/3909. it rather spreads the detection of "layer signal arriving too late" to all components that listen to layers.
> the proposed design, in essence, pushes the responsibility to the clients to check whether the layer is ticked at the right time.
this is correct. The listener of the signal is the only one that can decide how to handle a missed layer. Here's an example:
The problem is that both the client and the clock must be ready to receive / send the signal or it will be dropped with "missed tick". The proposed change aims to remove this required synchronization between the clock and its listeners and shifts the responsibility of how to deal with a missed layer or late signal to the client of the clock.
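A minimal sketch of the dropped-signal case described above. It assumes an unbuffered subscription channel and a non-blocking send (select with default); `notify` is a hypothetical helper for illustration, not the actual go-spacemesh code:

```go
package main

import (
	"fmt"
	"runtime"
)

// notify mimics a non-blocking publish (hypothetical helper): the layer is
// delivered only if the channel can accept it right now, otherwise it is dropped.
func notify(ch chan<- int, layer int) bool {
	select {
	case ch <- layer:
		return true
	default:
		return false // nobody could take the value at this instant: "missed tick"
	}
}

func main() {
	sub := make(chan int) // unbuffered subscription channel (assumption)

	// The subscriber has not reached its receive yet, so the signal is dropped.
	fmt.Println("no receiver waiting, delivered:", notify(sub, 10))

	// Only once the subscriber is blocked in its receive does the send succeed.
	got := make(chan int)
	go func() { got <- <-sub }()
	for !notify(sub, 11) {
		runtime.Gosched() // give the subscriber a chance to reach its receive
	}
	fmt.Println("receiver waiting, received layer:", <-got)
}
```

The first call shows why clock and listener have to be in sync: the sender cannot tell a slow listener from an absent one, so the only thing it can do is drop the signal.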
Using AwaitLayer has various other small advantages over Subscribe:
Unsubscribe.
> currently there are 3 components (beacon, miner, events) that call Subscribe(), so i don't think performance is an issue here.
> it seems the current implementation provides two ways of listening to a layer: AwaitLayer() and Subscribe(). the callers of AwaitLayer() already check whether it matches the layer they care about, if at all. the activation pkg mostly doesn't care. the hare pkg cares and does check. the block generator/certifier don't really care.
This is true, but the code keeps evolving and we have 2 ways of doing the same thing (being signalled about a layer change): Subscribe, which processes signals in O(n) time (and n might become bigger in the future), and AwaitLayer, which does it in O(1) time. Additionally, using Subscribe forces a certain handling of missed / skipped layers and isn't even guaranteed to signal the client, while AwaitLayer leaves that responsibility to the client and guarantees a signal.
I'll add that it must be the responsibility of the listener to handle late notifications, since every listener will need to handle these differently. Different listeners will have vastly different timeouts (e.g. days for ATXs, seconds for Hare). Some places might want to handle an awaited layer regardless of a timeout (e.g. consider a layer empty if no Hare consensus at some point), while publishing messages should not happen after a certain point.
It's also each listener's responsibility to decide how to handle missed events, so one place might need to handle layers one by one even if we suddenly discover that 100 layers have passed, while other places may want to just skip to the latest layer and handle everything in between in bulk.
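To make the two points above concrete, here is a rough sketch of the policies a listener might pick after waking up late. All names (`layersOneByOne`, `latestOnly`, `tooLate`) and the timeout values are hypothetical and only illustrate the idea:

```go
package main

import (
	"fmt"
	"time"
)

// layersOneByOne returns every layer in (last, current], for a listener that
// has to process each missed layer individually.
func layersOneByOne(last, current uint32) []uint32 {
	var out []uint32
	for l := last; l < current; l++ {
		out = append(out, l+1)
	}
	return out
}

// latestOnly is for a listener that jumps straight to the current layer; it
// also reports how many layers in between were skipped.
func latestOnly(last, current uint32) (layer, skipped uint32) {
	if current <= last {
		return last, 0
	}
	return current, current - last - 1
}

// tooLate reports whether a notification arrived after the listener's own
// deadline; each listener brings its own timeout.
func tooLate(layerStart, now time.Time, timeout time.Duration) bool {
	return now.Sub(layerStart) > timeout
}

func main() {
	last, current := uint32(5), uint32(8)
	fmt.Println("one by one:", layersOneByOne(last, current))

	layer, skipped := latestOnly(last, current)
	fmt.Println("latest only:", layer, "skipped:", skipped)

	start := time.Unix(0, 0)
	now := start.Add(30 * time.Second)
	fmt.Println("short timeout (5s) exceeded:", tooLate(start, now, 5*time.Second))
	fmt.Println("long timeout (24h) exceeded:", tooLate(start, now, 24*time.Hour))
}
```

None of these decisions can be made by the clock itself, which is the argument for leaving them to the listener.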
Regarding the existing Ticker, I don't understand why it has the "500ms threshold" logic (aka "skipping tick since we missed the time of the tick by more than the allowed threshold") in the first place. Why is it needed?
The "non-blocking" notification has an additional drawback in that slow consumers might not be notified about the last layer if their channel is full. What is needed for it to work nicely, which unfortunately I couldn't find for Go, is something like https://docs.rs/tokio/1.25.0/tokio/sync/watch/index.html:
A single-producer, multi-consumer channel that only retains the last sent value.
> Regarding the existing Ticker, I don't understand why it has the "500ms threshold" logic (aka "skipping tick since we missed the time of the tick by more than the allowed threshold") in the first place. Why is it needed?
The specific part with the 500ms threshold will be removed (WiP: https://github.com/spacemeshos/go-spacemesh/pull/4025)
> The "non-blocking" notification has an additional drawback in that slow consumers might not be notified about the last layer if their channel is full. What is needed for it to work nicely, which unfortunately I couldn't find for Go, is something like https://docs.rs/tokio/1.25.0/tokio/sync/watch/index.html:
The problem isn't a full channel but rather that the receiver might not currently be in the select block waiting for the signal, in which case the signal is dropped with the "missed tick" log message.
We could implement something similar to the watcher; this would then be our own chan type that, rather than blocking on a send/receive, retains the last value sent and returns that on receive. If AwaitLayer becomes insufficient for our needs in the future we can look at using such a construct instead, but at the moment I think AwaitLayer is the simpler approach that covers all current requirements of our clock.
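For reference, a rough sketch of what such a "retain only the last value" construct could look like in Go. The `Watch` type and its methods are hypothetical names used for illustration; this is neither an existing library nor part of the proposal:

```go
package main

import (
	"fmt"
	"sync"
)

// Watch keeps only the most recently sent value (hypothetical type).
type Watch struct {
	mu      sync.Mutex
	value   uint32
	version uint64
	changed chan struct{} // closed and replaced on every Send
}

func NewWatch(initial uint32) *Watch {
	return &Watch{value: initial, changed: make(chan struct{})}
}

// Send stores v as the latest value and wakes everyone waiting for a change.
// It never waits for receivers, so a slow consumer cannot stall the sender.
func (w *Watch) Send(v uint32) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.value = v
	w.version++
	close(w.changed)
	w.changed = make(chan struct{})
}

// Latest returns the latest value, its version, and a channel that is closed
// on the next Send.
func (w *Watch) Latest() (uint32, uint64, <-chan struct{}) {
	w.mu.Lock()
	defer w.mu.Unlock()
	return w.value, w.version, w.changed
}

func main() {
	w := NewWatch(0)

	// Nobody is receiving while these are sent; only the last one is retained.
	w.Send(1)
	w.Send(2)
	w.Send(3)

	v, version, changed := w.Latest()
	fmt.Println("latest value:", v, "version:", version)

	go w.Send(4)
	<-changed // wakes up once the next value has been sent
	v, _, _ = w.Latest()
	fmt.Println("after change:", v)
}
```

A consumer that was busy while layers 1 and 2 were sent simply sees 3 on its next read, so nothing is reported as missed and the sender never has to skip.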
Description
go-spacemesh has an internal clock that is used for multiple purposes, but most importantly to signal to other parts of the codebase when layers start and end and when a certain layer has been reached.
The problem with the current implementation is the Subscribe / Unsubscribe methods of the clock. These cause issues that could be avoided if the code only used AwaitLayer and GetCurrentLayer.
Affected code
TimeClock
Ticker
LayerConverter
timesync package
Issues with the current approach
The current approach has a few drawbacks:
Proposal
go-spacemesh needs the following features across its codebase that are time-related:
CurrentLayer: this method needs to return the current layer of the network based on the system time. In some cases this might not be monotonic (e.g. when the system time was changed via NTP).
AwaitLayer: this method returns a channel that is closed when the layer in question is reached. This again uses wall time (the system time of the computer) to determine the current layer. The returned channel is closed when the layer has been reached OR surpassed (e.g. if the system was hibernating for multiple layers, AwaitLayer on all those layers will signal at the same time when the system wakes up). Code that should be executed in a specific layer and uses AwaitLayer should check again with CurrentLayer whether the layer in question has already passed.
Advantages and disadvantages with the proposed approach
AwaitLayer scales to an arbitrary number of listeners without causing performance issues: only one close operation per layer is performed by the clock, independently of how many listeners await it.
AwaitLayer signals exactly once to all listeners, when the layer is reached or has been passed. The signal can be processed by listeners immediately or later without affecting the clock.
Subscribe, on the other hand, tries to signal to multiple listeners at the same time. In the case of time jumps, possibly multiple layers are signalled to multiple listeners. The clock has to skip signalling if a listener doesn't react to the signal immediately, or it gets stuck, causing "missed tick" errors.
AwaitLayer signals that the layer has been reached OR passed. The caller needs to make sure that they know which layer they are in by calling CurrentLayer again after receiving the signal.
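The behaviour described above could be sketched roughly as follows. This is a simplified, hypothetical `Clock`, not the actual go-spacemesh types: the `advance` method stands in for the part that would derive the layer from wall time, and `LayerID` is assumed to be a plain integer.

```go
package main

import (
	"fmt"
	"sync"
)

type LayerID uint32

// Clock is a simplified stand-in for the node clock (hypothetical).
type Clock struct {
	mu      sync.Mutex
	current LayerID
	waiters map[LayerID]chan struct{} // one shared channel per awaited layer
}

func NewClock() *Clock {
	return &Clock{waiters: make(map[LayerID]chan struct{})}
}

func (c *Clock) CurrentLayer() LayerID {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.current
}

// AwaitLayer returns a channel that is closed once layer is reached or passed.
func (c *Clock) AwaitLayer(layer LayerID) <-chan struct{} {
	c.mu.Lock()
	defer c.mu.Unlock()
	if layer <= c.current {
		done := make(chan struct{})
		close(done) // already reached: the caller is signalled immediately
		return done
	}
	ch, ok := c.waiters[layer]
	if !ok {
		ch = make(chan struct{})
		c.waiters[layer] = ch
	}
	return ch
}

// advance sets the current layer and closes the channel of every awaited
// layer that is now reached or passed: one close per layer, however many
// listeners share that channel.
func (c *Clock) advance(to LayerID) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.current = to
	for layer, ch := range c.waiters {
		if layer <= to {
			close(ch)
			delete(c.waiters, layer)
		}
	}
}

func main() {
	clock := NewClock()
	five, seven := clock.AwaitLayer(5), clock.AwaitLayer(7)

	clock.advance(9) // e.g. the machine wakes up from hibernation in layer 9

	<-five
	<-seven // both signal at the same time
	if now := clock.CurrentLayer(); now > 5 {
		fmt.Println("woke up for layer 5, but the current layer is", now)
	}
}
```

The listener, not the clock, decides what a late wake-up means, which is why the check against CurrentLayer sits on the caller's side.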