start beacon sync before target epoch starts

countvonzero commented 1 year ago

problem statement

quotes from @dshulyak

the beacon implementation needs some tweaks to enable node updates (not protocol upgrades, just simple
software updates). if it is expected to run for weeks and all state is in memory, then restarting
a node doesn't seem that simple. i also don't know if the protocol will be able to finish if some of the
critical messages are lost during a restart

to elaborate a bit
1. if you restart a node in the middle of the beacon protocol, this node won't be able to complete
the beacon. this can be fixed by "gracefully" dumping the state of the protocol before shutting down the node
2. next concern: a node can be restarted during critical rounds (maybe proposals, first round of voting),
so recovery from disk seems questionable as well, and definitely hard to make robust
i personally don't know what will happen with the beacon protocol state if most of the network does
a rolling update over, let's say, 2 days
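a minimal sketch of the "graceful dump" idea from item 1, assuming the in-memory state can be serialized as one snapshot. the `protocolState` fields, the JSON encoding and the file name are all hypothetical and are not the go-spacemesh layout:

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// protocolState is a hypothetical snapshot of the in-memory beacon state.
// The field set is illustrative only.
type protocolState struct {
	Epoch     uint32          `json:"epoch"`
	Round     uint32          `json:"round"`
	Proposals []string        `json:"proposals"`
	Votes     map[string]bool `json:"votes"`
}

// saveState writes the snapshot to a temp file and renames it into place,
// so a crash mid-write does not leave a truncated state file behind.
func saveState(path string, s protocolState) error {
	data, err := json.Marshal(s)
	if err != nil {
		return fmt.Errorf("marshal state: %w", err)
	}
	tmp := path + ".tmp"
	if err := os.WriteFile(tmp, data, 0o600); err != nil {
		return fmt.Errorf("write temp file: %w", err)
	}
	return os.Rename(tmp, path)
}

// loadState reads the snapshot back after a restart.
func loadState(path string) (protocolState, error) {
	var s protocolState
	data, err := os.ReadFile(path)
	if err != nil {
		return s, fmt.Errorf("read state: %w", err)
	}
	if err := json.Unmarshal(data, &s); err != nil {
		return s, fmt.Errorf("unmarshal state: %w", err)
	}
	return s, nil
}

// roundTrip saves a snapshot into a fresh temp dir and loads it back.
func roundTrip(s protocolState) (protocolState, error) {
	dir, err := os.MkdirTemp("", "beacon-state")
	if err != nil {
		return protocolState{}, err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "beacon.json")
	if err := saveState(path, s); err != nil {
		return protocolState{}, err
	}
	return loadState(path)
}

func main() {
	in := protocolState{Epoch: 5, Round: 3, Votes: map[string]bool{"p1": true}}
	got, err := roundTrip(in)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("restored epoch=%d round=%d votes=%d\n", got.Epoch, got.Round, len(got.Votes))
}
```

this only covers item 1. it does nothing about item 2: messages gossiped while the node was down are still missed.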

research meeting conclusion from Jan 25, 2023

short term

record the next beacon value in proposals (not ballots) so that beacon value can be propagated as soon as it's generated. this helps in the software rolling update case.
in the extreme case where the whole network needs to be restarted, and that restart will interrupt the current beacon protocol, we specify a spacemesh sanctioned beacon value for the target epoch of the interrupted beacon protocol.
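a rough sketch of how a sanctioned value could take precedence over a locally computed one. the 4-byte `Beacon` size, the hex encoding and the per-epoch override map are assumptions made for illustration, not an existing config format:

```go
package main

import (
	"encoding/hex"
	"fmt"
)

const beaconSize = 4 // assumed size for this sketch

type Beacon [beaconSize]byte

// parseBeacon decodes a hex string, as it might appear in a config shipped
// with a release (hypothetical format).
func parseBeacon(s string) (Beacon, error) {
	raw, err := hex.DecodeString(s)
	if err != nil {
		return Beacon{}, fmt.Errorf("decode beacon: %w", err)
	}
	if len(raw) != beaconSize {
		return Beacon{}, fmt.Errorf("beacon must be %d bytes, got %d", beaconSize, len(raw))
	}
	var b Beacon
	copy(b[:], raw)
	return b, nil
}

// resolveBeacon prefers the sanctioned value for an epoch and falls back to
// whatever the node computed or synced itself.
func resolveBeacon(epoch uint32, sanctioned, computed map[uint32]Beacon) (Beacon, bool) {
	if b, ok := sanctioned[epoch]; ok {
		return b, true
	}
	b, ok := computed[epoch]
	return b, ok
}

func main() {
	s, err := parseBeacon("deadbeef")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	sanctioned := map[uint32]Beacon{7: s}
	computed := map[uint32]Beacon{7: {1, 1, 1, 1}, 8: {2, 2, 2, 2}}
	for epoch := uint32(7); epoch <= 9; epoch++ {
		b, ok := resolveBeacon(epoch, sanctioned, computed)
		fmt.Printf("epoch %d: beacon=%x known=%v\n", epoch, b[:], ok)
	}
}
```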

long term

the network enables syncing of beacon messages from peers. the volume of the data can be huge, so maybe only archival nodes will save and provide syncing of beacon messages.

dshulyak commented 1 year ago

record the next beacon value in proposals (not ballots) so that beacon value can be propagated as soon as it's generated. this helps in the software rolling update case.

i don't understand this item: the ballot is encapsulated in the proposal. how does moving the field one layer up make any difference?

in the extreme case where the whole network needs to be restarted

this is not an extreme case. let's say there is a new version that has some performance improvements. how do you suggest we make the upgrade? what if we do 5 releases over the first epoch?

basically what this issue says to me is that beacon will be specified manually, i am ok with that

countvonzero commented 1 year ago

i don't understand this item: the ballot is encapsulated in the proposal. how does moving the field one layer up make any difference?

not changing ballot at all. proposals will carry a field next_beacon in addition to the beacon value in ballot epoch data.
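to make the shape concrete, a simplified sketch with hypothetical types (these are not the real go-spacemesh structs, and the 4-byte `Beacon` is an assumption). the ballot keeps its own epoch's beacon in the epoch data, and the proposal wrapper optionally carries the value for the following epoch:

```go
package main

import "fmt"

// Beacon is assumed to be 4 bytes for this sketch.
type Beacon [4]byte

// EpochData holds the beacon for the ballot's own epoch (unchanged).
type EpochData struct {
	Beacon Beacon
}

// Ballot is left untouched by the change.
type Ballot struct {
	Epoch     uint32
	EpochData *EpochData
}

// Proposal wraps the ballot and adds the new optional field.
// NextBeacon stays nil until the proposer has a beacon for Epoch+1.
type Proposal struct {
	Ballot     Ballot
	NextBeacon *Beacon
}

// nextBeacon returns the target epoch and the advertised beacon, if any.
func nextBeacon(p Proposal) (uint32, Beacon, bool) {
	if p.NextBeacon == nil {
		return 0, Beacon{}, false
	}
	return p.Ballot.Epoch + 1, *p.NextBeacon, true
}

func main() {
	nb := Beacon{0xde, 0xad, 0xbe, 0xef}
	p := Proposal{
		Ballot:     Ballot{Epoch: 4, EpochData: &EpochData{Beacon: Beacon{1, 2, 3, 4}}},
		NextBeacon: &nb,
	}
	if epoch, b, ok := nextBeacon(p); ok {
		fmt.Printf("proposal advertises beacon %x for epoch %d\n", b[:], epoch)
	}
}
```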

this is not an extreme case. let's say there is a new version that has some performance improvements. how do you suggest we make the upgrade? what if we do 5 releases over the first epoch?

ideally we should do a rolling upgrade, which allows some nodes to complete the beacon protocol and use proposals to propagate the result through the network.

if we expect that the upgrade will interrupt the beacon protocol somehow, as you stated, we will manually set the beacon with the software upgrade.

dshulyak commented 1 year ago

ideally we should do a rolling upgrade.

but a rolling upgrade within which period? assuming the beacon runs for 2 weeks, everyone will lose the beacon in-memory state once during the rolling period. or will rolling be over longer periods?

and also rolling is only relevant for the nodes managed by the spacemesh infra, other nodes are not controlled. but i think it doesn't even matter

countvonzero commented 1 year ago

and also rolling is only relevant for the nodes managed by the spacemesh infra, other nodes are not controlled. but i think it doesn't even matter

right. by design the beacon protocol finishes before the epoch ends, so the protocol can finish maybe 1 day before the 2-week epoch ends. a rolling update works if we can ensure that enough managed nodes remain online for this period. or we can do the rolling update only between [protocol N end, protocol N+1 start].

if not, we roll out new software with a specified beacon value, which is probably the easiest thing to do every time at the beginning.
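the [protocol N end, protocol N+1 start] window can be checked with a bit of arithmetic. this sketch assumes the next protocol run starts exactly at the epoch boundary and that a run takes a fixed duration from the start of the epoch. both are simplifications, and `canUpgradeAt` is a made-up helper:

```go
package main

import (
	"fmt"
	"time"
)

const day = 24 * time.Hour

// canUpgradeAt reports whether a restart at the given offset from the epoch
// start falls after the beacon protocol has finished and before the epoch
// ends (assumed here to be when the next protocol run begins).
func canUpgradeAt(sinceEpochStart, epochLen, protocolDur time.Duration) bool {
	return sinceEpochStart >= protocolDur && sinceEpochStart < epochLen
}

func main() {
	epochLen := 14 * day    // 2-week epoch
	protocolDur := 13 * day // protocol finishes 1 day before the epoch ends

	for _, offset := range []time.Duration{5 * day, 13*day + 12*time.Hour, 14 * day} {
		fmt.Printf("restart at %v into the epoch: safe=%v\n",
			offset, canUpgradeAt(offset, epochLen, protocolDur))
	}
}
```

with these numbers the window is one day long, so the whole rolling update would have to fit inside it.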

dshulyak commented 1 year ago

i don't think that this plan is good enough, or maybe i just don't understand it.

if the next_beacon field is simply extracted from a proposal, how do you know which one is honest? what if a dishonest one was received first? do we also count space units from proposals?

even the long term plan relies on a rolling upgrade that takes > 2 weeks, how is that a reasonable assumption?

maybe this idea will be relevant:

extract beacon into a separately managed service that has the following requirements
- implements interface
  - version() Version
  - run(epoch, data) // all required data must be passed here, if interface is not called - service doesn't run
  - outputs() Stream[(epoch, beacon)] // streams epoch, beacon tuples
  - stop() // stop the service with exit(0), exit code must be tracked by maintainer to prevent services restart
- needs to have access to the gossip protocol. libp2p pubsub technically allows setting up a mesh for a specific protocol
how does it help with upgrades?
- core (whatever that is) is upgradable separately, without interrupting the beacon
- core decides which version of the beacon to use in the next epoch and stops the previous beacon version once result is received from a stream
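the interface above could look roughly like this in go. this is a sketch, not a proposal for the final API: the type names, the `[]byte` payload, the 4-byte `Beacon` and the in-process `fakeService` are assumptions, and `Stop` only flips a flag here instead of exiting a separate process with exit(0):

```go
package main

import (
	"errors"
	"fmt"
)

type Version string

// Beacon is assumed to be 4 bytes for this sketch.
type Beacon [4]byte

// EpochBeacon is one (epoch, beacon) tuple from the output stream.
type EpochBeacon struct {
	Epoch  uint32
	Beacon Beacon
}

// BeaconService mirrors version/run/outputs/stop from the list above.
type BeaconService interface {
	Version() Version
	Run(epoch uint32, data []byte) error // all required data is passed in
	Outputs() <-chan EpochBeacon         // stream of (epoch, beacon) tuples
	Stop()
}

// fakeService is an in-process stand-in used only to exercise the interface.
type fakeService struct {
	version Version
	out     chan EpochBeacon
	stopped bool
}

var _ BeaconService = (*fakeService)(nil)

func newFakeService(v Version) *fakeService {
	// Buffer of 1: a second Run blocks until the first result is consumed.
	return &fakeService{version: v, out: make(chan EpochBeacon, 1)}
}

func (f *fakeService) Version() Version { return f.version }

// Run pretends the protocol finished at once and derives the "beacon"
// from the first bytes of data.
func (f *fakeService) Run(epoch uint32, data []byte) error {
	if f.stopped {
		return errors.New("beacon service is stopped")
	}
	var b Beacon
	copy(b[:], data)
	f.out <- EpochBeacon{Epoch: epoch, Beacon: b}
	return nil
}

func (f *fakeService) Outputs() <-chan EpochBeacon { return f.out }

func (f *fakeService) Stop() { f.stopped = true }

// collectAndStop is what "core" would do: wait for the result from the
// stream, then stop that beacon version.
func collectAndStop(svc BeaconService) EpochBeacon {
	res := <-svc.Outputs()
	svc.Stop()
	return res
}

func main() {
	svc := newFakeService("v1")
	if err := svc.Run(7, []byte{1, 2, 3, 4}); err != nil {
		fmt.Println("error:", err)
		return
	}
	res := collectAndStop(svc)
	fmt.Printf("version %s produced beacon %x for epoch %d\n", svc.Version(), res.Beacon[:], res.Epoch)
}
```

core can then be restarted or upgraded without touching the running service, as long as it reattaches to the output stream.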

countvonzero commented 1 year ago

if the next_beacon field is simply extracted from a proposal, how do you know which one is honest? what if a dishonest one was received first? do we also count space units from proposals?

this would work the same way as the current beacon sync model where we sample enough weight from the network. the threat-model is the same as getting the beacon value for current epoch from the ballots at the beginning of the epoch. the idea is to start the beacon sync early (as soon as the beacon protocol finishes).
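a sketch of the weight-sampling idea, under assumptions: reports are counted once per smesher, sampling stops when the total weight reaches a threshold, and the beacon backed by the most weight wins. the names, the tie-break and the exact decision rule are illustrative and may differ from the real beacon sync code:

```go
package main

import (
	"bytes"
	"fmt"
)

// Beacon is assumed to be 4 bytes for this sketch.
type Beacon [4]byte

// beaconSync accumulates reported beacons until enough weight was sampled.
type beaconSync struct {
	threshold uint64
	total     uint64
	weights   map[Beacon]uint64
	seen      map[string]bool // smesher IDs that were already counted
}

func newBeaconSync(threshold uint64) *beaconSync {
	return &beaconSync{
		threshold: threshold,
		weights:   make(map[Beacon]uint64),
		seen:      make(map[string]bool),
	}
}

// add records one report. Once the sampled weight reaches the threshold it
// returns the beacon with the most weight behind it (ties are broken by
// byte order so the result is deterministic).
func (s *beaconSync) add(smesher string, b Beacon, weight uint64) (Beacon, bool) {
	if !s.seen[smesher] {
		s.seen[smesher] = true
		s.weights[b] += weight
		s.total += weight
	}
	if s.total < s.threshold {
		return Beacon{}, false
	}
	var best Beacon
	var bestWeight uint64
	for cand, w := range s.weights {
		if w > bestWeight || (w == bestWeight && bytes.Compare(cand[:], best[:]) < 0) {
			best, bestWeight = cand, w
		}
	}
	return best, true
}

func main() {
	honest, bogus := Beacon{1, 2, 3, 4}, Beacon{9, 9, 9, 9}
	s := newBeaconSync(800)
	s.add("smesher-a", bogus, 200) // the dishonest value arrives first
	s.add("smesher-b", honest, 500)
	if b, ok := s.add("smesher-c", honest, 100); ok {
		fmt.Printf("synced beacon %x after sampling weight %d\n", b[:], s.total)
	}
}
```

arrival order does not matter here: a dishonest value received first only wins if it carries the most weight in the sample.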

extract beacon into a separately managed service that has the following requirements

@lrettig @pigmej can you put this on the research agenda to discuss this proposal?

dshulyak commented 1 year ago

the idea is to start the beacon sync early (as soon as the beacon protocol finishes).

imagine that 10% will leave nodes running and won't update, so only those 10% will know the beacon. is it a reasonable assumption that those 10% can collect enough space units to "sync" the next beacon?

countvonzero commented 1 year ago

is it a reasonable assumption that those 10% can collect enough space units to "sync" the next beacon?

the current setting in beacon sync for mainnet is to collect 800 weight units of ballots/proposals. technically spacemesh can maintain at least 800 weight units online to run the beacon protocol to make it work. @noamnelke is this a reasonable assumption?

dshulyak commented 1 year ago

i think we don't need this change that much if we persist state and allow syncing the beacon before the epoch starts

countvonzero commented 1 year ago

i think we don't need this change that much if we persist state and allow syncing the beacon before the epoch starts

this change, to me, is to allow syncing beacon before epoch starts

countvonzero commented 1 year ago

maybe worth renaming this issue to "start beacon sync before target epoch starts"

spacemeshos / go-spacemesh