spacemeshos / go-spacemesh

Go Implementation of the Spacemesh protocol full node. 💾⏰💪
https://spacemesh.io
MIT License
748 stars 211 forks source link

Node design is unnecessarily fragile for poet registration phase #6035

Open reythia opened 3 months ago

reythia commented 3 months ago

There's a couple of design choices which result in frustrating experiences for users, and issues for poet-operators beyond their control due to behavior at the node level. It's a very fragile design.

It doesn't need to be this way.

Issue 1: Limited retry

Nodes only retry registration for 15-20 minutes, regardless of grace period set. Even the default 12hr servers have a 1hr grace period, but a node with a connection issue in the first 30 minutes will fail to register despite having adequate time before the cycle gap ends.

The 15-20 minute limit is shared by the get proof path, making it impossible to increase registration timeout/retry without also increasing proof start timeout.

Solution: Establish a seperate retry parameter for registration allowing users to make use of the entire grace-period. Better yet, the default behavior should be retry until cycle gap ends as that's what users expect.

Issue 2: Single shot

Nodes fire a batch of registrations to all configured poet servers, if any succeed no retry is made on the other servers. There are obviously many reasons why a system might not be able to reach an external server or get the expected response - that's why retrying network requests is standard practice.

This caused serious isues for Team24 in April - which I originally thought was entirely our fault - whereby some users only registered with one poet due to temporary issues on another poet. Those issues were resolved well within the grace period but I now understand that nodes did not even try to register with the other poet. Unfortunately in the next epoch a poet server died leaving many users unable to proof. It could have been avoided if nodes retried registrations.

Solution: Currently BuildNIPost only attempts registrations if registrations for that node in local.sql has count == 0. It should check registrations vs servers listed in config instead and retry any missing.

rnizametdinov commented 2 months ago

very annoying design bug