There's a couple of design choices which result in frustrating experiences for users, and issues for poet-operators beyond their control due to behavior at the node level. It's a very fragile design.
It doesn't need to be this way.
Issue 1: Limited retry
Nodes only retry registration for 15-20 minutes, regardless of grace period set. Even the default 12hr servers have a 1hr grace period, but a node with a connection issue in the first 30 minutes will fail to register despite having adequate time before the cycle gap ends.
The 15-20 minute limit is shared by the get proof path, making it impossible to increase registration timeout/retry without also increasing proof start timeout.
Solution: Establish a seperate retry parameter for registration allowing users to make use of the entire grace-period. Better yet, the default behavior should be retry until cycle gap ends as that's what users expect.
Issue 2: Single shot
Nodes fire a batch of registrations to all configured poet servers, if any succeed no retry is made on the other servers. There are obviously many reasons why a system might not be able to reach an external server or get the expected response - that's why retrying network requests is standard practice.
This caused serious isues for Team24 in April - which I originally thought was entirely our fault - whereby some users only registered with one poet due to temporary issues on another poet. Those issues were resolved well within the grace period but I now understand that nodes did not even try to register with the other poet. Unfortunately in the next epoch a poet server died leaving many users unable to proof. It could have been avoided if nodes retried registrations.
Solution: Currently BuildNIPost only attempts registrations if registrations for that node in local.sql has count == 0. It should check registrations vs servers listed in config instead and retry any missing.
There's a couple of design choices which result in frustrating experiences for users, and issues for poet-operators beyond their control due to behavior at the node level. It's a very fragile design.
It doesn't need to be this way.
Issue 1: Limited retry
Nodes only retry registration for 15-20 minutes, regardless of grace period set. Even the default 12hr servers have a 1hr grace period, but a node with a connection issue in the first 30 minutes will fail to register despite having adequate time before the cycle gap ends.
The 15-20 minute limit is shared by the get proof path, making it impossible to increase registration timeout/retry without also increasing proof start timeout.
Solution: Establish a seperate retry parameter for registration allowing users to make use of the entire grace-period. Better yet, the default behavior should be retry until cycle gap ends as that's what users expect.
Issue 2: Single shot
Nodes fire a batch of registrations to all configured poet servers, if any succeed no retry is made on the other servers. There are obviously many reasons why a system might not be able to reach an external server or get the expected response - that's why retrying network requests is standard practice.
This caused serious isues for Team24 in April - which I originally thought was entirely our fault - whereby some users only registered with one poet due to temporary issues on another poet. Those issues were resolved well within the grace period but I now understand that nodes did not even try to register with the other poet. Unfortunately in the next epoch a poet server died leaving many users unable to proof. It could have been avoided if nodes retried registrations.
Solution: Currently BuildNIPost only attempts registrations if registrations for that node in local.sql has count == 0. It should check registrations vs servers listed in config instead and retry any missing.