nspcc-dev / neofs-node

NeoFS is a decentralized distributed object storage integrated with the Neo blockchain
https://fs.neo.org
GNU General Public License v3.0

Revise blockchain height check on startup #2426

Open cthulhu-rider opened 1 year ago

cthulhu-rider commented 1 year ago

Inner Ring and Storage nodes check that the height of the underlying blockchain is greater than or equal to the latest encountered one, which is optionally persisted in the local storage (config and config respectively).

The app requests the current height via RPC, compares the result with the persisted one and fails if the local value is greater.

Which chain is stuck?

According to @aprasolova's experience, we can encounter the following error in the log:

RPC block counter 738108 didn't reach expected height 2272533

The message does not show which chain (main or side) is stuck. It's proposed to reflect the blockchain kind in this log message.
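A minimal sketch of what such a message could look like. The `chainKind` type, its constants and `checkHeight` are hypothetical names for illustration, not the actual neofs-node code, and the choice of the side chain in `main` is arbitrary, since the original log line does not say which chain it was.

```go
package main

import "fmt"

// chainKind is a hypothetical label for the blockchain being checked.
type chainKind string

const (
	mainChain chainKind = "main"
	sideChain chainKind = "side"
)

// checkHeight compares the height reported by RPC with the locally
// persisted one and names the chain in the returned error.
func checkHeight(kind chainKind, rpcHeight, persisted uint32) error {
	if rpcHeight < persisted {
		return fmt.Errorf("%s chain: RPC block counter %d didn't reach expected height %d",
			kind, rpcHeight, persisted)
	}
	return nil
}

func main() {
	// Heights are taken from the log line quoted above; the chain kind is assumed.
	if err := checkHeight(sideChain, 738108, 2272533); err != nil {
		fmt.Println(err)
	}
}
```

With the kind in the error text, an admin can tell at a glance which RPC endpoint to inspect.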

Await or not await

It's possible that the chain node is still synchronizing its state and hasn't reached the up-to-date state yet. In this case the NeoFS node fails immediately. Instead, it could wait within some context (global or with some sane deadline) and free the admin from periodically restarting the app.

btw, in the code the check function is called awaitHeight, which by its name implies a background wait, but in fact it does not wait.

Maybe there are other signs that would allow NeoFS to understand what exactly is happening at the moment and distinguish between freeze and synchronization, for example. If so, then we could improve behavior and admin UX. @AnnaShaleva @roman-khimov

Blockchain reset

If the chain was reset and the admin restarts the node, it will fail until the fresh chain reaches a height not less than the persisted one. In this case it's not obvious to the admin that the local state should be reset too. As a possible solution, we could also take into account the blockchain network magic, but it may also be left untouched by the reset.
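A rough sketch of how the network magic could be taken into account, assuming the magic is persisted next to the height. `persistedState`, `verdict` and `compare` are hypothetical names. As noted above, a reset that keeps the same magic would still look like a lagging chain here.

```go
package main

import "fmt"

// persistedState is a hypothetical record kept in the local storage.
type persistedState struct {
	Magic  uint32 // network magic seen when the height was saved
	Height uint32 // latest encountered block height
}

// verdict is the outcome of comparing the persisted state with the RPC node.
type verdict int

const (
	verdictOK           verdict = iota // RPC node is at or above the persisted height
	verdictBehind                      // same network, but RPC node is below the persisted height
	verdictOtherNetwork                // magic differs: persisted height belongs to another chain
)

// compare checks the magic first, because heights of different
// networks are not comparable at all.
func compare(st persistedState, rpcMagic, rpcHeight uint32) verdict {
	switch {
	case st.Magic != rpcMagic:
		return verdictOtherNetwork
	case rpcHeight < st.Height:
		return verdictBehind
	default:
		return verdictOK
	}
}

func main() {
	// Illustrative values only.
	st := persistedState{Magic: 1, Height: 2272533}

	switch compare(st, 2, 100) {
	case verdictOtherNetwork:
		fmt.Println("network magic changed: persisted height should be reset")
	case verdictBehind:
		fmt.Println("chain is behind the persisted height")
	default:
		fmt.Println("ok")
	}
}
```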

carpawell commented 1 year ago

btw, in the code the check function is called awaitHeight, which by its name implies a background wait, but in fact it does not wait.

There is some detail about it. It did wait in #798, but also stopped waiting in the same PR. So maybe @532910 has some info about it (and the issue in general).

cthulhu-rider commented 1 year ago

i also started to think about connection switch in multi-RPC setting. @carpawell ur an expert of this currently, pls explain how this reconn could affect our state sync

roman-khimov commented 1 year ago

This block counter can't be perfect since local state can be dropped at any time. But it helps in some ways, so:

distinguish between freeze and synchronization

No 100% reliable way to do that. But StartWhenSynchronized RPC option helps somewhat, at least the node is supposed to be up to date when it starts serving RPC (so this problem shouldn't happen at all).

if chain was reset

Just forget this for now.