Resynchronize from the local DB on version upgrade

roman-khimov commented 1 year ago

We change the DB format from time to time, or we change something in native contracts; either way, these changes lead to DB incompatibility. When a new NeoGo version starts with an old DB, it refuses to work, errors out and leaves it up to the administrator to resolve the problem. The fix is to drop the old DB and resynchronize (usually from the network, since managing chain dumps is not fun for an average admin), and our CHANGELOG usually says explicitly that a DB upgrade is needed to run this version. This works fine for public networks, although fetching blocks from the network adds some inevitable overhead.

Now consider a NeoFS network with integrated CNs (nspcc-dev/neofs-node#2194) operated by @532910. This network may not have any ordinary NeoGo nodes at all, having just four or seven IR nodes. Then we upgrade from X.Y.Z to A.B.C and there is a note in the CHANGELOG saying "please drop the DB and resynchronize". What @532910 will do is invoke `ansible -m shell -a 'rm -fr /var/chain.db' ir_nodes`, then change the version line in the config and run `ansible-playbook plays/ir.yml ir_nodes`. What he will get as a result is a broken system: there are no more chain copies on the network, and effectively it's a clean installation now.

But can he be blamed for doing so? I doubt it, because that's what's written in the documentation. The fact that you're supposed to roll this update out one node at a time, waiting for each node to synchronize before touching the next, is not written anywhere. Even if it were, it's too easy to make this mistake and break the system.
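For illustration, the undocumented manual procedure amounts to something like the sketch below. It is only a sketch: `rollingUpgrade` and its `upgrade` and `waitSynced` callbacks are hypothetical names, not part of NeoGo or of any existing tooling, and how a node is actually upgraded or checked for synchronization is left to the caller.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotSynced is a hypothetical sentinel a waitSynced callback may return
// when a node fails to catch up with the rest of the network.
var errNotSynced = errors.New("node did not synchronize")

// rollingUpgrade upgrades nodes strictly one at a time. It moves on to the
// next node only after the previous one reports that it has resynchronized,
// so the remaining nodes still hold a copy of the chain at every step.
// It returns the number of nodes that were fully upgraded and synchronized.
func rollingUpgrade(nodes []string, upgrade, waitSynced func(node string) error) (int, error) {
	for i, n := range nodes {
		if err := upgrade(n); err != nil {
			return i, fmt.Errorf("upgrade %s: %w", n, err)
		}
		if err := waitSynced(n); err != nil {
			// Stop here: upgrading further nodes could destroy the last chain copies.
			return i, fmt.Errorf("sync %s: %w", n, err)
		}
	}
	return len(nodes), nil
}
```

The important property is the early return: if any node fails to synchronize, the remaining nodes are left untouched and keep their chain data.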

What can be done instead is handling these upgrades internally in NeoGo by reusing the block data we already have in the DB. This means that we could be reading some old DB format, but the format doesn't change often in this particular aspect. This also means that old and new entries will have to coexist in the same DB for some time, but that can be handled as well. What we'll get in return is quicker and more reliable upgrades.
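A minimal sketch of that startup behaviour, under heavy assumptions: `localDB`, `applyFunc` and `openOrResync` are hypothetical names invented for this example and do not reflect NeoGo's storage layer. The model keeps raw blocks separately from version-dependent derived data, rebuilds the derived data next to the old one, and swaps only after every block has been reapplied.

```go
package main

import "errors"

// NOTE: every name here is hypothetical and only illustrates the idea;
// none of it is NeoGo's actual storage API.

// errNoBlocks is returned when there is nothing local to rebuild from.
var errNoBlocks = errors.New("no local blocks to resynchronize from")

// localDB models a node database: a format version marker, raw block data
// indexed by height, and derived data whose layout depends on the version.
type localDB struct {
	version string
	blocks  [][]byte
	state   map[string][]byte
}

// applyFunc processes one block and writes its results into state.
type applyFunc func(state map[string][]byte, height int, block []byte) error

// openOrResync accepts a DB of any version. If the version matches, nothing
// happens. Otherwise, instead of refusing to start, it reapplies all local
// blocks into a fresh state. The old state stays in place until the rebuild
// succeeds, so a failed upgrade leaves the DB as it was.
// It reports whether a rebuild took place.
func openOrResync(db *localDB, want string, apply applyFunc) (bool, error) {
	if db.version == want {
		return false, nil
	}
	if len(db.blocks) == 0 {
		return false, errNoBlocks
	}
	fresh := make(map[string][]byte)
	for h, b := range db.blocks {
		if err := apply(fresh, h, b); err != nil {
			return false, err // old state and version marker are untouched
		}
	}
	db.state = fresh
	db.version = want
	return true, nil
}
```

Building the new state alongside the old one is the simplest way to model the "old and new entries coexist for some time" requirement; a real implementation would more likely do this incrementally inside one DB.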

roman-khimov commented 1 year ago

CC @cthulhu-rider.

AnnaShaleva commented 1 year ago

LGTM, we can always iterate through the whole set of blocks stored under the storage.ExecBlock prefix and call StoreBlock for them one more time to apply the new DB format. The only issue I see is that this scheme can be applied to archive nodes only. If the old node was run with the old DB format on a trimmed state, we still have to run the state sync process from a clean DB and fetch MPT-related data from other nodes.
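A rough sketch of such a replay loop, with simplifying assumptions stated up front: the DB is modelled as a plain map, `blockPrefix`, `blockKey` and `replayBlocks` are hypothetical names, and blocks are keyed by prefix plus big-endian height purely to make ordering easy here, which is not necessarily how NeoGo lays out its keys. The `storeBlock` callback stands in for whatever reprocesses a block in the new format.

```go
package main

import (
	"encoding/binary"
	"errors"
	"sort"
)

// blockPrefix is a hypothetical one-byte key prefix for stored blocks.
const blockPrefix byte = 0x01

// errGap signals that the local block sequence does not start at genesis or
// has holes, as on a trimmed node; the caller has to fall back to network sync.
var errGap = errors.New("local block sequence is incomplete")

// blockKey builds prefix + big-endian height (an assumption of this sketch).
func blockKey(height uint32) string {
	k := make([]byte, 5)
	k[0] = blockPrefix
	binary.BigEndian.PutUint32(k[1:], height)
	return string(k)
}

// replayBlocks feeds every stored block to storeBlock in height order and
// returns the number of blocks that were reprocessed successfully.
func replayBlocks(db map[string][]byte, storeBlock func(height uint32, raw []byte) error) (uint32, error) {
	var keys []string
	for k := range db {
		if len(k) == 5 && k[0] == blockPrefix {
			keys = append(keys, k)
		}
	}
	// Big-endian heights sort lexicographically in ascending height order.
	sort.Strings(keys)
	for i, k := range keys {
		h := binary.BigEndian.Uint32([]byte(k[1:]))
		if h != uint32(i) {
			return uint32(i), errGap
		}
		if err := storeBlock(h, db[k]); err != nil {
			return uint32(i), err
		}
	}
	return uint32(len(keys)), nil
}
```

The gap check is a crude stand-in for the archive-only limitation: when the local data cannot be replayed from the very first block, the sketch gives up rather than producing a partial state.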

AnnaShaleva commented 4 months ago

@fyfyrchik, you're interested in this functionality as well, so you may want to subscribe.

nspcc-dev / neo-go

Resynchronize from the local DB on version upgrade #2976