ton-blockchain / TIPs

Improvement Proposal repository

Hard fork process #2

Open tolya-yanot opened 4 years ago

tolya-yanot commented 4 years ago

It is necessary to test the rollback (hard fork) procedure for the case of a fatal network shutdown due to a bug, a vulnerability, etc.

sonofmom commented 4 years ago

This issue should cover both the technical and the organisational process needed to coordinate a network reboot.

EmelyanenkoK commented 3 years ago

The current state of the research is as follows:

  1. We can generate a new block via create-hardfork. Having the config master key, we insert a config update request into that block; in particular, we can update ConfigParam34 (the list of validators) and continue validation with the new set. To do so, we:
    • Generate a config query with the ConfigParam34 update via update-config.fif. Ensure that the list of validators contains enough validators that will work with the hardforked config; currently we insert only one validator into the list.
    • Generate a new block which is at least one hundred blocks earlier than the latest block: /usr/bin/ton/create-hardfork/create-hardfork -m config-query.boc -D /var/ton-work/db -T \(-1,8000000000000000,1907216\):36CA49D6A2CE2D45923B16F716365D0B344B21DDC57A2D4F4A28D4AE5264160A:E385A14F5C693471FB58FFCA0464AEE8A623267E311632F485521BC7A39300E5 -w -1:8000000000000000
      • Note that the block creation timeouts are quite strict, so if there are too many archived states that must be checked first, generation will fail. To overcome this issue, one might increase the timeouts in manager-disk and manager-hardfork.
    • Check that create-hardfork indeed inserted the request; in particular, check that "randomly skipping external message import because of overdue masterchain catchain rotation" did not happen.
    • Insert the generated hardfork block into global.config.json under the hardfork section.
    • Start the validator with the new config and ensure that it starts validating within a few minutes.
      1. We cannot make validation stable by default: after a few tens of blocks, the check masterchain_handle_->inited_next_left() at state_serializer.cpp:155 causes the node process to terminate.
        • We can mitigate this by commenting out the check; however, it looks like the same check stops the node from serializing states.
  2. We cannot connect another node to the first one (using the config with the hardfork): the error failed to download state : [Error : 651 : state not found] arises. This problem looks related to the one that causes the issue in the previous paragraph.
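
The -T argument in the create-hardfork command above packs a block ID as (workchain,shard,seqno):root_hash:file_hash. The following is a minimal Python sketch, not part of the TON tooling, that parses and re-formats such an ID and assembles the command as an argument list. The helper names (BlockId, parse_block_id, format_block_id, hardfork_argv) are hypothetical; only the flags -m, -D, -T and -w and the ID layout are taken from the command quoted above.

```python
import re
from typing import List, NamedTuple


class BlockId(NamedTuple):
    workchain: int
    shard: int       # unsigned 64-bit shard prefix
    seqno: int
    root_hash: str   # 64 hex characters
    file_hash: str   # 64 hex characters


# Layout taken from the -T argument: (wc,shard_hex,seqno):ROOT:FILE
_BLOCK_ID_RE = re.compile(
    r"^\((-?\d+),([0-9A-Fa-f]{16}),(\d+)\):([0-9A-Fa-f]{64}):([0-9A-Fa-f]{64})$"
)


def parse_block_id(text: str) -> BlockId:
    """Parse a full block ID string; raise ValueError if it is malformed."""
    match = _BLOCK_ID_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a full block id: {text!r}")
    wc, shard, seqno, root, file_ = match.groups()
    return BlockId(int(wc), int(shard, 16), int(seqno), root.upper(), file_.upper())


def format_block_id(block: BlockId) -> str:
    """Inverse of parse_block_id."""
    return (
        f"({block.workchain},{block.shard:016X},{block.seqno})"
        f":{block.root_hash}:{block.file_hash}"
    )


def hardfork_argv(binary: str, query_boc: str, db_root: str, block: BlockId) -> List[str]:
    """Build the argument list for the create-hardfork call shown above."""
    return [
        binary,
        "-m", query_boc,
        "-D", db_root,
        "-T", format_block_id(block),
        "-w", "-1:8000000000000000",
    ]
```

Passing the result to subprocess.run as a list avoids the shell, so the parentheses in the block ID do not need the backslash escapes used in the command line above.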

Current set of questions we need to investigate:

tolya-yanot commented 3 years ago

Detailed description of the current state of the research

We fork from the last block of the network so that user transactions are not lost.

During the fork, we change the validator set, because the previous validators are not available.

Overall plan

  1. Generate a block using create-hardfork. The new block must contain a new validator set (ConfigParam34).

  2. Update the global.config.json and place the new block in the db/static/ folder of the first validator. Run the first validator.

  3. After the first validator has launched successfully, update the global.config.json and place the new block in the db/static/ folders of the rest of the nodes. Run the nodes.
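
Steps 2 and 3 both edit global.config.json. Below is a minimal Python sketch of that edit, under loudly stated assumptions: that hardfork entries live in a hardforks list inside the validator section, that hashes are stored base64-encoded, and that the shard is stored as a signed 64-bit decimal. This layout and the helper names are assumptions to be checked against a known-good config; they are not confirmed by this thread.

```python
import base64


def to_signed64(value: int) -> int:
    """Reinterpret an unsigned 64-bit integer as signed (two's complement)."""
    return value - (1 << 64) if value >= (1 << 63) else value


def hex_to_b64(hex_hash: str) -> str:
    """Convert a 256-bit hash from hex to base64."""
    raw = bytes.fromhex(hex_hash)
    if len(raw) != 32:
        raise ValueError("expected a 32-byte hash")
    return base64.b64encode(raw).decode("ascii")


def add_hardfork(config: dict, workchain: int, shard_hex: str, seqno: int,
                 root_hash_hex: str, file_hash_hex: str) -> dict:
    """Append a hardfork block ID to the config (assumed layout, see above)."""
    entry = {
        "workchain": workchain,
        "shard": to_signed64(int(shard_hex, 16)),
        "seqno": seqno,
        "root_hash": hex_to_b64(root_hash_hex),
        "file_hash": hex_to_b64(file_hash_hex),
    }
    validator = config.setdefault("validator", {})
    hardforks = validator.setdefault("hardforks", [])
    if entry not in hardforks:  # keep the edit idempotent across reruns
        hardforks.append(entry)
    return config
```

A caller would load the file with json.load, call add_hardfork with the ID of the block produced by create-hardfork, and write the result back with json.dump; the same edited file can then be distributed to the remaining nodes in step 3.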

Detailed plan

1.1 Generate an external message, signed by the master key, that changes the validator set (ConfigParam34) to a new set containing one validator. We use update-config.fif, as well as create-state to check the validity of the message.

1.2 Use create-hardfork to create a new block. In -T we specify the last block of the network; in -m we specify the external message generated in step 1.1. The -w parameter is ignored by the code and is always equal to masterchain:shardAllId (-1:8000000000000000).

1.2.1 The -m (external message) parameter is ignored by the code, so to make it work we commented out the line with if (! is_hardfork_) { https://github.com/newton-blockchain/ton/blob/fd0bd9ed7fedbc87b0b09596b83e88f5ed77efdb/validator/impl/collator.cpp#L220

1.2.2 The -M parameter (binary file with serialized shard top block description) is declared in the create-hardfork code, but it is not implemented and hits UNREACHABLE(). Accordingly, we do not use this parameter, but maybe we need to use it?

1.2.3 Note that the block creation timeouts are quite strict, so if there are too many archived states that must be checked first, generation will fail. To overcome this issue, one might increase the timeouts in manager-hardfork.

1.2.4 Check that create-hardfork indeed inserted the request; in particular, check that "randomly skipping external message import because of overdue masterchain catchain rotation" did not happen.

2.1 We update the global.config.json of the first validator by inserting the hardfork section there.

2.2 Copy the resulting hardfork block to the /db/static/ folder.

2.3 Run the validator with this global config.

2.3.1 We don't use the --truncate-db parameter of validator-engine, but maybe we need to use it?

2.4 After the node is started with the new global config, it first syncs up to the top block.

2.4.1 This process occurs repeatedly: for some reason the initial read ends up a few tens of blocks behind the top block. Thus, restarting an already 'synced' node will still cause a few tens of blocks to be synced.

2.5 Then the node goes into waiting mode, being unable to process the hardfork block (note that in previous attempts, when we placed the hardfork block behind the top block, there were no issues in reading and processing the hardfork).

2.5.1 Interestingly enough, if during the run with the new global config we also specify --truncate-db to truncate a few hundred blocks from the top, the node starts syncing and randomly terminates with the state_serializer.cpp:155 masterchain_handle_->inited_next_left() error: https://github.com/newton-blockchain/ton/blob/fd0bd9ed7fedbc87b0b09596b83e88f5ed77efdb/validator/state-serializer.cpp#L158

Questions for further research

  1. Perhaps the number of validators in the new set should be more than 1

  2. It may be necessary to use the -M parameter (binary file with serialized shard top block description) in create-hardfork

  3. It may be necessary to use the --truncate-db parameter of validator-engine

OmicronTau commented 3 years ago
  1. A hardfork can be done via two utilities: test-ton-collator and create-hardfork. The difference between the two is as follows:
    • test-ton-collator applies the new block to the database. Thus, after collating the next block, the validator node should be run as is, without a new hardfork point in the global config. In contrast, nodes which will sync from the "hardforked node" should have both the hardforks and the init_block sections in the global config.
    • create-hardfork generates the hardfork block and serializes it to a file under the static directory. All nodes should have an updated global config.
  2. The init_block section in the global config may contain either the hardfork block ID or the ID of a key block after the hardfork. It is necessary that the key block used as the init block also has a persistent state in the database (since that state will be downloaded by other nodes during sync). To force all key blocks to have a persistent state, one may change the code of the is_persistent_state and persistent_state_ttl functions.
  3. To prevent the inited_next_left issue, and generally to save all transactions, it is recommended to hardfork from the current top block. However, if necessary, the corresponding check can be commented out and restored after the next key block.
  4. Initial synchronization of nodes with the hardforked one may take a long time. The checklist here is as follows:
    • A short period of [no nodes] errors is OK. If it lasts longer than one minute, check that the syncing node is connected to the dht server.
    • [adnl timeout error] for either proof link download or state download over a long period of time may indicate that the hardforked node is not connected to the network (the syncing node tries to ask other nodes with no success). It may be useful to create an isolated network by removing all dht-*** and adnl directories from all nodes and the dht network and changing the port (to make all entities unsearchable). After that, the dht server logs will help to understand the network map.
    • A lot of getnextkey not inited messages for a long period of time after some logs about state download is OK. States are often huge and take a long time to download. Do not turn off the node and it will eventually sync.
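
The checklist in item 4 can be turned into a small log triage helper. The sketch below counts the three messages named above and attaches the corresponding hint to each. It assumes these substrings appear in the node log as written in this comment; the real log format may differ, and the names HINTS, triage and report are hypothetical.

```python
from collections import Counter
from typing import Iterable, List

# Substrings are taken from the checklist above; matching them literally
# against real validator logs is an assumption.
HINTS = {
    "no nodes": "OK for a short time; after a minute check the dht connection",
    "adnl timeout": "hardforked node may be unreachable from the syncing node",
    "getnextkey not inited": "expected while a large state is downloading",
}


def triage(lines: Iterable[str]) -> Counter:
    """Count how many log lines contain each known marker."""
    counts: Counter = Counter()
    for line in lines:
        lowered = line.lower()
        for marker in HINTS:
            if marker in lowered:
                counts[marker] += 1
    return counts


def report(counts: Counter) -> List[str]:
    """Render one summary line per marker, most frequent first."""
    return [
        f"{marker} (x{count}): {HINTS[marker]}"
        for marker, count in counts.most_common()
    ]
```

For example, report(triage(open(path))) over a node log gives a quick view of which of the three situations dominates; the one-minute threshold for [no nodes] still has to be judged from the timestamps.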