This issue should cover both the technical and the organisational process required to coordinate a network reboot.

The current state of research is as follows:
Using `create-hardfork` and having the config master key, we can insert into the hardfork block a config-update request; in particular, we can update ConfigParam34 (the list of validators) and continue validation with the new set. To do so, we run:
/usr/bin/ton/create-hardfork/create-hardfork -m config-query.boc -D /var/ton-work/db -T \(-1,8000000000000000,1907216\):36CA49D6A2CE2D45923B16F716365D0B344B21DDC57A2D4F4A28D4AE5264160A:E385A14F5C693471FB58FFCA0464AEE8A623267E311632F485521BC7A39300E5 -w -1:8000000000000000
We check that `create-hardfork` indeed inserted the request: in particular, we check that the "randomly skipping external message import because of overdue masterchain catchain rotation" event didn't happen.

After adding the hardfork block to `global.config.json` under the hardfork section, `state_serializer.cpp:155 masterchain_handle_->inited_next_left()` causes the node process to terminate.

A `failed to download state : [Error : 651 : state not found]` error also arises. This looks related to the same problem that causes the issue in the previous paragraph.

The current set of questions we need to investigate:
- How to fix the `inited_next_left` error on the validator after the hardfork?

We fork from the last block of the network so that user transactions are not lost.
During the fork, we change the validator set, because the previous validators are not available.
1. Generate a block using `create-hardfork`. The new block must contain the new validator set (ConfigParam34).
2. Update `global.config.json` and place the new block in the `db/static/` folder of the first validator. Run the first validator.
3. After successfully launching the first validator: update `global.config.json` and place the new block in the `db/static/` folders of the rest of the nodes. Run those nodes.
1.1 Generate the external message, signed by the master key, that changes the validator set (ConfigParam34) to a new set containing one validator. We use `update-config.fif`, as well as `create-state` to check the validity of the message.
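A minimal sketch of this step, assuming the usual `fift` invocation: only the script names come from this issue, while the paths and the argument order below are purely illustrative (run `update-config.fif` without arguments and follow its usage string for the real interface).

```bash
# Illustrative only: build config-query.boc with the signed ConfigParam34 update.
# <config-master-key> and <new-validator-set> are placeholders.
fift -I /usr/src/ton/crypto/fift/lib:/usr/src/ton/crypto/smartcont \
     -s update-config.fif <config-master-key> <new-validator-set> config-query
```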
1.2 Use `create-hardfork` to create the new block. In `-T` we indicate the last block of the network, in `-m` the external message produced in step 1.1; the `-w` parameter is ignored by the code and is always equal to masterchain:shardAll (`-1:8000000000000000`).
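For reference, this is the same invocation as at the top of the issue, annotated; the block id values come from our run and should be replaced with the id of your actual last block:

```bash
# -D : node working database
# -m : external message with the config update (config-query.boc from step 1.1)
# -T : last block of the network, written as (workchain,shard,seqno):root_hash:file_hash
# -w : nominally the shard to fork; ignored by the code, always -1:8000000000000000
/usr/bin/ton/create-hardfork/create-hardfork -m config-query.boc -D /var/ton-work/db \
  -T '(-1,8000000000000000,1907216):36CA49D6A2CE2D45923B16F716365D0B344B21DDC57A2D4F4A28D4AE5264160A:E385A14F5C693471FB58FFCA0464AEE8A623267E311632F485521BC7A39300E5' \
  -w -1:8000000000000000
```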
1.2.1 The `-m` (external message) parameter is ignored by the code, so to make it work we commented out the `if (!is_hardfork_) {` check: https://github.com/newton-blockchain/ton/blob/fd0bd9ed7fedbc87b0b09596b83e88f5ed77efdb/validator/impl/collator.cpp#L220
1.2.2 The `-M` (binary file with a serialized shard top block description) parameter is accepted by the `create-hardfork` code, but it is not implemented and hits `UNREACHABLE()`. Accordingly, we do not use this parameter, but maybe we need to use it?
1.2.3 Note that the block-creation timeouts are quite strict, so if there are too many archived states that have to be checked first, generation will fail. To overcome this, one might increase the timeouts in `manager-hardfork`.
1.2.4 Check that `create-hardfork` indeed inserted the request: in particular, check that the "randomly skipping external message import because of overdue masterchain catchain rotation" event didn't happen.
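A quick way to check this, assuming the node/collator log ends up under `/var/ton-work/` (the log path is an assumption, adjust it to your setup):

```bash
# Look for the warning quoted above; if it appears, the config-update message was
# dropped and the hardfork block must be regenerated.
grep -R "skipping external message import" /var/ton-work/log* \
  && echo "message was skipped - regenerate the block" \
  || echo "ok: message import was not skipped"
```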
2.1 We update the `global.config.json` of the first validator by inserting the hardfork section there.
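For reference, a sketch of the block-id entry that goes into the `hardforks` array of the `validator` section; field names follow the `validator.config.global` schema (hashes are base64, the shard is the signed decimal form of 8000000000000000), and the values below are placeholders for the id of the generated hardfork block:

```bash
# Placeholder values -- merge this object into the "hardforks" array of the
# "validator" section of global.config.json.
cat > hardfork-block-id.json <<'EOF'
{
  "workchain": -1,
  "shard": -9223372036854775808,
  "seqno": "<seqno of the hardfork block>",
  "root_hash": "<base64 root hash of the hardfork block>",
  "file_hash": "<base64 file hash of the hardfork block>"
}
EOF
```

The `init_block` section lives in the same `validator` object and, as noted further down in this issue, may point either at the hardfork block or at a key block after it.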
2.2 Copy the resulting hardfork block into the `db/static/` folder.
2.3 Run the validator with this global config
2.3.1 We don't use the `--truncate-db` parameter of `validator-engine`, but maybe we need to use it?
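For reference, a sketch of how the node is started at this step; the db path matches the `-D` used earlier, while the config path and the `--truncate-db` argument (assumed to be a masterchain seqno) are assumptions:

```bash
# Normal start with the updated global config:
validator-engine -C /var/ton-work/etc/global.config.json --db /var/ton-work/db
# Experimental start that also truncates the database below the hardfork point
# (see 2.5.1 for the crash observed after doing this):
# validator-engine -C /var/ton-work/etc/global.config.json --db /var/ton-work/db --truncate-db <masterchain seqno>
```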
2.4 After running the node with the new global config, it first syncs up to the top block.
2.4.1 This happens repeatedly: for some reason the initial read ends up a few tens of blocks behind the top block. Thus, restarting an already "synced" node will still cause a few tens of blocks to be synced.
2.5 Then the node passes into waiting mode, being unable to process the hardfork block (note that in previous attempts, when we placed the hardfork block behind the top block, there were no issues reading and processing the hardfork).
2.5.1 Interestingly enough, if during a run with the new global config we also specify `--truncate-db` to truncate a few hundred blocks from the top, the node starts syncing and randomly terminates with the `state_serializer.cpp:155 masterchain_handle_->inited_next_left()` error: https://github.com/newton-blockchain/ton/blob/fd0bd9ed7fedbc87b0b09596b83e88f5ed77efdb/validator/state-serializer.cpp#L158
Possible fixes to investigate:
- Perhaps the number of validators in the new set should be more than 1.
- It may be necessary to use the `-M` parameter (binary file with a serialized shard top block description) of `create-hardfork`.
- It may be necessary to use the `--truncate-db` parameter of `validator-engine`.
A hardfork block can be produced with either `test-ton-collator` or `create-hardfork`. The difference between the two is as follows:
- `test-ton-collator` applies the new block to the database. Thus, after collating the next block, the validator node should be run as is, without a new hardfork point in the global config. In contrast, nodes that will sync from the "hardforked" node should have both the `hardforks` and `init_block` sections in the global config.
- `create-hardfork` generates the hardfork block and serializes it to a file under the static directory. All nodes should have the updated global config.

The `init_block` section in the global config may contain either the hardfork block id or the id of a key block after the hardfork. The key block used as the init block must also have a persistent state in the database (since that state will be downloaded by other nodes during sync). To force all key blocks to have persistent states, one may change the `is_persistent_state` and `persistent_state_ttl` function code.
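If someone wants to try that, the relevant code can be located like this (the path is relative to the newton-blockchain/ton tree linked above; where exactly these functions live may differ between revisions):

```bash
# Find the persistent-state policy functions mentioned above before patching them.
grep -RnE "is_persistent_state|persistent_state_ttl" validator/
```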
To avoid the `inited_next_left` issue and, generally, to save all transactions, it is recommended to hardfork from the current top block. However, if necessary, the corresponding check can be commented out and restored after the next key block.

Notes on syncing:
- A `[no nodes]` error is OK. If it lasts longer than one minute, check that the syncing node is connected to the DHT server.
- An `[adnl timeout error]` on proof link download or state download over a long period of time may indicate that the hardforked node is not connected to the network (the syncing node keeps asking other nodes with no success). It may be useful to create an isolated network by removing all `dht-***` and `adnl` directories from all nodes and the DHT server, and changing the ports (to make all entities unsearchable); a sketch of this reset is shown after this list. After that, the DHT server logs make it possible to understand the network map.
- Seeing `getnextkey not inited` for a long period of time after some logs about state download is OK. States are often huge and take a lot of time to download. Do not turn off the node; it will eventually sync.
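A sketch of that isolated-network reset, assuming the `/var/ton-work/db` layout used earlier in this issue (stop the node first, and change the node/DHT ports afterwards so the old network cannot reach them):

```bash
# Drop the DHT and ADNL state so the node no longer announces itself to the old network.
rm -rf /var/ton-work/db/dht-* /var/ton-work/db/adnl
```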
It is necessary to test the rollback (hardfork) actions against a fatal network shutdown due to a bug / vulnerability / etc.