near / nearcore

Reference client for NEAR Protocol
https://near.org

Stateless Validation launch doomsday scenarios #11656

Closed pugachAG closed 2 months ago

pugachAG commented 3 months ago

This is a top-level issue to track potential failure scenarios with the Stateless Validation launch. It also covers Congestion Control, since it ships in the same release.

Congestion Control

cc @wacban

Stateless Validation

Memtrie

State Sync

Chunk Endorsements

pugachAG commented 2 months ago

Large witness test in forknet-20

Chunks with large witnesses were generated using #11703 on a forknet with 20 nodes.

Max witness size

The first experiment was to find the minimum compressed witness size at which the network can no longer get the chunk included in a block. consensus.min_block_production_delay (which effectively determines block production time) was set to 1.3 seconds. The resulting witness size limit is 25MB.

We then increased min_block_production_delay to 3 seconds to test network recovery. The network recovered successfully after a restart with the updated config. With the increased block production time, the network was still able to make progress with witness sizes up to 50MB.
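The config change described above can be sketched in Python. This is only an illustration: the helper name with_block_production_delay is hypothetical, and it assumes that config.json stores durations as an object with "secs" and "nanos" fields under the "consensus" key. Check the layout of your own config.json before relying on it.

```python
import copy


def with_block_production_delay(config: dict, seconds: float) -> dict:
    """Return a copy of a config.json dict with min_block_production_delay updated.

    Assumption: durations are serialized as {"secs": ..., "nanos": ...}.
    """
    if seconds < 0:
        raise ValueError("seconds must be non-negative")
    # Work in whole nanoseconds to avoid float rounding artifacts.
    total_nanos = round(seconds * 1_000_000_000)
    secs, nanos = divmod(total_nanos, 1_000_000_000)
    updated = copy.deepcopy(config)  # leave the caller's dict untouched
    updated.setdefault("consensus", {})["min_block_production_delay"] = {
        "secs": secs,
        "nanos": nanos,
    }
    return updated
```

For the recovery test, one would load config.json with json.load, call with_block_production_delay(config, 3.0), write the result back, and restart the node.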

Missing chunks with a stream of large witnesses

We also want to know what ratio of chunks a shard would miss if we generate large witnesses for all heights. #11771 contains the Python script used to generate such a stream of large witnesses. All graphs below show the number of missing witnesses per minute.
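To relate a per-minute count from the graphs to a ratio of missing chunks, a rough conversion can be sketched as follows. It assumes one chunk is expected per shard per block height and that the count is for a single shard; the function name and the block time passed in are illustrative, not values taken from the tests.

```python
def missing_chunk_ratio(missing_per_minute: float, block_time_seconds: float) -> float:
    """Estimate the fraction of chunks a single shard misses.

    Assumption: exactly one chunk is expected per block height, so the
    expected number of chunks per minute is 60 / block_time_seconds.
    """
    if block_time_seconds <= 0:
        raise ValueError("block_time_seconds must be positive")
    if missing_per_minute < 0:
        raise ValueError("missing_per_minute must be non-negative")
    expected_per_minute = 60.0 / block_time_seconds
    # Clamp to 1.0: a noisy metric can report slightly more than expected.
    return min(1.0, missing_per_minute / expected_per_minute)
```

For example, with a hypothetical block time of 1.5 seconds, 40 chunks are expected per minute, so 10 missing chunks per minute corresponds to a ratio of 0.25.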

all shards up to 6MB

Only occasional missing chunks are observed:

Screenshot 2024-07-21 at 17 45 47

single shard up to 7MB

Screenshot 2024-07-15 at 13 54 27

all shards 7MB

Much worse than 7MB for one shard, most probably because nodes use 6x more network bandwidth when distributing witnesses for all shards:

Screenshot 2024-07-21 at 22 46 04

single shard 10 MB

A substantial number of missing chunks:

Screenshot 2024-07-15 at 14 28 24

all shards 10MB

Screenshot 2024-07-21 at 23 03 27

single shard 15MB

Pretty bad, but still 50%+ of chunks are included:

Screenshot 2024-07-15 at 14 44 53

single shard 20MB

The shard still makes some progress, at ~25-30%:

Screenshot 2024-07-15 at 15 33 27

up to 3MB on all shards on top of mainnet traffic

It is also important for us to know how much witness size margin we have on top of the current mainnet traffic. With 3MB (which more than doubles the witness size compared to the baseline mainnet traffic) we see only occasional chunk misses:

Screenshot 2024-07-22 at 14 46 54
pugachAG commented 2 months ago

Undo block cmd test

neard undo-block can be used to remove the head block from the chain. This makes it possible to recover a node that somehow ended up with an incorrectly applied chain head block. Such a node cannot make any progress, because any block built on top of that head hits a chunk extra mismatch.

Testing in forknet

forknet-20 was used to test the undo-block command:

  1. Pick an active validator node and stop it: mirror --host-filter mocknet-mainnet-118727510-smalltest-40eb stop-nodes
  2. SSH to the node and run the undo-block command: ./binaries/neard1 --unsafe-fast-startup undo-block
  3. Start the node and make sure it produces blocks and chunks again: mirror --host-filter mocknet-mainnet-118727510-smalltest-40eb start-nodes
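The three steps above can be collected into a small Python helper that only builds the command strings, so they can be reviewed before anything is run. The function name is hypothetical; the commands themselves are the ones listed above, and the second one is meant to be run on the node itself after connecting over SSH.

```python
import shlex


def undo_block_commands(host_filter: str, neard_path: str = "./binaries/neard1") -> list[str]:
    """Return the shell commands for the undo-block procedure, in order.

    Nothing is executed here; the caller decides how and where to run them.
    """
    host = shlex.quote(host_filter)
    return [
        # Step 1: stop the validator node.
        f"mirror --host-filter {host} stop-nodes",
        # Step 2: run on the node itself to remove the head block.
        f"{shlex.quote(neard_path)} --unsafe-fast-startup undo-block",
        # Step 3: start the node again.
        f"mirror --host-filter {host} start-nodes",
    ]
```

Printing the result of undo_block_commands("mocknet-mainnet-118727510-smalltest-40eb") gives the same three commands as in the list above.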

neard undo-block logs:

ubuntu@mocknet-mainnet-118727510-smalltest-40eb:~/.near/neard-runner$ ./binaries/neard1 --unsafe-fast-startup undo-block
2024-07-29T10:47:10.525618Z  INFO neard: version="2.0.0-rc.5" build="fbf9e49" latest_protocol=69
2024-07-29T10:47:10.546205Z  INFO config: Validating Config, extracted from config.json...
2024-07-29T10:47:10.552812Z  WARN genesis: Skipped genesis validation
2024-07-29T10:47:10.552841Z  WARN genesis: Skipped genesis validation
2024-07-29T10:47:10.552854Z  INFO config: All validations have passed!
2024-07-29T10:47:10.561251Z  INFO db_opener: Opening NodeStorage path="/home/ubuntu/.near/data" cold_path="none"
2024-07-29T10:47:10.561376Z  INFO db: Opened a new RocksDB instance. num_instances=1
2024-07-29T10:47:11.314456Z  INFO db: Closed a RocksDB instance. num_instances=0
2024-07-29T10:47:11.314486Z  INFO db_opener: The database exists. path=/home/ubuntu/.near/data
2024-07-29T10:47:11.314543Z  INFO db: Opened a new RocksDB instance. num_instances=1
2024-07-29T10:47:13.775962Z  INFO db: Closed a RocksDB instance. num_instances=0
2024-07-29T10:47:13.776009Z  INFO db: Opened a new RocksDB instance. num_instances=1
2024-07-29T10:47:13.782417Z  INFO db: Closed a RocksDB instance. num_instances=0
2024-07-29T10:47:13.782448Z  INFO db: Opened a new RocksDB instance. num_instances=1
2024-07-29T10:47:13.865679Z  INFO neard: Trying to update head prev_block_hash=5eDtyLhnmhD9ywKYFccpjq8AGH7DDxWRvuTQeTBMHNue current_head_hash=Ef3DqL2i6A4ztQ5Bz34kmu75UYb1XRrV6FwzytZLT7Mj prev_block_height=118921032 current_head_height=118921033
2024-07-29T10:47:13.898358Z  INFO neard: The current chain store shows new_head_height=118921032 new_header_height=118921032
2024-07-29T10:47:13.906414Z  INFO db: Closed a RocksDB instance. num_instances=0
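A log like the one above can be checked automatically. The sketch below is a hypothetical helper, not part of neard: it relies only on the key=value fields visible in the two "neard:" lines of this log and reports whether the new head matches the previous block height and is below the old head height.

```python
import re

# Matches key=value pairs such as prev_block_height=118921032.
_FIELD = re.compile(r"(\w+)=(\S+)")


def undo_block_succeeded(log_text: str) -> bool:
    """Return True if the log shows the head moved back to the previous block."""
    fields = {}
    for line in log_text.splitlines():
        if "Trying to update head" in line or "The current chain store shows" in line:
            fields.update(_FIELD.findall(line))
    try:
        prev_height = int(fields["prev_block_height"])
        old_head = int(fields["current_head_height"])
        new_head = int(fields["new_head_height"])
    except (KeyError, ValueError):
        # Expected fields are missing or malformed.
        return False
    # Heights may skip, so only require that the head moved back to prev.
    return new_head == prev_height and new_head < old_head
```

The exact log wording may differ between neard versions, so the two marker strings should be treated as assumptions tied to the output shown here.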
pugachAG commented 2 months ago

Large witness test in statelessnet

The setup is similar to the one used for forknet.

all shards 3MB

Screenshot 2024-07-30 at 12 58 48

all shards 4MB

Screenshot 2024-07-30 at 13 47 07
pugachAG commented 2 months ago

Slow chunk

A slow chunk was reproduced in forknet using the sleep method in test contracts added in #11317. We wanted to make sure that #11344 works as expected in forknet with stateless validation. The test was successful; the resulting chain:

Screenshot 2024-07-30 at 17 39 56