near / nearcore

Reference client for NEAR Protocol
https://near.org
GNU General Public License v3.0

State sync didn't download the state for shard 0 in 60 seconds, sending StateRequest again #7285

Closed: siddharth-teli closed this issue 1 month ago

siddharth-teli commented 2 years ago

I am running a few NEAR nodes locally. I created a new account, deployed the staking pool contract on that account, and started running the account as a validator node using nearcore. Initially the number of validators increments by one, but after a while the validator node starts throwing the warning State sync didn't download the state for shard 0 in 60 seconds, sending StateRequest again, and the number of validators drops by one. After that I am not able to connect back as a validator. Is that node blacklisted forever or something? Also, what does this warning mean?

Alien901 commented 2 years ago

How do I use state sync? Can you share the docs?

kucharskim commented 1 year ago

I've seen that too (also reported on Telegram: https://t.me/c/1331540142/1/17035).

On 1.31.0-rc.4 I was not able to sync a fresh node up to the chain head. After downloading headers to 100%, it would fail with:

2023-03-06T23:23:12.119353Z  INFO stats: State Ayrv5nzZCf42DPFYRxxMmUykra3BBk5Nj5m113WD3Wzt[0: parts][1: parts][2: parts][3: parts] 35 peers ⬇ 20.8 MB/s ⬆ 69.4 kB/s 0.00 bps 0 gas/s CPU: 1598%, Mem: 33.2 GB
2023-03-06T23:23:22.233504Z  INFO stats: State Ayrv5nzZCf42DPFYRxxMmUykra3BBk5Nj5m113WD3Wzt[0: parts][1: parts][2: parts][3: parts] 35 peers ⬇ 20.8 MB/s ⬆ 69.5 kB/s 0.00 bps 0 gas/s CPU: 1594%, Mem: 33.3 GB
2023-03-06T23:23:32.375861Z  INFO stats: State Ayrv5nzZCf42DPFYRxxMmUykra3BBk5Nj5m113WD3Wzt[0: parts][1: parts][2: parts][3: parts] 35 peers ⬇ 20.8 MB/s ⬆ 69.5 kB/s 0.00 bps 0 gas/s CPU: 1595%, Mem: 33.6 GB
2023-03-06T23:23:42.288617Z  WARN sync: State sync didn't download the state for shard 0 in 60 seconds, sending StateRequest again
2023-03-06T23:23:42.437048Z  WARN sync: State sync didn't download the state for shard 1 in 60 seconds, sending StateRequest again
2023-03-06T23:23:42.557629Z  WARN sync: State sync didn't download the state for shard 2 in 60 seconds, sending StateRequest again
2023-03-06T23:23:42.604241Z  WARN sync: State sync didn't download the state for shard 3 in 60 seconds, sending StateRequest again
2023-03-06T23:23:42.717608Z  INFO stats: State Ayrv5nzZCf42DPFYRxxMmUykra3BBk5Nj5m113WD3Wzt[0: parts][1: parts][2: parts][3: parts] 35 peers ⬇ 20.8 MB/s ⬆ 69.5 kB/s 0.00 bps 0 gas/s CPU: 1596%, Mem: 33.7 GB
2023-03-06T23:23:52.810235Z  INFO stats: State Ayrv5nzZCf42DPFYRxxMmUykra3BBk5Nj5m113WD3Wzt[0: parts][1: parts][2: parts][3: parts] 35 peers ⬇ 20.8 MB/s ⬆ 69.5 kB/s 0.00 bps 0 gas/s CPU: 1594%, Mem: 33.4 GB
2023-03-06T23:24:02.863314Z  INFO stats: State Ayrv5nzZCf42DPFYRxxMmUykra3BBk5Nj5m113WD3Wzt[0: parts][1: parts][2: parts][3: parts] 35 peers ⬇ 20.8 MB/s ⬆ 69.5 kB/s 0.00 bps 0 gas/s CPU: 1593%, Mem: 33.7 GB
2023-03-06T23:24:12.965147Z  INFO stats: State Ayrv5nzZCf42DPFYRxxMmUykra3BBk5Nj5m113WD3Wzt[0: parts][1: parts][2: parts][3: parts] 35 peers ⬇ 20.8 MB/s ⬆ 69.5 kB/s 0.00 bps 0 gas/s CPU: 1593%, Mem: 33.3 GB
2023-03-06T23:24:22.986073Z  INFO stats: State Ayrv5nzZCf42DPFYRxxMmUykra3BBk5Nj5m113WD3Wzt[0: parts][1: parts][2: parts][3: parts] 35 peers ⬇ 20.8 MB/s ⬆ 69.5 kB/s 0.00 bps 0 gas/s CPU: 1597%, Mem: 33.2 GB
2023-03-06T23:24:33.091551Z  INFO stats: State Ayrv5nzZCf42DPFYRxxMmUykra3BBk5Nj5m113WD3Wzt[0: parts][1: parts][2: parts][3: parts] 35 peers ⬇ 20.8 MB/s ⬆ 69.5 kB/s 0.00 bps 0 gas/s CPU: 1595%, Mem: 33.4 GB
2023-03-06T23:24:42.399943Z  WARN sync: State sync didn't download the state for shard 0 in 60 seconds, sending StateRequest again
2023-03-06T23:24:42.405590Z  WARN sync: State sync didn't download the state for shard 1 in 60 seconds, sending StateRequest again
2023-03-06T23:24:42.429431Z  WARN sync: State sync didn't download the state for shard 2 in 60 seconds, sending StateRequest again
2023-03-06T23:24:42.466833Z  WARN sync: State sync didn't download the state for shard 3 in 60 seconds, sending StateRequest again

and finally ended up in a crash loop of:

2023-03-07T17:24:49.889550Z  INFO db: Created a new RocksDB instance. num_instances=1
2023-03-07T17:24:49.897096Z  INFO db: Dropped a RocksDB instance. num_instances=0
2023-03-07T17:24:49.897104Z  INFO near: Opening an existing RocksDB database path=/data/neard/.near/data
2023-03-07T17:24:49.897107Z  INFO db: Created a new RocksDB instance. num_instances=1
2023-03-07T17:24:50.045035Z  INFO stats: #42376888 Waiting for peers 0 peers ⬇ 0 B/s ⬆ 0 B/s 0.00 bps 0 gas/s CPU: 0%, Mem: 5.08 GB
thread 'actix-rt|system:0|arbiter:0' panicked at 'called Result::unwrap() on an Err value: Chain(DBNotFoundErr("BLOCK: FWJ9kR6KFWoyMoNjpLXXGHeuiy7tEY6GmoFeCA5yuc6b"))', chain/client/src/client_actor.rs:270:47
stack backtrace:
   0: rust_begin_unwind
             at /rustc/897e37553bba8b42751c67658967889d11ecd120/library/std/src/panicking.rs:584:5
   1: core::panicking::panic_fmt
             at /rustc/897e37553bba8b42751c67658967889d11ecd120/library/core/src/panicking.rs:142:14
   2: core::result::unwrap_failed
             at /rustc/897e37553bba8b42751c67658967889d11ecd120/library/core/src/result.rs:1785:5
   3: <actix::contextimpl::ContextFut<A,C> as core::future::future::Future>::poll
   4: tokio::runtime::task::raw::poll
   5: tokio::task::local::LocalSet::tick
   6: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
neard.service: Main process exited, code=killed, status=6/ABRT

kucharskim commented 1 year ago

I'm testing now on 1.32.0-rc.1, but that will take approximately 6 days to conclude (header sync).

kucharskim commented 1 year ago

Looks worrying, as I don't want to be in a situation where I cannot sync a fresh node without a snapshot. I rarely use snapshots and prefer to sync from zero myself.

bowenwang1996 commented 1 year ago

cc @mm-near @nikurt

nikurt commented 1 year ago

That is not an issue introduced in 1.31 or in 1.32. Is this testnet or mainnet? Neither of the explorers can find FWJ9kR6KFWoyMoNjpLXXGHeuiy7tEY6GmoFeCA5yuc6b or Ayrv5nzZCf42DPFYRxxMmUykra3BBk5Nj5m113WD3Wzt.

WARN sync: State sync didn't download the state for shard

INFO logs would really be helpful here. I guess that at least one of the many thousands of state parts couldn't be downloaded from some peer. State sync is a resource-intensive process, and peers therefore throttle state sync requests, which makes timeouts expected.
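
For illustration, here is a minimal Rust sketch of the timeout-and-resend pattern behind that warning. The ShardSyncStatus type, the send_state_request helper, and the bookkeeping are hypothetical stand-ins, not nearcore's actual API; only the 60-second timeout comes from the log above.

use std::time::{Duration, Instant};

// The 60 seconds quoted in the warning above.
const STATE_REQUEST_TIMEOUT: Duration = Duration::from_secs(60);

// Hypothetical per-shard download status.
struct ShardSyncStatus {
    shard_id: u64,
    last_request: Instant,
    done: bool,
}

// Hypothetical stand-in: the real node sends a network message to a peer.
fn send_state_request(shard_id: u64) {
    let _ = shard_id;
}

fn check_state_sync(shards: &mut [ShardSyncStatus]) {
    for shard in shards.iter_mut().filter(|s| !s.done) {
        if shard.last_request.elapsed() >= STATE_REQUEST_TIMEOUT {
            // Peers throttle state part requests, so landing here
            // repeatedly is expected; the request is simply retried.
            eprintln!(
                "State sync didn't download the state for shard {} in 60 seconds, sending StateRequest again",
                shard.shard_id
            );
            send_state_request(shard.shard_id);
            shard.last_request = Instant::now();
        }
    }
}

fn main() {
    let mut shards = vec![ShardSyncStatus { shard_id: 0, last_request: Instant::now(), done: false }];
    check_state_sync(&mut shards);
}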

I don't know the code well enough, but on localnet I've observed that when HEAD moves to another epoch during state sync, state sync can be aborted or restarted, which can lead to the DBNotFoundErr error.

I used to suspect that DBNotFoundErr was caused by the user sending Ctrl+C to the process, but I have also gotten this error on localnet without any Ctrl+C.
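
The panic in the crash loop above is consistent with an unwrap() on exactly such a missing-block lookup. Here is a self-contained Rust sketch of that failure mode, with Block, Error, and get_block as hypothetical stand-ins for nearcore's chain types:

// Hypothetical stand-ins for nearcore's chain types.
struct Block;

#[derive(Debug)]
enum Error {
    DBNotFoundErr(String),
}

fn get_block(hash: &str) -> Result<Block, Error> {
    // If state sync was aborted partway through an epoch switch, the
    // requested block may simply not be in RocksDB yet.
    Err(Error::DBNotFoundErr(format!("BLOCK: {hash}")))
}

fn main() {
    // An unwrap() like the one at client_actor.rs:270 turns the missing
    // block into a panic that takes the client down.
    let _block = get_block("FWJ9kR6KFWoyMoNjpLXXGHeuiy7tEY6GmoFeCA5yuc6b").unwrap();
}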

Anyway, it's known that State Sync doesn't work reliably, and we've been working for the last two quarters on redesigning and reimplementing it. The new State Sync isn't ready yet.

Thank you @kucharskim for your report, this provides more data to support a known issue.

The current solution is to start a node from a snapshot published on S3.

Node sync process documentation: https://github.com/near/nearcore/blob/master/docs/architecture/how/sync.md

kucharskim commented 1 year ago

I discovered this on testnet and I am still testing on testnet. I don't know when the problem was introduced, as I only discovered it in the last month or so. Not that long ago, one of the testnet machines had maintenance and I needed to wipe data/ and start from scratch. I don't believe there was a SIGINT (Ctrl+C), as no one on the team looks into the testnet instances. They just run under systemd and are very low effort.

I don't think I've enabled state sync explicitly. I don't touch the testnet instance much; I probably change the config once a year. Is state sync enabled by default? How do I disable it then? I am happy to test more if you need something specific.

I'm currently at:

root@sandbox02 ~ # journalctl -axlfu neard.service  | grep INFO | cut -b66-
2023-03-08T11:00:08.793429Z  INFO stats: #42376888 Downloading headers 3.71% (74273395 left; at 45241951) 18 peers ⬇ 109 kB/s ⬆ 30.6 kB/s 0.00 bps 0 gas/s CPU: 675%, Mem: 6.09 GB
2023-03-08T11:00:22.315175Z  INFO stats: #42376888 Downloading headers 3.71% (74273395 left; at 45241951) 18 peers ⬇ 109 kB/s ⬆ 30.6 kB/s 0.00 bps 0 gas/s CPU: 707%, Mem: 6.09 GB
2023-03-08T11:00:35.355138Z  INFO stats: #42376888 Downloading headers 3.71% (74273395 left; at 45241951) 18 peers ⬇ 109 kB/s ⬆ 30.6 kB/s 0.00 bps 0 gas/s CPU: 637%, Mem: 6.09 GB
2023-03-08T11:00:48.940060Z  INFO stats: #42376888 Downloading headers 3.71% (74273395 left; at 45241951) 18 peers ⬇ 109 kB/s ⬆ 30.6 kB/s 0.00 bps 0 gas/s CPU: 735%, Mem: 6.09 GB
2023-03-08T11:01:02.517151Z  INFO stats: #42376888 Downloading headers 3.71% (74273396 left; at 45241951) 18 peers ⬇ 109 kB/s ⬆ 30.6 kB/s 0.00 bps 0 gas/s CPU: 722%, Mem: 6.10 GB
2023-03-08T11:01:15.863584Z  INFO stats: #42376888 Downloading headers 3.71% (74273396 left; at 45241951) 18 peers ⬇ 123 kB/s ⬆ 30.8 kB/s 0.00 bps 0 gas/s CPU: 670%, Mem: 6.10 GB
2023-03-08T11:01:29.715121Z  INFO stats: #42376888 Downloading headers 3.71% (74273395 left; at 45241951) 18 peers ⬇ 123 kB/s ⬆ 30.9 kB/s 0.00 bps 0 gas/s CPU: 771%, Mem: 6.11 GB
2023-03-08T11:01:42.723613Z  INFO stats: #42376888 Downloading headers 3.71% (74273396 left; at 45241951) 18 peers ⬇ 123 kB/s ⬆ 31.0 kB/s 0.00 bps 0 gas/s CPU: 651%, Mem: 6.11 GB
2023-03-08T11:01:57.060593Z  INFO stats: #42376888 Downloading headers 3.71% (74273396 left; at 45241951) 18 peers ⬇ 124 kB/s ⬆ 31.1 kB/s 0.00 bps 0 gas/s CPU: 825%, Mem: 6.11 GB
2023-03-08T11:02:11.264074Z  INFO stats: #42376888 Downloading headers 3.71% (74273396 left; at 45241951) 18 peers ⬇ 124 kB/s ⬆ 31.1 kB/s 0.00 bps 0 gas/s CPU: 801%, Mem: 6.12 GB
2023-03-08T11:02:25.353081Z  INFO stats: #42376888 Downloading headers 3.71% (74273396 left; at 45241951) 18 peers ⬇ 124 kB/s ⬆ 31.1 kB/s 0.00 bps 0 gas/s CPU: 787%, Mem: 6.13 GB
2023-03-08T11:02:39.188043Z  INFO stats: #42376888 Downloading headers 3.71% (74273396 left; at 45241951) 18 peers ⬇ 124 kB/s ⬆ 31.3 kB/s 0.00 bps 0 gas/s CPU: 777%, Mem: 6.13 GB
2023-03-08T11:02:53.198027Z  INFO stats: #42376888 Downloading headers 3.71% (74273396 left; at 45241951) 18 peers ⬇ 124 kB/s ⬆ 31.4 kB/s 0.00 bps 0 gas/s CPU: 754%, Mem: 6.13 GB

nikurt commented 1 year ago

Is state sync enabled by default?

Yes, for non-archival nodes. State sync is not applicable to archival nodes: archival nodes need every block, and therefore have to download and process every block. State sync was designed as a shortcut to skip syncing one or more epochs.
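
A rough Rust sketch of that decision; the SyncMode enum, the choose_sync_mode function, and the two-epoch threshold are illustrative simplifications, not nearcore's actual sync logic:

#[derive(Debug)]
enum SyncMode {
    BlockSync, // download and apply every block
    StateSync, // shortcut: download the state at an epoch boundary instead
}

fn choose_sync_mode(is_archival: bool, epochs_behind: u64) -> SyncMode {
    if is_archival {
        // Archival nodes must process every block, so the shortcut
        // never applies to them.
        SyncMode::BlockSync
    } else if epochs_behind >= 2 {
        // Assumed threshold: far enough behind the head that skipping
        // whole epochs is cheaper than replaying every block.
        SyncMode::StateSync
    } else {
        SyncMode::BlockSync
    }
}

fn main() {
    println!("{:?}", choose_sync_mode(false, 5)); // StateSync
    println!("{:?}", choose_sync_mode(true, 5));  // BlockSync
}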

How to disable it then?

Not possible. Either run an archival node or start from an S3 snapshot.

kucharskim commented 1 year ago

Ah, okay. So it NEEDS to work then. I'm not sure whether it works at all at the moment for non-archival nodes. However, my test is still running (per my earlier comment).

nikurt commented 1 year ago

It doesn't work on mainnet, because too many state part requests are sent to a small number of validators, and they heavily throttle those requests. It doesn't work on testnet, because shards 2 and 3 are too big: a single node needs about 2.5 days to create every part for testnet shard 3.
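
As a back-of-envelope check on that figure, here is a small Rust calculation; the part count and per-part cost below are assumed for illustration, and only the roughly-2.5-day result comes from the comment above:

fn main() {
    // Assumed values; neither number is from nearcore.
    let num_parts: f64 = 50_000.0;  // hypothetical part count for shard 3
    let secs_per_part: f64 = 4.5;   // hypothetical cost to build one part
    let total_days = num_parts * secs_per_part / 86_400.0;
    println!("~{total_days:.1} days to create every part"); // prints ~2.6 days
}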

kucharskim commented 1 year ago

Ah, so I don't need to test? It is known that it doesn't work?

nikurt commented 1 year ago

Correct, yes.

telezhnaya commented 1 month ago

Hey! Please update to the latest version, and feel free to re-open if it's still an issue for you.