Closed: siddharth-teli closed this issue 1 month ago
How do I use state sync? Can you share the docs?
I've seen that too, from Telegram https://t.me/c/1331540142/1/17035
On 1.31.0-rc.4 I was not able to sync a fresh node to the chain head. After downloading headers to 100%, it would fail on:
2023-03-06T23:23:12.119353Z INFO stats: State Ayrv5nzZCf42DPFYRxxMmUykra3BBk5Nj5m113WD3Wzt[0: parts][1: parts][2: parts][3: parts] 35 peers ⬇ 20.8 MB/s ⬆ 69.4 kB/s 0.00 bps 0 gas/s CPU: 1598%, Mem: 33.2 GB
2023-03-06T23:23:22.233504Z INFO stats: State Ayrv5nzZCf42DPFYRxxMmUykra3BBk5Nj5m113WD3Wzt[0: parts][1: parts][2: parts][3: parts] 35 peers ⬇ 20.8 MB/s ⬆ 69.5 kB/s 0.00 bps 0 gas/s CPU: 1594%, Mem: 33.3 GB
2023-03-06T23:23:32.375861Z INFO stats: State Ayrv5nzZCf42DPFYRxxMmUykra3BBk5Nj5m113WD3Wzt[0: parts][1: parts][2: parts][3: parts] 35 peers ⬇ 20.8 MB/s ⬆ 69.5 kB/s 0.00 bps 0 gas/s CPU: 1595%, Mem: 33.6 GB
2023-03-06T23:23:42.288617Z WARN sync: State sync didn't download the state for shard 0 in 60 seconds, sending StateRequest again
2023-03-06T23:23:42.437048Z WARN sync: State sync didn't download the state for shard 1 in 60 seconds, sending StateRequest again
2023-03-06T23:23:42.557629Z WARN sync: State sync didn't download the state for shard 2 in 60 seconds, sending StateRequest again
2023-03-06T23:23:42.604241Z WARN sync: State sync didn't download the state for shard 3 in 60 seconds, sending StateRequest again
2023-03-06T23:23:42.717608Z INFO stats: State Ayrv5nzZCf42DPFYRxxMmUykra3BBk5Nj5m113WD3Wzt[0: parts][1: parts][2: parts][3: parts] 35 peers ⬇ 20.8 MB/s ⬆ 69.5 kB/s 0.00 bps 0 gas/s CPU: 1596%, Mem: 33.7 GB
2023-03-06T23:23:52.810235Z INFO stats: State Ayrv5nzZCf42DPFYRxxMmUykra3BBk5Nj5m113WD3Wzt[0: parts][1: parts][2: parts][3: parts] 35 peers ⬇ 20.8 MB/s ⬆ 69.5 kB/s 0.00 bps 0 gas/s CPU: 1594%, Mem: 33.4 GB
2023-03-06T23:24:02.863314Z INFO stats: State Ayrv5nzZCf42DPFYRxxMmUykra3BBk5Nj5m113WD3Wzt[0: parts][1: parts][2: parts][3: parts] 35 peers ⬇ 20.8 MB/s ⬆ 69.5 kB/s 0.00 bps 0 gas/s CPU: 1593%, Mem: 33.7 GB
2023-03-06T23:24:12.965147Z INFO stats: State Ayrv5nzZCf42DPFYRxxMmUykra3BBk5Nj5m113WD3Wzt[0: parts][1: parts][2: parts][3: parts] 35 peers ⬇ 20.8 MB/s ⬆ 69.5 kB/s 0.00 bps 0 gas/s CPU: 1593%, Mem: 33.3 GB
2023-03-06T23:24:22.986073Z INFO stats: State Ayrv5nzZCf42DPFYRxxMmUykra3BBk5Nj5m113WD3Wzt[0: parts][1: parts][2: parts][3: parts] 35 peers ⬇ 20.8 MB/s ⬆ 69.5 kB/s 0.00 bps 0 gas/s CPU: 1597%, Mem: 33.2 GB
2023-03-06T23:24:33.091551Z INFO stats: State Ayrv5nzZCf42DPFYRxxMmUykra3BBk5Nj5m113WD3Wzt[0: parts][1: parts][2: parts][3: parts] 35 peers ⬇ 20.8 MB/s ⬆ 69.5 kB/s 0.00 bps 0 gas/s CPU: 1595%, Mem: 33.4 GB
2023-03-06T23:24:42.399943Z WARN sync: State sync didn't download the state for shard 0 in 60 seconds, sending StateRequest again
2023-03-06T23:24:42.405590Z WARN sync: State sync didn't download the state for shard 1 in 60 seconds, sending StateRequest again
2023-03-06T23:24:42.429431Z WARN sync: State sync didn't download the state for shard 2 in 60 seconds, sending StateRequest again
2023-03-06T23:24:42.466833Z WARN sync: State sync didn't download the state for shard 3 in 60 seconds, sending StateRequest again
and finally ended up in a crash loop of:
2023-03-07T17:24:49.889550Z INFO db: Created a new RocksDB instance. num_instances=1
2023-03-07T17:24:49.897096Z INFO db: Dropped a RocksDB instance. num_instances=0
2023-03-07T17:24:49.897104Z INFO near: Opening an existing RocksDB database path=/data/neard/.near/data
2023-03-07T17:24:49.897107Z INFO db: Created a new RocksDB instance. num_instances=1
2023-03-07T17:24:50.045035Z INFO stats: #42376888 Waiting for peers 0 peers ⬇ 0 B/s ⬆ 0 B/s 0.00 bps 0 gas/s CPU: 0%, Mem: 5.08 GB
thread 'actix-rt|system:0|arbiter:0' panicked at 'called Result::unwrap() on an Err value: Chain(DBNotFoundErr("BLOCK: FWJ9kR6KFWoyMoNjpLXXGHeuiy7tEY6GmoFeCA5yuc6b"))', chain/client/src/client_actor.rs:270:47
stack backtrace:
0: rust_begin_unwind
at /rustc/897e37553bba8b42751c67658967889d11ecd120/library/std/src/panicking.rs:584:5
1: core::panicking::panic_fmt
at /rustc/897e37553bba8b42751c67658967889d11ecd120/library/core/src/panicking.rs:142:14
2: core::result::unwrap_failed
at /rustc/897e37553bba8b42751c67658967889d11ecd120/library/core/src/result.rs:1785:5
3: <actix::contextimpl::ContextFut<A,C> as core::future::future::Future>::poll
4: tokio::runtime::task::raw::poll
5: tokio::task::local::LocalSet::tick
6: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
neard.service: Main process exited, code=killed, status=6/ABRT
I'm testing now on 1.32.0-rc.1, but that will take approximately 6 days to conclude (header sync).
This looks worrying, as I don't want to be in a situation where I cannot sync a fresh node without a snapshot. I rarely use snapshots and prefer to sync from zero myself.
cc @mm-near @nikurt
That is not an issue introduced in 1.31 or in 1.32.
Is this testnet or mainnet? Neither explorer can find FWJ9kR6KFWoyMoNjpLXXGHeuiy7tEY6GmoFeCA5yuc6b or Ayrv5nzZCf42DPFYRxxMmUykra3BBk5Nj5m113WD3Wzt.
WARN sync: State sync didn't download the state for shard
INFO logs would really be helpful here. I guess that at least one of the many thousands of state parts couldn't be downloaded from some peer. State sync is a resource-intensive process, and peers therefore throttle state sync requests, which makes timeouts expected.
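The repeated "sending StateRequest again" warnings in the log above come from a per-shard timeout loop: if a shard's state hasn't fully downloaded within the timeout, the request is re-sent. A minimal sketch of that pattern (illustrative only; the callback names `request_parts` and `all_parts_done` are hypothetical stand-ins for the real networking layer, not nearcore's actual API):

```python
import time

STATE_SYNC_TIMEOUT = 60.0  # seconds, matching the warning in the logs


def sync_shard_state(shard_id, request_parts, all_parts_done,
                     timeout=STATE_SYNC_TIMEOUT):
    """Re-send a StateRequest whenever a shard's state hasn't arrived in time.

    Returns the number of times the request had to be re-sent.
    """
    attempts = 0
    request_parts(shard_id)                  # initial StateRequest
    deadline = time.monotonic() + timeout
    while not all_parts_done(shard_id):
        if time.monotonic() >= deadline:
            attempts += 1
            print(f"WARN sync: State sync didn't download the state for shard "
                  f"{shard_id} in {timeout:.0f} seconds, "
                  f"sending StateRequest again")
            request_parts(shard_id)          # re-send, as seen in the logs
            deadline = time.monotonic() + timeout
        time.sleep(0)  # yield; real code would wait on network events instead
    return attempts
```

If peers throttle hard enough that no parts arrive within the window, this loop re-sends forever, which matches the repeating warnings every ~60 seconds in the log excerpt.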
I don't know the code well enough, but on localnet I've observed that when HEAD moves to another epoch during state sync, the state sync can be aborted or restarted, which can lead to the DBNotFoundErr error.
I used to suspect that DBNotFoundErr was caused by the user sending Ctrl+C to the process, but I got this error on localnet without any Ctrl+C as well.
Anyway, it's known that State Sync doesn't work reliably, and we've been working for the last two quarters on redesigning and reimplementing it. The new State Sync isn't ready yet.
Thank you @kucharskim for your report, this provides more data to support a known issue.
The current solution is to start a node from a snapshot published on S3.
Node sync process documentation: https://github.com/near/nearcore/blob/master/docs/architecture/how/sync.md
I discovered this on testnet and I am still testing on testnet. I don't know when the problem was introduced, as I only discovered it in the last month or so. Not that long ago; one of the testnet machines had maintenance and I needed to wipe data/ and start from scratch. I don't believe there was a SIGINT (Ctrl+C), as no one on the team looks into the testnet instances. They just run under systemd and are very low effort.
I don't think I've enabled state sync explicitly. I don't touch the testnet instance much; I probably change its config once per year. Is state sync enabled by default? If so, how do I disable it? I am happy to test more if you need something specific.
I'm currently at:
root@sandbox02 ~ # journalctl -axlfu neard.service | grep INFO | cut -b66-
2023-03-08T11:00:08.793429Z INFO stats: #42376888 Downloading headers 3.71% (74273395 left; at 45241951) 18 peers ⬇ 109 kB/s ⬆ 30.6 kB/s 0.00 bps 0 gas/s CPU: 675%, Mem: 6.09 GB
2023-03-08T11:00:22.315175Z INFO stats: #42376888 Downloading headers 3.71% (74273395 left; at 45241951) 18 peers ⬇ 109 kB/s ⬆ 30.6 kB/s 0.00 bps 0 gas/s CPU: 707%, Mem: 6.09 GB
2023-03-08T11:00:35.355138Z INFO stats: #42376888 Downloading headers 3.71% (74273395 left; at 45241951) 18 peers ⬇ 109 kB/s ⬆ 30.6 kB/s 0.00 bps 0 gas/s CPU: 637%, Mem: 6.09 GB
2023-03-08T11:00:48.940060Z INFO stats: #42376888 Downloading headers 3.71% (74273395 left; at 45241951) 18 peers ⬇ 109 kB/s ⬆ 30.6 kB/s 0.00 bps 0 gas/s CPU: 735%, Mem: 6.09 GB
2023-03-08T11:01:02.517151Z INFO stats: #42376888 Downloading headers 3.71% (74273396 left; at 45241951) 18 peers ⬇ 109 kB/s ⬆ 30.6 kB/s 0.00 bps 0 gas/s CPU: 722%, Mem: 6.10 GB
2023-03-08T11:01:15.863584Z INFO stats: #42376888 Downloading headers 3.71% (74273396 left; at 45241951) 18 peers ⬇ 123 kB/s ⬆ 30.8 kB/s 0.00 bps 0 gas/s CPU: 670%, Mem: 6.10 GB
2023-03-08T11:01:29.715121Z INFO stats: #42376888 Downloading headers 3.71% (74273395 left; at 45241951) 18 peers ⬇ 123 kB/s ⬆ 30.9 kB/s 0.00 bps 0 gas/s CPU: 771%, Mem: 6.11 GB
2023-03-08T11:01:42.723613Z INFO stats: #42376888 Downloading headers 3.71% (74273396 left; at 45241951) 18 peers ⬇ 123 kB/s ⬆ 31.0 kB/s 0.00 bps 0 gas/s CPU: 651%, Mem: 6.11 GB
2023-03-08T11:01:57.060593Z INFO stats: #42376888 Downloading headers 3.71% (74273396 left; at 45241951) 18 peers ⬇ 124 kB/s ⬆ 31.1 kB/s 0.00 bps 0 gas/s CPU: 825%, Mem: 6.11 GB
2023-03-08T11:02:11.264074Z INFO stats: #42376888 Downloading headers 3.71% (74273396 left; at 45241951) 18 peers ⬇ 124 kB/s ⬆ 31.1 kB/s 0.00 bps 0 gas/s CPU: 801%, Mem: 6.12 GB
2023-03-08T11:02:25.353081Z INFO stats: #42376888 Downloading headers 3.71% (74273396 left; at 45241951) 18 peers ⬇ 124 kB/s ⬆ 31.1 kB/s 0.00 bps 0 gas/s CPU: 787%, Mem: 6.13 GB
2023-03-08T11:02:39.188043Z INFO stats: #42376888 Downloading headers 3.71% (74273396 left; at 45241951) 18 peers ⬇ 124 kB/s ⬆ 31.3 kB/s 0.00 bps 0 gas/s CPU: 777%, Mem: 6.13 GB
2023-03-08T11:02:53.198027Z INFO stats: #42376888 Downloading headers 3.71% (74273396 left; at 45241951) 18 peers ⬇ 124 kB/s ⬆ 31.4 kB/s 0.00 bps 0 gas/s CPU: 754%, Mem: 6.13 GB
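For a rough sense of why header sync takes days here: the log shows about 74.3 million headers still to download. A back-of-envelope ETA helper (the ~143 headers/s rate is a hypothetical average implied by the ~6-day estimate above, not a measured value; actual rates fluctuate a lot):

```python
def headers_eta_days(headers_left: int, headers_per_second: float) -> float:
    """Estimate remaining header-sync time in days from a download rate."""
    return headers_left / headers_per_second / 86_400  # seconds per day


# With ~74.3M headers left (from the log above) and an assumed average rate
# of ~143 headers/s, the estimate lands near the ~6 days reported earlier.
eta = headers_eta_days(74_273_395, 143)
```

Note that in the ~3-minute log window above the "left" counter barely moves at all, so any rate measured over a short window is unreliable.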
Is state sync enabled by default?
Yes, for non-archival nodes. State sync is not applicable to archival nodes: they need every block, and therefore have to fetch and process every block. State sync was designed as a shortcut to skip syncing one or more epochs.
How to disable it then?
Not possible. Either run an archival node or start from an S3 snapshot.
Ah, okay. So it needs to work then. I'm not sure whether it currently works at all for non-archival nodes. However, my test is still running (per my earlier comment).
It doesn't work on mainnet, because too many state part requests are sent to a small number of validators, and they heavily throttle those requests. It doesn't work on testnet, because shards 2 and 3 are too big. A single node needs about 2.5 days to create every part for testnet shard 3.
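To put the "2.5 days" figure in perspective, here is the underlying arithmetic as a quick sketch (the part count and per-part cost below are hypothetical; only the 2.5-day total comes from the comment above):

```python
def part_generation_days(num_parts: int, seconds_per_part: float) -> float:
    """Total time to generate every state part, in days (parts x cost)."""
    return num_parts * seconds_per_part / 86_400  # seconds per day


# For example, 43,200 parts at 5 s/part would take exactly 2.5 days. The real
# part count and per-part cost for testnet shard 3 are not given in this thread.
days = part_generation_days(43_200, 5.0)
```

The point is that total generation time scales linearly with shard size, which is why the large testnet shards make state sync impractical.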
Ah, so I don't need to test; it is known that it doesn't work?
Correct, yes
Hey! Please update to the latest version and feel free to re-open if it's still an issue for you
I am running a few NEAR nodes locally. I created a new account, deployed a staking pool contract on that account, and started running the account as a validator node using nearcore. Initially the number of validators increments by one, but after a while the validator node starts throwing this warning:
State sync didn't download the state for shard 0 in 60 seconds, sending StateRequest again
and the number of validators drops by one. After that, I am not able to connect back as a validator. Is that node blacklisted forever or something? Also, what does this warning mean?