hyperledger / besu

An enterprise-grade Java-based, Apache 2.0 licensed Ethereum client https://wiki.hyperledger.org/display/besu
https://www.hyperledger.org/projects/besu
Apache License 2.0

World state sync never finishes and keeps repeating from scratch #2156

Closed · CherryDT closed this issue 3 years ago

CherryDT commented 3 years ago

Description

Sync never finishes. It has been running for days and keeps repeating the following situation (world state sync reaching ~95%, failing, starting from scratch):

2021-04-19 10:27:08.832+00:00 | EthScheduler-Services-8 (requestCompleteTask) | INFO  | CompleteTaskStep | Downloaded 531300000 world state nodes. At least 26689927 nodes remaining. Estimated World State completion: 95.22 %.
2021-04-19 10:27:17.642+00:00 | EthScheduler-Services-8 (requestCompleteTask) | INFO  | CompleteTaskStep | Downloaded 531400000 world state nodes. At least 26613337 nodes remaining. Estimated World State completion: 95.23 %.
2021-04-19 10:27:26.358+00:00 | EthScheduler-Services-8 (requestCompleteTask) | INFO  | CompleteTaskStep | Downloaded 531500000 world state nodes. At least 26537010 nodes remaining. Estimated World State completion: 95.24 %.
2021-04-19 10:27:35.209+00:00 | EthScheduler-Services-8 (requestCompleteTask) | INFO  | CompleteTaskStep | Downloaded 531600000 world state nodes. At least 26459998 nodes remaining. Estimated World State completion: 95.26 %.
2021-04-19 10:27:43.964+00:00 | EthScheduler-Services-8 (requestCompleteTask) | INFO  | CompleteTaskStep | Downloaded 531700000 world state nodes. At least 26383176 nodes remaining. Estimated World State completion: 95.27 %.
2021-04-19 10:27:47.115+00:00 | nioEventLoopGroup-3-5 | WARN  | FastSyncDownloader | Fast sync was unable to download the world state. Retrying with a new pivot block.
2021-04-19 10:27:47.115+00:00 | nioEventLoopGroup-3-5 | INFO  | FastSyncDownloader | Starting fast sync.
2021-04-19 10:27:47.116+00:00 | nioEventLoopGroup-3-5 | INFO  | FastSyncActions | Selecting block number 12269912 as fast sync pivot block.
2021-04-19 10:27:47.117+00:00 | nioEventLoopGroup-3-5 | INFO  | PivotBlockRetriever | Retrieve a pivot block that can be confirmed by at least 5 peers.
2021-04-19 10:27:47.117+00:00 | nioEventLoopGroup-3-5 | INFO  | PivotBlockConfirmer | Confirm pivot block 12269912 with at least 5 peers.
2021-04-19 10:27:47.155+00:00 | nioEventLoopGroup-3-9 | INFO  | PivotBlockConfirmer | Received 1 confirmation(s) for pivot block header 12269912: 0x2ebc69c9e68410e70e4f37854509a0b9ea16aead68af7d48ea14b51fec904587
2021-04-19 10:27:47.194+00:00 | nioEventLoopGroup-3-8 | INFO  | PivotBlockConfirmer | Received 2 confirmation(s) for pivot block header 12269912: 0x2ebc69c9e68410e70e4f37854509a0b9ea16aead68af7d48ea14b51fec904587
2021-04-19 10:27:47.201+00:00 | nioEventLoopGroup-3-9 | INFO  | PivotBlockConfirmer | Received 3 confirmation(s) for pivot block header 12269912: 0x2ebc69c9e68410e70e4f37854509a0b9ea16aead68af7d48ea14b51fec904587
2021-04-19 10:27:47.214+00:00 | nioEventLoopGroup-3-1 | INFO  | PivotBlockConfirmer | Received 4 confirmation(s) for pivot block header 12269912: 0x2ebc69c9e68410e70e4f37854509a0b9ea16aead68af7d48ea14b51fec904587
2021-04-19 10:27:47.334+00:00 | nioEventLoopGroup-3-1 | INFO  | PivotBlockConfirmer | Confirmed pivot block at 12269912: 0x2ebc69c9e68410e70e4f37854509a0b9ea16aead68af7d48ea14b51fec904587
2021-04-19 10:27:47.337+00:00 | nioEventLoopGroup-3-1 | INFO  | WorldStateDownloader | Begin downloading world state from peers for block 12269912 (0x2ebc69c9e68410e70e4f37854509a0b9ea16aead68af7d48ea14b51fec904587). State root 0x9f8040bd1aa73216a8c335a28f9111d7b0f2e7dc7423dcce511253fe6d4990fb
2021-04-19 10:27:47.615+00:00 | nioEventLoopGroup-3-9 | INFO  | SyncTargetManager | Found common ancestor with peer Peer 0x0c6a27649764e9f8b6... at block 12262153
2021-04-19 10:28:09.211+00:00 | EthScheduler-Services-34 (importBlock) | INFO  | FastImportBlocksStep | Completed importing chain segment 12262154 to 12262353, Peers: 25
2021-04-19 10:28:14.084+00:00 | EthScheduler-Services-28 (requestCompleteTask) | INFO  | CompleteTaskStep | Downloaded 100000 world state nodes. At least 1400433 nodes remaining. Estimated World State completion: 6.66 %.
2021-04-19 10:28:14.783+00:00 | EthScheduler-Services-34 (importBlock) | INFO  | FastImportBlocksStep | Completed importing chain segment 12262354 to 12262553, Peers: 25
2021-04-19 10:28:19.047+00:00 | EthScheduler-Services-34 (importBlock) | INFO  | FastImportBlocksStep | Completed importing chain segment 12262554 to 12262753, Peers: 25
2021-04-19 10:28:23.765+00:00 | EthScheduler-Services-34 (importBlock) | INFO  | FastImportBlocksStep | Completed importing chain segment 12262754 to 12262953, Peers: 25
2021-04-19 10:28:29.507+00:00 | EthScheduler-Services-34 (importBlock) | INFO  | FastImportBlocksStep | Completed importing chain segment 12262954 to 12263153, Peers: 25
2021-04-19 10:28:33.025+00:00 | EthScheduler-Services-28 (requestCompleteTask) | INFO  | CompleteTaskStep | Downloaded 200000 world state nodes. At least 2122681 nodes remaining. Estimated World State completion: 8.61 %.
2021-04-19 10:28:35.042+00:00 | EthScheduler-Services-34 (importBlock) | INFO  | FastImportBlocksStep | Completed importing chain segment 12263154 to 12263353, Peers: 25

Steps to Reproduce (Bug)

  1. Install
  2. Wait
  3. Be frustrated... well, not sure what to write here, because I didn't do anything special.

Expected behavior: Sync finishes at some point.

Actual behavior: Downloading world state nodes takes ages and stops around 95% with Fast sync was unable to download the world state. Retrying with a new pivot block., then starts over from around 6%, repeating over and over.

Frequency: 100% (it repeated 3 times so far, and I restarted it once)

Versions (Add all that apply)

Additional Information

It's a fresh install on a new server. There are 725 GB free on the data volume (and 9.5 GB on root).

My config file:

data-path="/storage/besudata"

rpc-http-enabled=true
rpc-ws-enabled=true
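
Since the logs show fast sync running but the config above doesn't select it, the sync mode was presumably set on the command line. For reference, a sketch of how the same choices could be pinned down in the config file; the two extra keys are Besu options from that era, shown here as an assumption about this particular setup rather than a known fact about it:

data-path="/storage/besudata"

sync-mode="FAST"          # assumption: fast sync was likely enabled via a CLI flag
fast-sync-min-peers=5     # assumption: matches the "at least 5 peers" pivot confirmation in the logs

rpc-http-enabled=true
rpc-ws-enabled=true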

Also, I've noticed that at random intervals I get a few minutes of "bursts" where this error is logged several times per second (not sure if it's related; I figured it's because of some bad servers on the network - see https://github.com/hyperledger/besu/issues/2142):

2021-04-18 20:24:03.416+00:00 | vert.x-eventloop-thread-2 | ERROR | VertxPeerDiscoveryAgent | Encountered error while handling packet
java.lang.IllegalArgumentException: Could not recover public key
        at org.hyperledger.besu.crypto.SECP256K1.recoverFromSignatureNative(SECP256K1.java:495)
        at org.hyperledger.besu.crypto.SECP256K1.recoverPublicKeyFromSignature(SECP256K1.java:370)
        at org.hyperledger.besu.ethereum.p2p.discovery.internal.Packet.<init>(Packet.java:90)
        at org.hyperledger.besu.ethereum.p2p.discovery.internal.Packet.decode(Packet.java:126)
        at org.hyperledger.besu.ethereum.p2p.discovery.VertxPeerDiscoveryAgent.lambda$handlePacket$6(VertxPeerDiscoveryAgent.java:221)
        at io.vertx.core.impl.ContextImpl.lambda$executeBlocking$2(ContextImpl.java:313)
        at io.vertx.core.impl.TaskQueue.run(TaskQueue.java:76)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.base/java.lang.Thread.run(Thread.java:834)
RatanRSur commented 3 years ago

Quick note about the packet issue: that will be fixed with #2143.

As for the fast sync, I'll explain here what's going on.

Due to their state pruning, our peers may stop serving the entire world state at the pivot block we were downloading the state for. To get back into the range of blocks they still keep state for, we repivot to a newer block. That said, all is not lost, because much of the trie is shared between world states; the root and near-root layers, however, are often all different. The drop you see in the percentage is because we are now traversing that new trie and getting a sense of how many nodes we're missing there. The percentage itself is misleading because it's a percentage of the nodes known so far in the trie, which are only discovered as we traverse it and find out how many children each node has, and so on. It's not a global percentage known ahead of time.
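
To make that concrete, here is a minimal sketch (hypothetical Java, not Besu's actual CompleteTaskStep code) of how such an estimate can be derived from the two counters that appear in the log lines, using the numbers from the report above:

// Illustrative only: the "remaining" count is a lower bound that grows as the
// trie is traversed, which is why the percentage can drop after a repivot.
public class WorldStateProgress {
    static double estimateCompletion(long downloaded, long knownRemaining) {
        return 100.0 * downloaded / (downloaded + knownRemaining);
    }

    public static void main(String[] args) {
        // From the log: "Downloaded 531300000 world state nodes.
        // At least 26689927 nodes remaining."
        System.out.printf("Estimated World State completion: %.2f %%%n",
                estimateCompletion(531_300_000L, 26_689_927L)); // prints 95.22 %
    }
}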

Appreciate the issue report, though - I think it's time we changed these logs so they don't give the impression that things are going wrong!

CherryDT commented 3 years ago

Thank you for the clarification. Can you tell me roughly how many world state nodes there are in total, though? The sync is at 277500000 right now.

(I'm still a bit confused about why it takes so long: it's still not finished, and in parallel I also set up an OpenEthereum node, which took less than a day to get fully synced and operational, while Besu has been running for 5 days now.)

CherryDT commented 3 years ago

Still syncing:

2021-04-21 09:29:34.107+00:00 | EthScheduler-Services-37 (requestCompleteTask) | INFO  | CompleteTaskStep | Downloaded 372600000 world state nodes. At least 26351361 nodes remaining. Estimated World State completion: 93.39 %.

If I understand your previous explanation correctly, it's OK if it "restarts" a few times because some sort of process timed out, but the amount of time it takes for the "percentage" to go from 0 to 100 should decrease every time, catching up after those "timeouts" and eventually reaching 100%. But I don't see that happening. It seems to take more or less the same time (about a day) every attempt, gets to around 530000000 nodes, and I don't see any improvement or progress. And as I wrote yesterday, I find it weird that OpenEthereum was able to get fully synced in less than a day while this has been dragging on for 6 days now; it makes me feel like something is wrong here.

What can I do to diagnose the issue better? What would my next steps towards a solution be?

Thank you

RatanRSur commented 3 years ago

Regarding OpenEthereum, that might have to do with their sync protocol, warp sync. We are currently working on a database format that will allow us to also use a faster class of sync algorithm.

Can you provide the full logs of your node and the flags/config file you used to configure it? I can look into it to see if there's something strange going on.

RatanRSur commented 3 years ago

Actually, first just try restarting the node. You might have gotten into a state where a peer you have is behind and so you're not getting new pivot blocks near the chain head.
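
For context, a minimal sketch of the pivot-staleness problem described above, assuming (as in Besu's fast sync of that era) that the pivot is chosen a fixed distance behind the best peer's reported chain head; the names and the 50-block distance are illustrative assumptions, not Besu's exact code:

// Illustrative only: if the best peer is stuck behind the real chain head,
// every repivot keeps landing on a stale block, so the world state being
// requested keeps falling out of the range other peers still serve.
public class PivotSelection {
    static final long PIVOT_DISTANCE_FROM_HEAD = 50; // assumed default

    static long selectPivot(long bestPeerChainHeight) {
        return bestPeerChainHeight - PIVOT_DISTANCE_FROM_HEAD;
    }

    public static void main(String[] args) {
        // A best peer reporting height 12269962 would yield the pivot
        // block 12269912 seen in the logs above.
        System.out.println(selectPivot(12_269_962L));
    }
}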

CherryDT commented 3 years ago

Restart: I did that once before and it didn't help, but after your last comment I did it again, and this time it actually finished after 2 more days! At least the world state part finished; it's now continuing with a normal sync, and everything seems to be fine now. Thank you!