CherryDT closed 3 years ago
Quick note about the packet issue: that will be fixed with #2143.
As for the fast sync, I'll explain here what's going on.
Due to their state pruning, our peers may stop serving the entire world state at the pivot block we were downloading the state at. To get back in the range that they're keeping the state for, we repivot to a newer block. That said, all is not lost because much of the trie is common between world states. The root and near-root layers are often all different, however. The drop you see with the percentage is because we are now traversing that new trie and getting a sense for how many nodes we're missing there. The percentage itself is misleading because it's a percentage of known nodes in the trie, which are only discovered as we traverse the children and find out how many children they have etc. It's not a global percentage known ahead of time.
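To illustrate why a percentage of known nodes can go down even while the download makes progress, here is a minimal sketch. It is not Besu's implementation: the function name, the toy trie, and the breadth-first order are assumptions made purely for illustration, and it does not model repivoting. The only idea it takes from the explanation above is that a node's children are discovered when the node itself is downloaded.

```python
from collections import deque

def simulate_download(children, root):
    """Yield (downloaded, known_remaining, percent) after each node download.

    `children` maps a node id to the list of its child ids. A node's
    children only become known once the node itself has been downloaded,
    so the total is never known ahead of time.
    """
    pending = deque([root])  # nodes we know about but have not fetched yet
    downloaded = 0
    while pending:
        node = pending.popleft()
        downloaded += 1
        pending.extend(children.get(node, []))  # newly discovered nodes
        known_total = downloaded + len(pending)
        yield downloaded, len(pending), 100.0 * downloaded / known_total

# Hypothetical toy trie: the root has one child, which has three leaves.
toy_trie = {
    "root": ["a"],
    "a": ["a1", "a2", "a3"],
}

history = list(simulate_download(toy_trie, "root"))
for downloaded, remaining, percent in history:
    print(f"downloaded={downloaded} remaining>={remaining} estimate={percent:.2f}%")
```

In this toy run the estimate goes from 50% to 40% after the second download, because fetching node "a" reveals three children that were not counted before. Nothing went wrong; the denominator simply grew.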
Appreciate the issue report though, I think it's time we change these logs to not give a sense that things are going wrong!
Thank you for the clarification. Can you tell me how many world nodes there are at the moment, though? The sync is at 277500000 at the moment.
(I'm still a bit confused about why it takes so long. It's still not finished, while an OpenEthereum node that I set up in parallel took less than a day to get fully synced and operational. Besu has been running for 5 days now.)
Still syncing:
2021-04-21 09:29:34.107+00:00 | EthScheduler-Services-37 (requestCompleteTask) | INFO | CompleteTaskStep | Downloaded 372600000 world state nodes. At least 26351361 nodes remaining. Estimated World State completion: 93.39 %.
If I understand your previous explanation correctly, it's OK if it "restarts" a few times because some sort of process timed out, but the amount of time it takes for the "percentage" to go from 0 to 100 should decrease every time, catching up with those "timeouts" and eventually reaching 100%. But I don't see that happening. It seems to take more or less the same time (about a day) every time, goes to around 530000000, and I don't see any improvement or progress. And as I wrote yesterday, I find it weird that OpenEthereum was able to get fully synced in less than a day while this has been dragging on for 6 days now. It makes me feel like something is wrong here.
What can I do to diagnose the issue better? What would be my next steps towards a solution?
Thank you
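As a quick sanity check, the figure in the log line quoted above is consistent with the estimate being downloaded / (downloaded + known remaining). That formula is inferred from the numbers in the log, not taken from the Besu source, and the function name below is hypothetical.

```python
import re

# Message part of the log line quoted in this thread.
LOG_LINE = (
    "Downloaded 372600000 world state nodes. "
    "At least 26351361 nodes remaining. "
    "Estimated World State completion: 93.39 %."
)

def estimated_completion(downloaded, remaining):
    """Assumed formula: share of downloaded nodes among all nodes known so far."""
    return 100.0 * downloaded / (downloaded + remaining)

match = re.search(
    r"Downloaded (\d+) world state nodes\. At least (\d+) nodes remaining",
    LOG_LINE,
)
downloaded, remaining = (int(group) for group in match.groups())
estimate = estimated_completion(downloaded, remaining)
print(f"{estimate:.2f} %")  # prints "93.39 %"
```

Because "remaining" is only a lower bound ("At least ..."), the estimate is an upper bound on real progress and can fall again whenever new nodes are discovered.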
Regarding OpenEthereum, that might have to do with their sync protocol, warp sync. We are currently working on a database format that will allow us to also use a faster class of sync algorithm.
Can you provide the full logs of your node and the flags or config file you used to configure it? I can look into it to see if there's something strange going on.
Actually, first just try restarting the node. You might have gotten into a state where a peer you have is behind and so you're not getting new pivot blocks near the chain head.
Restart: I did that once before and it didn't help - after your last comment I did it again and this time it actually finished after 2 more days! At least the world state thing finished, it's now continuing a normal sync, and everything seems to be fine now. Thank you!
Description
Sync never finishes. It has been running for days and keeps repeating the following situation (world state sync reaching ~95%, failing, starting from scratch):
Steps to Reproduce (Bug)
Expected behavior: Sync finishes at some point.
Actual behavior: "Downloaded world state nodes" takes ages and stops around 95% with "Fast sync was unable to download the world state. Retrying with a new pivot block.", starting again from 6%, repeating over and over.
Frequency: 100% (it repeated 3 times so far, and I restarted it once)
Versions (Add all that apply)
besu/v21.1.4/linux-x86_64/openjdk-java-11
Linux ip-172-31-4-149 5.4.0-1037-aws #39-Ubuntu SMP Thu Jan 14 02:56:06 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
r5a.2xlarge
Additional Information
It's a fresh install on a new server. There are 725 GB free on the data volume (and 9.5 GB on root).
My config file:
Also, I noticed now that at random intervals I get a few minutes of "bursts" with this error logged several times per second (not sure if it's related, I figured it's because of some bad servers on the network - see https://github.com/hyperledger/besu/issues/2142):