maticnetwork / bor

Official repository for the Polygon Blockchain
https://polygon.technology/
GNU Lesser General Public License v3.0

Bor sync stuck at block 0x312d050 #1115

Closed eldimious closed 1 month ago

eldimious commented 8 months ago

System information

Bor client version: 1.2.1

Heimdall client version: 1.0.3

OS & Version: Linux

Environment: Polygon Mainnet

Type of node: Full

Overview of the problem

I have been running a full node with bor and heimdall via Docker for the last 2 months, but the bor sync appears to have been stuck for the last 11 hours at block 0x312d050. I am getting the following logs from the bor Docker image:

bor                  | WARN [12-26|16:31:24.814] unable to handle whitelist milestone     err="missing blocks"
bor                  | INFO [12-26|16:31:36.814] Got new milestone from heimdall          start=51,584,847 end=51,584,869 hash=0x112ae9614d96a0db2fb572d324f1ca505983ef0b309b1c0970f698994964bb89
bor                  | WARN [12-26|16:31:36.815] unable to handle whitelist milestone     err="missing blocks"
bor                  | INFO [12-26|16:31:40.819] Got new checkpoint from heimdall         start=51,583,142 end=51,583,653 rootHash=0xbaa9de2414f3853a1be0556bd33ca614024e6a8b864940a482e2c84fa1527bf1
bor                  | WARN [12-26|16:31:40.819] Failed to whitelist checkpoint           err="missing blocks"
bor                  | WARN [12-26|16:31:40.819] unable to handle whitelist checkpoint    err="missing blocks"
bor                  | INFO [12-26|16:31:48.813] Got new milestone from heimdall          start=51,584,847 end=51,584,869 hash=0x112ae9614d96a0db2fb572d324f1ca505983ef0b309b1c0970f698994964bb89
bor                  | WARN [12-26|16:31:48.813] unable to handle whitelist milestone     err="missing blocks"
bor                  | INFO [12-26|16:32:00.814] Got new milestone from heimdall          start=51,584,847 end=51,584,869 hash=0x112ae9614d96a0db2fb572d324f1ca505983ef0b309b1c0970f698994964bb89
bor                  | WARN [12-26|16:32:00.815] unable to handle whitelist milestone     err="missing blocks"
bor                  | INFO [12-26|16:32:12.814] Got new milestone from heimdall          start=51,584,847 end=51,584,869 hash=0x112ae9614d96a0db2fb572d324f1ca505983ef0b309b1c0970f698994964bb89
bor                  | WARN [12-26|16:32:12.814] unable to handle whitelist milestone     err="missing blocks"
bor                  | INFO [12-26|16:32:24.814] Got new milestone from heimdall          start=51,584,870 end=51,584,892 hash=0xdef7276b17971f87470ffa0c516ec2a1de75fd12564106af2771f084d7bc63e8
bor                  | WARN [12-26|16:32:24.814] unable to handle whitelist milestone     err="missing blocks"
bor                  | INFO [12-26|16:32:36.815] Got new milestone from heimdall          start=51,584,870 end=51,584,892 hash=0xdef7276b17971f87470ffa0c516ec2a1de75fd12564106af2771f084d7bc63e8
bor                  | WARN [12-26|16:32:36.815] unable to handle whitelist milestone     err="missing blocks"
bor                  | WARN [12-26|16:32:44.111] Snapshot extension registration failed   peer=5f67ba47 err="peer connected on snap without compatible eth support"
bor                  | INFO [12-26|16:32:48.815] Got new milestone from heimdall          start=51,584,870 end=51,584,892 hash=0xdef7276b17971f87470ffa0c516ec2a1de75fd12564106af2771f084d7bc63e8
bor                  | WARN [12-26|16:32:48.815] unable to handle whitelist milestone     err="missing blocks"
bor                  | INFO [12-26|16:33:00.814] Got new milestone from heimdall          start=51,584,893 end=51,584,911 hash=0x72137465e871305c04cae0d017d60848a17b4f70caa302f7d3b8e55615a8ac54
bor                  | WARN [12-26|16:33:00.814] unable to handle whitelist milestone     err="missing blocks"

Any idea how I can fix it? I tried restarting the Docker image, but the error remains.
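The "missing blocks" warnings above indicate that bor has not yet imported the blocks covered by the milestones and checkpoints Heimdall is announcing, so a useful first check is whether the local chain head is advancing at all. Below is a minimal sketch (not an official tool) that polls eth_blockNumber over the standard JSON-RPC interface twice and reports whether any blocks were imported in between; the endpoint URL and the two-minute wait are assumptions to adjust for your Docker setup.

// stuckcheck.go: a sketch that polls bor's JSON-RPC endpoint and reports
// whether the latest block number is advancing. Assumes the HTTP RPC
// server is enabled and reachable at localhost:8545 (adjust for docker).
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"strconv"
	"time"
)

const rpcURL = "http://localhost:8545" // assumption: default bor HTTP RPC port

func latestBlock() (uint64, error) {
	payload := []byte(`{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}`)
	resp, err := http.Post(rpcURL, "application/json", bytes.NewReader(payload))
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()

	var out struct {
		Result string `json:"result"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return 0, err
	}
	// Result is a hex quantity such as "0x312d050"; base 0 handles the 0x prefix.
	return strconv.ParseUint(out.Result, 0, 64)
}

func main() {
	prev, err := latestBlock()
	if err != nil {
		fmt.Println("rpc error:", err)
		return
	}
	time.Sleep(2 * time.Minute)
	cur, err := latestBlock()
	if err != nil {
		fmt.Println("rpc error:", err)
		return
	}
	if cur > prev {
		fmt.Printf("node is importing blocks: %d -> %d\n", prev, cur)
	} else {
		fmt.Printf("node looks stuck at block %d (0x%x)\n", cur, cur)
	}
}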

bgiegel commented 3 months ago

I'm also still very much stuck on this. I'll add my experience in the hope that we find a pattern.

I initially had 2 nodes. They were getting stuck, so I started killing them automatically with a liveness probe (they run in k3s on a VM). The issue then started happening more often, so I decided to kill a node more quickly once it got stuck. But this apparently made the problem worse: at some point the nodes were restarting every 5 minutes. So I did the opposite and let a node keep running for several hours after it gets stuck, and somehow that made the problem less severe. Now my 2 prod nodes run fine for several hours a day, but I still get at least 1 to 2 restarts per day...
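A lag-tolerant probe is one way to avoid the restart loop described above: instead of failing as soon as the node stalls, it only reports unhealthy after the block height has been flat for a sustained window. Below is a minimal sketch (not an official tool) of such a check as a small sidecar exposing /healthz, which a Kubernetes httpGet livenessProbe could poll; the RPC URL, port, and 30-minute window are assumptions, not recommended values.

// healthz.go: a sketch of a lag-tolerant health endpoint for a liveness probe.
// It reports unhealthy only after the local bor block height has been flat for
// a sustained window, so short stalls do not trigger a restart.
package main

import (
	"bytes"
	"encoding/json"
	"net/http"
	"strconv"
	"sync"
	"time"
)

const (
	rpcURL       = "http://localhost:8545" // assumption: bor HTTP RPC endpoint
	checkEvery   = time.Minute
	stallTimeout = 30 * time.Minute // only report unhealthy after this long with no progress
)

var (
	mu          sync.Mutex
	lastHeight  uint64
	lastAdvance = time.Now()
)

func blockNumber() (uint64, error) {
	payload := []byte(`{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}`)
	resp, err := http.Post(rpcURL, "application/json", bytes.NewReader(payload))
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	var out struct {
		Result string `json:"result"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return 0, err
	}
	return strconv.ParseUint(out.Result, 0, 64)
}

func poll() {
	for {
		if h, err := blockNumber(); err == nil {
			mu.Lock()
			if h > lastHeight {
				lastHeight = h
				lastAdvance = time.Now()
			}
			mu.Unlock()
		}
		time.Sleep(checkEvery)
	}
}

func main() {
	go poll()
	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		mu.Lock()
		stalled := time.Since(lastAdvance) > stallTimeout
		mu.Unlock()
		if stalled {
			http.Error(w, "block height not advancing", http.StatusServiceUnavailable)
			return
		}
		w.Write([]byte("ok"))
	})
	http.ListenAndServe(":9090", nil) // point the k8s livenessProbe at this port
}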

I then tried to run a third node. I recreated it from scratch using the community-managed snapshots. Only one was fast enough to download a full bor mainnet snapshot in a reasonable amount of time: http://services.stakecraft.com/docs/snapshots/polygon-snapshot (it downloads at around 50 MB/s).

At first I used the default config that is generated automatically. This clearly doesn't work. I then tried 2 things:

The thing I noticed is that my new third node has more peers than my production Polygon nodes: 28 instead of 16. My new third node also seems more stable than the others, so I can confirm that this looks related to peers. But I don't know how to increase the number further... I don't even know what a reasonable number of peers is...
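For watching the peer count over time, the standard net_peerCount JSON-RPC method can be polled. A minimal sketch, again assuming the RPC endpoint is reachable at localhost:8545:

// peers.go: a sketch that prints the current peer count reported by bor
// via the standard net_peerCount JSON-RPC method.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"strconv"
)

func main() {
	payload := []byte(`{"jsonrpc":"2.0","method":"net_peerCount","params":[],"id":1}`)
	resp, err := http.Post("http://localhost:8545", "application/json", bytes.NewReader(payload))
	if err != nil {
		fmt.Println("rpc error:", err)
		return
	}
	defer resp.Body.Close()
	var out struct {
		Result string `json:"result"` // hex quantity, e.g. "0x1c" for 28 peers
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		fmt.Println("decode error:", err)
		return
	}
	n, _ := strconv.ParseUint(out.Result, 0, 64)
	fmt.Printf("connected peers: %d\n", n)
}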

Note : 

github-actions[bot] commented 3 months ago

This issue is stale because it has been open 14 days with no activity. Remove stale label or comment or this will be closed in 14 days.

Raneet10 commented 2 months ago

Hey folks, could you please try upgrading to v1.3.3? The release contains a number of p2p and sync fixes, which will be followed by some more patches in v1.3.4.

VSGic commented 2 months ago

I have upgraded to 1.3.3 and the issue is the same. I downloaded an old February snapshot and it got stuck approximately 4,000,000 blocks before the latest block, so I used a restart script to bring the node back up.

github-actions[bot] commented 2 months ago

This issue is stale because it has been open 14 days with no activity. Remove stale label or comment or this will be closed in 14 days.

github-actions[bot] commented 1 month ago

This issue was closed because it has been stalled for 28 days with no activity.