maticnetwork / bor

Official repository for the Polygon Blockchain
https://polygon.technology/
GNU Lesser General Public License v3.0

Bor syncing problem #1238

Closed martinboehm closed 2 months ago

martinboehm commented 4 months ago

Hi, we have two identical Debian 12 Linux servers running Heimdall + Bor mainnet (the latest versions, Bor 1.3.1; the described behaviour also happened with previous versions), with the following Bor config:

$BOR_BIN server \
  --chain $INSTALL_DIR/genesis.json \
  --syncmode full \
  --datadir $DATA_DIR \
  --bor.heimdall http://127.0.0.1:8173 \
  --bootnodes enode://76316d1cb93c8ed407d3332d595233401250d48f8fbb1d9c65bd18c0495eca1b43ec38ee0ea1c257c0abb7d1f25d649d359cdfe5a805842159cfe36c5f66b7e8@52.78.36.216:30303,enode://b8f1cc9c5d4403703fbf377116469667d2b1823c0daf16b7250aa576bacf399e42c3930ccfcb02c5df6879565a2b8931335565f0e8d3f8e72385ecf4a4bf160a@3.36.224.80:30303,enode://8729e0c825f3d9cad382555f3e46dcff21af323e89025a0e6312df541f4a9e73abfa562d64906f5e59c51fe6f0501b3e61b07979606c56329c020ed739910759@54.194.245.5:30303,enode://681ebac58d8dd2d8a6eef15329dfbad0ab960561524cf2dfde40ad646736fe5c244020f20b87e7c1520820bc625cfb487dd71d63a3a3bf0baea2dbb8ec7c79f1@34.240.245.39:30303 \
  --port 38372 \
  --http \
  --http.addr 0.0.0.0 \
  --http.port 8172 \
  --http.api eth,net,web3,debug,txpool,bor \
  --http.vhosts '*' \
  --http.corsdomain '*' \
  --ws \
  --ws.addr 0.0.0.0 \
  --ws.port 8072 \
  --ws.api eth,net,web3,debug,txpool,bor \
  --ws.origins '*' \
  --gcmode archive \
  --txlookuplimit 0 \
  --cache 4096
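(Added for reference, not part of the original report.) To tell a stalled node from a syncing one, one can poll the head over JSON-RPC; a minimal sketch, assuming the HTTP port 8172 from the config above (`extract_result` and `hex_to_dec` are illustrative helpers, not Bor tooling):

```shell
#!/bin/sh
# Minimal sketch for checking whether the node's head advances.

# Extract the hex "result" quantity from a JSON-RPC response on stdin.
extract_result() {
  sed -n 's/.*"result":"\(0x[0-9a-fA-F]*\)".*/\1/p'
}

# Convert a JSON-RPC hex quantity (e.g. "0x38a1b2c") to decimal.
hex_to_dec() {
  printf '%d\n' "$1"
}

# Requires a running node (port 8172 matches --http.port above):
#   curl -s -X POST -H 'Content-Type: application/json' \
#     --data '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}' \
#     http://127.0.0.1:8172 | extract_result
```

Running this twice a few minutes apart and comparing the decimal values shows whether the head is moving.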

One of the servers is running fine; the other is not syncing. When we restart Bor on the non-syncing server, it catches up to the top of the chain and then stops syncing, even when we let it run for hours. This behaviour repeats over and over. When syncing stops, the server log looks like this:

WARN [05-03|00:01:04.560] unable to handle whitelist milestone     err="missing blocks"
INFO [05-03|00:01:16.561] Got new milestone from heimdall          start=56,508,327 end=56,508,340 hash=0x5467c0eafa8d8f67949493964701299d49489daa17a3b594dffcf69a1a58d1e1
WARN [05-03|00:01:16.561] unable to handle whitelist milestone     err="missing blocks"
INFO [05-03|00:01:28.561] Got new milestone from heimdall          start=56,508,327 end=56,508,340 hash=0x5467c0eafa8d8f67949493964701299d49489daa17a3b594dffcf69a1a58d1e1
WARN [05-03|00:01:28.561] unable to handle whitelist milestone     err="missing blocks"
INFO [05-03|00:01:28.561] Got new checkpoint from heimdall         start=56,505,766 end=56,507,045 rootHash=0xe244eaa6954021086a513dabe343b5600fc9d40f28c40a23f59281f7dd28a59b

When we query the list of peers on each server, both report about 16 peers (each with a different list), but we are not able to transfer peers from the syncing server to the non-syncing one: Bor does not add the supplied peers to its list.
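(Added for reference, not part of the original report.) One way to hand-feed a healthy node's peers is the geth-style `admin.addPeer` over the IPC console; this is a sketch assuming bor, as a geth fork, exposes `bor attach` and the `admin` namespace there (the `admin` API is not in the `--http.api` list above, hence IPC). `is_enode` is an illustrative sanity check, not Bor tooling:

```shell
#!/bin/sh
# Sketch: sanity-check an enode URL before handing it to admin.addPeer.
# An enode public key is 128 hex characters.
is_enode() {
  printf '%s' "$1" | grep -Eq '^enode://[0-9a-fA-F]{128}@[0-9.]+:[0-9]+$'
}

# Requires a running node ($DATA_DIR as in the config above); assumes the
# geth-style attach console that bor inherits:
#   bor attach "$DATA_DIR/bor.ipc" --exec "admin.addPeer('enode://<pubkey>@<ip>:30303')"
```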

Could you please help us with this strange behaviour? Thank you in advance.

charleswong1025 commented 4 months ago

Hi, have you solved this problem? I have the same problem as you: https://github.com/maticnetwork/bor/issues/1239#issue-2279597974

martinboehm commented 4 months ago

@charleswong1025 No solution so far.

Igorarg91 commented 4 months ago

Having the same problem. A restart helps the node get synced, but after that the head lag starts rising again.

PekopT commented 4 months ago

Same here. Fix please!

manav2401 commented 4 months ago

Hey folks, could you please upgrade your bor to the latest beta version v1.3.2-beta-2 and see if it helps? Also, it would be good to increase the peer count using the --maxpeers flag; maybe set it to a higher value (e.g. 200). Thanks!

zhangxf55 commented 4 months ago

I have set up a new node with the latest release version and set maxpeers to 200, but the problem persists:

May 23 12:24:54 polygon bor[46683]: WARN [05-23|12:24:54.537] Failed to whitelist checkpoint           err="missing blocks"
May 23 12:24:54 polygon bor[46683]: WARN [05-23|12:24:54.537] unable to handle whitelist checkpoint    err="missing blocks"
May 23 12:24:55 polygon bor[46683]: INFO [05-23|12:24:55.493] Looking for peers                        peercount=0 tried=396 static=0
May 23 12:25:02 polygon bor[46683]: INFO [05-23|12:25:02.536] Got new milestone from heimdall          start=57,296,582 end=57,296,607 hash=0x4f42dd6dc1a723ff7fae04f41a9201b4f4109cb10a0f8429e35f5f95c6c99424
May 23 12:25:02 polygon bor[46683]: WARN [05-23|12:25:02.536] unable to handle whitelist milestone     err="missing blocks"
May 23 12:25:05 polygon bor[46683]: INFO [05-23|12:25:05.551] Looking for peers                        peercount=0 tried=398 static=0
May 23 12:25:14 polygon bor[46683]: INFO [05-23|12:25:14.535] Got new milestone from heimdall          start=57,296,582 end=57,296,607 hash=0x4f42dd6dc1a723ff7fae04f41a9201b4f4109cb10a0f8429e35f5f95c6c99424
May 23 12:25:14 polygon bor[46683]: WARN [05-23|12:25:14.535] unable to handle whitelist milestone     err="missing blocks"
May 23 12:25:15 polygon bor[46683]: INFO [05-23|12:25:15.655] Looking for peers                        peercount=0 tried=245 static=0
May 23 12:25:25 polygon bor[46683]: INFO [05-23|12:25:25.676] Looking for peers                        peercount=0 tried=447 static=0

My environments are:

root@polygon:~# bor version
Version: 1.3.2
GitCommit: bade7f57df5c09ae060c15fc66aed488c526149e
root@polygon:~# heimdalld version
1.0.5
root@polygon:~# uname -a
Linux polygon 5.15.0-107-generic #117-Ubuntu SMP Fri Apr 26 12:26:49 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux (Ubuntu 22.04)

My bootnodes are:

bootnodes = ["enode://b8f1cc9c5d4403703fbf377116469667d2b1823c0daf16b7250aa576bacf399e42c3930ccfcb02c5df6879565a2b8931335565f0e8d3f8e72385ecf4a4bf160a@3.36.224.80:30303", "enode://8729e0c825f3d9cad382555f3e46dcff21af323e89025a0e6312df541f4a9e73abfa562d64906f5e59c51fe6f0501b3e61b07979606c56329c020ed739910759@54.194.245.5:30303"]

zhangxf55 commented 3 months ago

Hey folks, could you please upgrade your bor to the latest beta version v1.3.2-beta-2 and see if it helps. Also, it would be good if you can increase the peer count using the --maxpeers flag. Maybe set it to a higher value (e.g. 200). Thanks!

I have set up a new node with the latest release version and set maxpeers to 200, but the problem persists; could you help? The sync speed is slow: it has taken 2 weeks and less than 40% has completed.

kingli-crypto commented 3 months ago

When we restart Bor on the non syncing server, it catches the top of the chain and then stops syncing, even when we let it run for hours.

We have the exact same behaviour, after restoring from a snapshot.

Tested v1.3.2-beta-2; the issue still exists.

manav2401 commented 3 months ago

Hey folks, the issues we're seeing fall broadly into 2 categories: nodes that aren't getting any healthy peers, and nodes that stop syncing randomly despite having good peers.

The former is not easily reproducible, as all our internal nodes and many partner nodes are working well. I'd suggest looking into the node config and adding more static/trusted peers if possible. The "missing blocks" logs you're seeing in bor are not the cause of the syncing issues, just an indicator. It's not possible to debug with this info alone, so please share debug/trace logs in this issue. We have added a new RPC/IPC call, debug_peerStats / debug.peerStats(), to get info about the peers your node is connected to. Please share its result if possible.

For the latter, please upgrade to bor v1.3.3-beta3 as it contains a potential fix.

Thanks!
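(Added note, not from the thread.) Since the opening report's config already exposes the debug namespace over HTTP, the debug_peerStats call mentioned above can presumably also be reached with plain curl; a hedged sketch, where `rpc_payload` is an illustrative helper:

```shell
#!/bin/sh
# Sketch: build a JSON-RPC payload for a parameterless method.
rpc_payload() {
  printf '{"jsonrpc":"2.0","method":"%s","params":[],"id":1}' "$1"
}

# Requires a running node with the debug namespace exposed over HTTP
# (as in the opening report's --http.api list):
#   curl -s -X POST -H 'Content-Type: application/json' \
#     --data "$(rpc_payload debug_peerStats)" http://127.0.0.1:8172
```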

hdiass commented 3 months ago

Hello, we have been having this issue quite a lot. Digging deeper, we added some static peers that we knew were healthy and the issue went away, so my experience suggests the problem is somewhere in p2p. I don't have enough info to say whether it's a lot of bad peers in the network, something being triggered by specific peers, or just bad peer search, but I guess this is a good starting point for investigation.
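(Added note, not from the thread.) For reference, pinning known-healthy peers can also be done in bor's config.toml; this is a sketch assuming the [p2p.discovery] layout of recent bor releases, with placeholder enode URLs, not real peers:

```toml
# Sketch of a config.toml fragment pinning known-healthy peers.
# Key names assume the [p2p.discovery] layout of recent bor releases;
# the enode URLs are placeholders.
[p2p]
  [p2p.discovery]
    static-nodes = [
      "enode://<pubkey-of-healthy-peer-1>@<ip>:30303",
      "enode://<pubkey-of-healthy-peer-2>@<ip>:30303",
    ]
```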

zhangxf55 commented 3 months ago

Hello, we have been having this issue quite a lot, going deeper into investigation we added some static peers that we knew they were healthy and the issue is gone, so my experience leads me that problem is somewhere on p2p. I don't have enough info if its a lot of bad peers in the network or if something is triggering this from specific peers or just bad peer search, but i guess this is a good starting point for investigation.

I have also been troubled by this issue for a long time. Could you share your peer information to help me temporarily resolve this problem?

marcello33 commented 3 months ago

v1.3.3 was released as a stable tag yesterday, in case you folks want to update. We'll keep working on improvements and potential bug fixes w.r.t. sync problems in the upcoming v1.3.4. Thanks for your patience.

github-actions[bot] commented 3 months ago

This issue is stale because it has been open 14 days with no activity. Remove stale label or comment or this will be closed in 14 days.

github-actions[bot] commented 2 months ago

This issue was closed because it has been stalled for 28 days with no activity.

cshintov commented 2 months ago

@manav2401

Please re-open this. I'm facing the exact same issue. I have four nodes, 2 in the EU and 2 in the US; only one in the EU shows this behaviour.

I enabled debug logging and got this:

DEBUG[07-12|18:40:37.918] Recovered from EVM failure.              Error:="Found dependency"
DEBUG[07-12|18:40:37.918] Recovered from EVM failure.              Error:="Found dependency"
DEBUG[07-12|18:40:37.918] Recovered from EVM failure.              Error:="Found dependency"
DEBUG[07-12|18:40:37.918] Recovered from EVM failure.              Error:="Found dependency"
DEBUG[07-12|18:40:37.963] Dereferenced trie from memory database   nodes=2382 size=1.11MiB    time=16.494309ms gcnodes=1,032,523 gcsize=428.34MiB gctime=4.915930839s livenodes=1,097,249 livesize=371.24MiB
DEBUG[07-12|18:40:37.964] Inserted new block                       number=59,280,236 hash=daf8a1..94a7d6 uncles=0 txs=83   gas=10,124,138 elapsed=116.228ms root=b0eccf..143a7e
DEBUG[07-12|18:40:38.016] IP exceeds bucket limit                  ip=51.38.72.15
DEBUG[07-12|18:40:38.036] blockstm exec summary                    execs=85  success=85  aborts=0  validations=85  failures=0  #tasks/#execs=100.00%
DEBUG[07-12|18:40:38.095] Dereferenced trie from memory database   nodes=1075 size=461.26KiB  time=7.589081ms  gcnodes=1,033,598 gcsize=428.79MiB gctime=4.92351855s  livenodes=1,098,890 livesize=371.81MiB
INFO [07-12|18:40:38.096] Imported new chain segment               number=59,280,237 hash=c8f9a3..9aa886 blocks=18  txs=1412 mgas=325.242 elapsed=4.357s    mgasps=74.647 snapdiffs=4.89MiB triedirty=481.21MiB
DEBUG[07-12|18:40:38.096] Inserted new block                       number=59,280,237 hash=c8f9a3..9aa886 uncles=0 txs=85   gas=14,977,668 elapsed=130.632ms root=ed7d84..6cb34b
DEBUG[07-12|18:40:38.096] Synchronisation terminated               elapsed=4m20.862s err=nil
DEBUG[07-12|18:40:38.096] Announced block                          hash=c8f9a3..9aa886 recipients=33 duration=2562047h47m16.854s
DEBUG[07-12|18:40:38.109] Reinjecting stale transactions           count=0

Notice the

DEBUG[07-12|18:40:38.096] Synchronisation terminated               elapsed=4m20.862s err=nil

After this it stops syncing... obviously! But why? Basically it catches up to the tip and then immediately stops!

trace logs below.

alloc_logs_polygon-mainnet-bor.stderr.0.tar.gz
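(My addition, not from the thread.) As a stopgap while the root cause is investigated, a watchdog that restarts bor whenever the head stops advancing can keep the node usable; a sketch assuming a systemd unit named "bor" and decimal block heights as inputs (`stalled` and `poll_head` are illustrative names):

```shell
#!/bin/sh
# Sketch: decide whether the node has stalled between two polls.
# Inputs are decimal block heights from consecutive eth_blockNumber calls.
stalled() {
  [ "$2" -le "$1" ]
}

# Watchdog loop (requires a running node and a systemd unit named "bor"):
#   prev=0
#   while sleep 300; do
#     cur=$(poll_head)    # hypothetical helper wrapping eth_blockNumber
#     if stalled "$prev" "$cur"; then systemctl restart bor; fi
#     prev=$cur
#   done
```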

cshintov commented 2 months ago

Potential fix. https://github.com/maticnetwork/bor/issues/1239#issuecomment-2235835196