Bor sync stuck at block 0x312d050

eldimious commented 10 months ago

System information

Bor client version: 1.2.1

Heimdall client version: 1.0.3

OS & Version: Linux

Environment: Polygon Mainnet

Type of node: Full

Overview of the problem

I am running a full node using bor and heimdall via docker the last 2 months but seems that the bor sync stucks 11h ago at block 0x312d050. I am getting following logs from bor docker image:

bor                  | WARN [12-26|16:31:24.814] unable to handle whitelist milestone     err="missing blocks"
bor                  | INFO [12-26|16:31:36.814] Got new milestone from heimdall          start=51,584,847 end=51,584,869 hash=0x112ae9614d96a0db2fb572d324f1ca505983ef0b309b1c0970f698994964bb89
bor                  | WARN [12-26|16:31:36.815] unable to handle whitelist milestone     err="missing blocks"
bor                  | INFO [12-26|16:31:40.819] Got new checkpoint from heimdall         start=51,583,142 end=51,583,653 rootHash=0xbaa9de2414f3853a1be0556bd33ca614024e6a8b864940a482e2c84fa1527bf1
bor                  | WARN [12-26|16:31:40.819] Failed to whitelist checkpoint           err="missing blocks"
bor                  | WARN [12-26|16:31:40.819] unable to handle whitelist checkpoint    err="missing blocks"
bor                  | INFO [12-26|16:31:48.813] Got new milestone from heimdall          start=51,584,847 end=51,584,869 hash=0x112ae9614d96a0db2fb572d324f1ca505983ef0b309b1c0970f698994964bb89
bor                  | WARN [12-26|16:31:48.813] unable to handle whitelist milestone     err="missing blocks"
bor                  | INFO [12-26|16:32:00.814] Got new milestone from heimdall          start=51,584,847 end=51,584,869 hash=0x112ae9614d96a0db2fb572d324f1ca505983ef0b309b1c0970f698994964bb89
bor                  | WARN [12-26|16:32:00.815] unable to handle whitelist milestone     err="missing blocks"
bor                  | INFO [12-26|16:32:12.814] Got new milestone from heimdall          start=51,584,847 end=51,584,869 hash=0x112ae9614d96a0db2fb572d324f1ca505983ef0b309b1c0970f698994964bb89
bor                  | WARN [12-26|16:32:12.814] unable to handle whitelist milestone     err="missing blocks"
bor                  | INFO [12-26|16:32:24.814] Got new milestone from heimdall          start=51,584,870 end=51,584,892 hash=0xdef7276b17971f87470ffa0c516ec2a1de75fd12564106af2771f084d7bc63e8
bor                  | WARN [12-26|16:32:24.814] unable to handle whitelist milestone     err="missing blocks"
bor                  | INFO [12-26|16:32:36.815] Got new milestone from heimdall          start=51,584,870 end=51,584,892 hash=0xdef7276b17971f87470ffa0c516ec2a1de75fd12564106af2771f084d7bc63e8
bor                  | WARN [12-26|16:32:36.815] unable to handle whitelist milestone     err="missing blocks"
bor                  | WARN [12-26|16:32:44.111] Snapshot extension registration failed   peer=5f67ba47 err="peer connected on snap without compatible eth support"
bor                  | INFO [12-26|16:32:48.815] Got new milestone from heimdall          start=51,584,870 end=51,584,892 hash=0xdef7276b17971f87470ffa0c516ec2a1de75fd12564106af2771f084d7bc63e8
bor                  | WARN [12-26|16:32:48.815] unable to handle whitelist milestone     err="missing blocks"
bor                  | INFO [12-26|16:33:00.814] Got new milestone from heimdall          start=51,584,893 end=51,584,911 hash=0x72137465e871305c04cae0d017d60848a17b4f70caa302f7d3b8e55615a8ac54
bor                  | WARN [12-26|16:33:00.814] unable to handle whitelist milestone     err="missing blocks"

Any idea how can i fix it? I tried to restart docker image but the error remains.

psmahlii commented 10 months ago

same

eldimious commented 10 months ago

I tried also to debug.setHead to change head to some blocks (1000+ behind) and for some hours started sync again, but then stopped again.

Any idea how to fix it?

GeassV commented 10 months ago

Same issue. Same version for bor and heimdall. Although stuck at a different height(51640301), constantly getting WARN log with Dec 28 14:30:33 203078 bor[1828860]: WARN [12-28|14:30:33.977] unable to handle whitelist milestone err="missing blocks"

0xKrishna commented 10 months ago

Can you check your peers using ipc and admin.peers command ?

eldimious commented 10 months ago

@0xKrishna

enode://11e0cbb03a834019b0222f54bccf32512bef4294dd722642684762d1d01c84031c1075767195d9968dcdb9e38326f08b14547d8e33b0b67a0ef1aa0b045845d0@35.171.120.130:30303?discport=30315,
enode://b0f026f7ccfd5c1450e933572ae44b262a7d084647a30d0a8d9e2c8cab8d5b1c7721f3c60bfcd50c0fede114c7e2d316649389ba2449ca85d1ddd9e2947f1c28@147.135.100.106:30303?discport=30334,
enode://2d4bd1fa38182fa868a583fc946c8d5e4043b013381cf20927c16cf8f17b4f3e793c5e9f34fc785c52d887aab07181bdb0ebae50d9e3f05e5c14aed19f81929a@65.108.127.87:30303?discport=30340,
enode://ab879b4eaacf495ec760f2806e78509da80e327ba4262d8153698f88b0a95287a692bbaf3a3cece9ad27f889246c04e2b5ca8e75bf083acbb4806eb669cc3a77@35.171.120.130:30303?discport=30334,
enode://1a69f7dae12959a358b92a395ec79de2ab4601a59a5b0b951d4e6247da2101d7d6d77a919086251e70b552a49ae74d630e19233306a189a1b627c2115ecf3cfa@34.203.27.246:30303?discport=30320,
enode://574a9195f40a7c4bd68536167ef53a7385bab8934dfc8db94d013b1a73af76eb73f148536cb8b8365e8240728f6e80af0ddb4ead3a2544de907cce561839ce61@51.81.217.117:30303?discport=30323,
enode://142cce22e125325f4895b2268e32185f5dbe90f9c818ab135f16c7face23a55b46d0b78a0286595a262d4fa58ff314e7e2553e13f528a3c3e9616184b77f5b85@65.108.127.87:30303?discport=30323,
enode://50c8f9d2849a209383edd15dfd67ba0a8d3f5e9853fd1af9c1678f4aef2dc5e3817c34ddce9390d5e8dd4891ad7f66003a3bea5af9e288df6f26ed070d9bd741@54.38.217.112:30303?discport=30335,
enode://72be2da5ba01bc2f3a7764bf1d4f18550a36df629820ea0f6d37fe1cd1355d0f1c201b2a5f382e794ee56e0f5befa504e85e96548a45a0fba44bb6bd1075e28e@54.38.155.225:30303?discport=30306,
enode://53b53f55f2a1674873f8f58ee23616db8384f278a1206cf79c8c18d4ebc32b4424128229de2ea999803c08c9262974f1fb1f2b0d87ca6ec40aea1594c0ba0ef7@65.108.1.189:30303?discport=30337,
enode://eb0ee5596ea6df526eb7e0ace41f015bcb9ee4f27996c72ea15d1cd28ec69f89b6e64247696c0150111b52ca58810f5d0f42d59ac38fdb26ba7323bcc835475a@51.81.196.100:30303?discport=30313,
enode://c4a2a7c422ddce70a39164ce53762262bd5dc8917f5613b1c92c94affb36516e63f88721763a1dcfed5f36403e0fc21894e34c2981f2f6f1f100b9f186a986a1@51.38.72.15:30303?discport=30307,
enode://2197472b27c39587e2ae2c199e91527a25d25b2c1217f14c8d8b342068209a889913c7c1eb6f60044a0d28bd59ccec157d18ebb7918293e8878d11185831cf22@54.38.75.21:30303?discport=30320,
enode://b6d9bef47ce86b94331cdcfd2a1a91f28ab48db171aa70659973b3869988e7e4806fd24406c6f57187664643dffc0edf74e7a16ac315ca7933589357ec875550@51.38.72.15:30303?discport=30311,
enode://4585b746a2ae2f74575313199bd35159e8b679608fa1bd4e3a2823c0c24f8e49f9cb1e0c312de30a8b08c16a6666101897ffff47a6c162dca6ddb87c206c4cd2@66.70.233.151:30303?discport=30313,
enode://c8ab3d6ec8d7c1c7df462f55f02acaced2949ec4542475fa25ebb104feaa78a196f0e39cfc2bf1236ead1c647b734726cb9f4f03eb933c94f318cca160e5ce16@54.38.217.112:30303?discport=30334

GeassV commented 10 months ago

Can you check your peers using ipc and admin.peers command ?

Sure. I rolled back to bor v1.1.0 from v1.2.1 because some issues said that rollingback might be a solution. and then found an interesting performance every time I restart bor service, it syncs for a while and then stuck with the above "whitelist milestone" log.

> admin.peers [{ caps: ["eth/66", "eth/67", "eth/68", "snap/1"], enode: "enode://6de3bbba54699dcc11b982c7970fdd938946d3638bab27d9006698b998447cf838891a310c61d6c74d042091366ba07690ef1f09a026fab28a31a06cd387b67b@13.57.125.97:30321", enr: "enr:-KO4QIY12LW3IWDW2JzqMdtg9Pyv7PEASdnlLFAEzUEuzOgVEvW5hWe2EB_Jd6iqKnRHi_SyP1INx6iDk3a6CMoyqOqGAYvR7YGvg2V0aMfGhNwIhlyAgmlkgnY0gmlwhA05fWGJc2VjcDI1NmsxoQNt47u6VGmdzBG5gseXD92TiUbTY4urJ9kAZpi5mER8-IRzbmFwwIN0Y3CCdnGDdWRwgnZx", id: "136d74cf29e85b49f991b1d97b5800f1a45968b0542642c47c970c1502762313", name: "bor/v1.1.0/linux-amd64/go1.20.10", network: { inbound: false, localAddress: "172.18.35.78:37836", remoteAddress: "13.57.125.97:30321", static: false, trusted: false }, protocols: { eth: { version: 68 }, snap: { version: 1 } } }, { caps: ["eth/66", "eth/67", "eth/68", "snap/1"], enode: "enode://b8187a46754cdf631d67b89e3e73d5e061ab2ce5a62cc8a79cfd754b04dc5394b381f1d99d59a8b6baeb68b4c019512b59dcbdc0cb682320f96508331cf8e8f3@54.38.217.112:30303?discport=30324", id: "1c405a70749de50ea441c6c59c07e7d4dde5e18f47102a20b88db98cddcbb6a2", name: "bor/v1.0.6/linux-amd64/go1.20.8", network: { inbound: false, localAddress: "172.18.35.78:51320", remoteAddress: "54.38.217.112:30303", static: false, trusted: false }, protocols: { eth: { version: 68 }, snap: { version: 1 } } }, { caps: ["eth/66", "eth/67", "eth/68", "snap/1"], enode: "enode://256fe3efb2f83e4821f4d028273757e525da48bb69a3da5c4230a410d5b96e948a79ae42e60a4914092249ee3bb928756534c67b6c3003f0d08a180373735edc@65.108.1.189:30303?discport=30395", enr: "enr:-KO4QHQlnI0aegmfJbdsiPIskZywzNjBmulaKf9scy3wuCR_XirUnjEjwSsDfjJe40LWodLNpjLDW48N4MtdFEXOXh6GAYx2yUm_g2V0aMfGhNwIhlyAgmlkgnY0gmlwhEFsAb2Jc2VjcDI1NmsxoQIlb-Pvsvg-SCH00CgnN1flJdpIu2mj2lxCMKQQ1blulIRzbmFwwIN0Y3CCdl-DdWRwgna7", id: "3e8f038a2af1414377f24cacf7e6591b4007c60b8de292b7bec24d7a27cd9c49", name: "bor/v1.0.6/linux-amd64/go1.20.8", network: { inbound: false, localAddress: "172.18.35.78:52390", remoteAddress: "65.108.1.189:30303", static: false, trusted: false }, protocols: { eth: { version: 68 }, snap: { version: 1 } } }, { caps: ["eth/66", "eth/67", "eth/68", "snap/1"], enode: "enode://2cd2be98b78f486171994f32ca995f4d53a783172f360a9224181c3cb1b487bd88e95658cb05405642ee2455fc31ae0919f8b2699cc02ed9ed2aef09b9fc93c2@54.38.216.84:30303?discport=30331", enr: "enr:-KO4QN1KbAC8kuy161pxm8kHqtI8VMjk9cQjVFJT4s6TH3G-LJK4QAdY7LqugQ8Yt8-hYUzFDrqoaMFR3xQVhQHoH46GAYyGmlAzg2V0aMfGhNwIhlyAgmlkgnY0gmlwhDYm2FSJc2VjcDI1NmsxoQIs0r6Yt49IYXGZTzLKmV9NU6eDFy82CpIkGBw8sbSHvYRzbmFwwIN0Y3CCdl-DdWRwgnZ7", id: "496c218828d2d1864a9e228e7ad33a481ae60acb81becfb2e565053f4e1f1a5c", name: "bor/v1.0.6/linux-amd64/go1.20.8", network: { inbound: false, localAddress: "172.18.35.78:47924", remoteAddress: "54.38.216.84:30303", static: false, trusted: false }, protocols: { eth: { version: 68 }, snap: { version: 1 } } }, { caps: ["eth/66", "eth/67", "eth/68", "snap/1"], enode: "enode://994252f3fbe56302ba967cab1f01fada30ef8fdb335e6f974a55dd258c2052d1c8c7f181c147d3958ca7e5c7aec76f4f316f50891b137dcbcfd811e453f9d8cc@135.125.214.37:30303?discport=30340", id: "6bcba20976d073441dfdda8631ddf8fc0db9056e00485e8fe49717dac36560df", name: "bor/v1.0.6/linux-amd64/go1.20.8", network: { inbound: false, localAddress: "172.18.35.78:44738", remoteAddress: "135.125.214.37:30303", static: false, trusted: false }, protocols: { eth: { version: 68 }, snap: { version: 1 } } }, { caps: ["eth/66", "eth/67", "eth/68", "snap/1"], enode: "enode://29e354ff99595687d321d44b72c0e458f481046edd8d18fc5db69df0d61a44068ce9c715d74651d7c635688962f54251af861b13e5b31b4da54bb2c9f05ac794@5.9.87.183:30303?discport=30495", enr: "enr:-KO4QKwM2X_BENPlgEwVZ9SQjAMLtFF1dbJe9lmJ7eW42ai2R7ZAQ6Gc4Xzy2_BJOXsA8sESHmXeLvCGIINbAqjPxDWGAYyF1OC3g2V0aMfGhNwIhlyAgmlkgnY0gmlwhAUJV7eJc2VjcDI1NmsxoQIp41T_mVlWh9Mh1EtywORY9IEEbt2NGPxdtp3w1hpEBoRzbmFwwIN0Y3CCdl-DdWRwgncf", id: "6f1be92e4e8cb5f36e2d2e988d60d492a5992524258fab93ae146a335a8f690a", name: "bor/v1.0.6/linux-amd64/go1.20.8", network: { inbound: false, localAddress: "172.18.35.78:54748", remoteAddress: "5.9.87.183:30303", static: false, trusted: false }, protocols: { eth: { version: 68 }, snap: { version: 1 } } }, { caps: ["eth/66", "eth/67", "eth/68", "snap/1"], enode: "enode://b9e2f920d31ea6cde2ad56fcd1904455d911ccf58201551c22d41c28f5a1b1d20a67c8db30893651d8a47bfe21a95705505c079892290a8cfad06f1b8c425628@44.221.198.244:30303?discport=30316", id: "7752490f98a21bde471c9151b7bfe28347cf83a0813a9fe6e66320ae63152f5b", name: "bor/v1.0.6/linux-amd64/go1.20.8", network: { inbound: false, localAddress: "172.18.35.78:41940", remoteAddress: "44.221.198.244:30303", static: false, trusted: false }, protocols: { eth: { version: 68 }, snap: { version: 1 } } }, { caps: ["eth/66", "eth/67", "eth/68", "snap/1"], enode: "enode://9f1443433c1b1b79ccc2d95f314c4e0823d0b549d1db43e5e0a2fe3a87fdaeb2d693fa4a8e75fd6a77c2917598d91782fb75b8fc6357c4f13073653894418acb@66.70.207.63:30303?discport=30309", id: "8df6a54d5bc8fcac07f8ece1d738414190fc9fe3400776abb33471b9ead46344", name: "bor/v1.0.6/linux-amd64/go1.20.8", network: { inbound: false, localAddress: "172.18.35.78:53838", remoteAddress: "66.70.207.63:30303", static: false, trusted: false }, protocols: { eth: { version: 68 }, snap: { version: 1 } } }, { caps: ["eth/66", "eth/67", "eth/68", "snap/1"], enode: "enode://6668bb0a2ede7963ebc196f5e2c8e4daf480a1b7510b74ad18491d733ccf32ab754b44422e4d40fb88c996a3d33fa08dc96461d77693c4a7976cadef4340ca71@148.113.163.85:30303?discport=30309", id: "8e60fc39583410b077016422c96f36ecc60f077a4910a8848917dd1e5856c4e4", name: "bor/v1.0.6/linux-amd64/go1.20.8", network: { inbound: false, localAddress: "172.18.35.78:36704", remoteAddress: "148.113.163.85:30303", static: false, trusted: false }, protocols: { eth: { version: 68 }, snap: { version: 1 } } }, { caps: ["eth/66", "eth/67", "eth/68", "snap/1"], enode: "enode://298ba98e471a44af8638c297d4f25060119817d20cd49870717cfef0f92d3d3d1e3039b1b5fcd34ef66e5ef97efefb9d38e68eed20d1eec5929dfc422a3731e9@3.219.138.93:30306", id: "90871a5e7b702d78f49f829b75d44728628d6a0448d2e128dee96d3e8a39383e", name: "bor/v1.1.0/linux-amd64/go1.20.10", network: { inbound: false, localAddress: "172.18.35.78:39982", remoteAddress: "3.219.138.93:30306", static: false, trusted: false }, protocols: { eth: { version: 68 }, snap: { version: 1 } } }, { caps: ["eth/66", "eth/67", "eth/68", "snap/1"], enode: "enode://66153dd3af7f793158934d9bd121f68e1e8c5a4c15d3316f2e222e6743f8a46fb02a3b6e70181521c0f82584ebd8b690fcf7c3056d5b78293f1bbe065f038ed9@54.235.96.140:30306", id: "93c951775b564631f98affc9e4539b91daa825e350de64a3a0b760a65d0a7826", name: "bor/v1.1.0/linux-amd64/go1.20.10", network: { inbound: false, localAddress: "172.18.35.78:37624", remoteAddress: "54.235.96.140:30306", static: false, trusted: false }, protocols: { eth: { version: 68 }, snap: { version: 1 } } }, { caps: ["eth/66", "eth/67", "eth/68", "snap/1"], enode: "enode://697850d0a936d1d63d047ce480e6f39f429f2c33cfeec335526fb1e97aa0a11a43065bad4b0e8223ca053f91307a0a672d79586c4efdb81f531122116e6d132f@15.204.47.194:30303?discport=30340", id: "96b764ec1ca7771bdb60b464e498824b22dfc7c7cd8d8a3c28cb9ce4241d72dc", name: "bor/v1.0.6/linux-amd64/go1.20.8", network: { inbound: false, localAddress: "172.18.35.78:33882", remoteAddress: "15.204.47.194:30303", static: false, trusted: false }, protocols: { eth: { version: 68 }, snap: { version: 1 } } }, { caps: ["eth/66", "eth/67", "eth/68", "snap/1"], enode: "enode://a34a45e54b28eef5cc58e66a932471ffa3d914af052346b423117972aa957d0816f79492e657ccf1f356713f5959274d5f39573acde4d64e00a656ae999f0a30@65.108.127.87:30303?discport=30376", id: "9ede61e13d949a6ff325274262cf677d16093daf8be60c441707c8ba047526d3", name: "bor/v1.0.6/linux-amd64/go1.20.8", network: { inbound: false, localAddress: "172.18.35.78:47926", remoteAddress: "65.108.127.87:30303", static: false, trusted: false }, protocols: { eth: { version: 68 }, snap: { version: 1 } } }, { caps: ["eth/66", "eth/67", "eth/68", "snap/1"], enode: "enode://af51799ca42c94ff9db93aa933dad4d7ae5979153658df2a38f90c38654391f8a929c8d6af7cb04ea151f009a2b163d6458a71662d512adf1d300ea49107738f@5.9.87.183:30303?discport=30432", id: "a51dc5db9ffc3dbd5b5c67ed1925a486788b5e7668ca0c624b31468b4090f000", name: "bor/v1.0.6/linux-amd64/go1.20.8", network: { inbound: false, localAddress: "172.18.35.78:40188", remoteAddress: "5.9.87.183:30303", static: false, trusted: false }, protocols: { eth: { version: 68 }, snap: { version: 1 } } }, { caps: ["eth/66", "eth/67", "eth/68"], enode: "enode://76d2d6284ee5637113e3669e0fdff0fca83535e39ee0752b9338d9e306aad3f9b4db4c8e4e8738ad718c0f442daf96a37fc864d73954f931dd3c2b3d85663766@3.239.87.70:56304", id: "c0506599f03d41572ecbc8ea45b6eee0192c622eccd7d614d3bb9a3fb19e2548", name: "Geth/v1.1.8/linux-amd64/go1.20", network: { inbound: true, localAddress: "172.18.35.78:30303", remoteAddress: "3.239.87.70:56304", static: false, trusted: false }, protocols: { eth: { version: 68 } } }, { caps: ["eth/66", "eth/67", "eth/68"], enode: "enode://e6ddc59f7f585019b428a3a076a55a2ef1401926434f798b9fb29abb5502a6b33698bfba0420642132a959051f5e417af9abf6d67dc87d8e6f8e88acdbe1532b@54.90.91.58:34482", id: "d85b17d766b71531af5a5a57065ad2baef16f75df801e34ac3e446c9ea02470d", name: "Geth/v1.1.8/linux-amd64/go1.20", network: { inbound: true, localAddress: "172.18.35.78:30303", remoteAddress: "54.90.91.58:34482", static: false, trusted: false }, protocols: { eth: { version: 68 } } }]

eldimious commented 10 months ago

Any idea how can we solve the issue?

eldimious commented 10 months ago

Ι tried to apply https://forum.polygon.technology/t/recommended-peer-settings-for-mainnet-nodes/13018 [p2p.discovery] i will let you know if this resolves the issue

eldimious commented 10 months ago

Above suggestions are not fixing the issue. Any other suggestion?

GeassV commented 10 months ago

Above suggestions are not fixing the issue. Any other suggestion?

no luck. Tried a new physical machine with bor 1.1.0 and Heimdall 1.0.3 with snapshot data. All over again. Stuck randomly. The original one with weeks of manual restarts, finally went well for half month, not sure why, and afraid of unexpected stuck someday

maoueh commented 9 months ago

@0xKrishna I think I might have hit the same problem on two nodes. The first node stop importing blocks ~8d the other around 2 hours ago.

Node Stopped 2 hours ago (Stopped 2024-01-16 @ 18:30:00 EST)

I have the pprof Goroutine dump for it, see pprof.geth.goroutine.polygon-mainnet-0.pb.gz. It seems to be blocked at https://github.com/maticnetwork/bor/blob/master/core/blockchain.go#L1888.

Node Stopped 8 days ago (Stopped 2024-01-09 @ 12:00:00 EST)

I have a pprof too, see pprof.geth.goroutine.polygon-mainnet-1.pb.gz. On this one I don't clearly see what is blocked. I don't even seems to see the blockchain import goroutine there, so not sure what it was doing.

For this dump, I have a bor attach of admin.nodeInfo and admin.peers, see pprof-polygon-mainnet-1-attach-nodeIndo-peer.txt.

Let me know if you need more info, I'll more closely follow the nodes to see if they get stuck again so I could gather extra data points.

Extra Details

I tried to stopped this node cleanly, sending a single SIGINT signal, then waited for 4 hours to stop cleanly but it never happened. I decided to force killing which means in this state, this stuck node never completed the clean shutdown sequence.

VSGic commented 9 months ago

Same issue on two independent nodes, random block stuck with ERROR: heimdalld[14653]: ERROR[2024-01-20|20:45:38.152] Span proposed is not in-turn module=bor currentChildBlock=52556670 msgStartblock=52563456 msgEndBlock=52569855

0xsharma commented 9 months ago

Hey @eldimious @VSGic @maoueh @GeassV ,

We can ignore unable to handle whitelist milestone logs. We are working on suppressing these logs to DEBUG.
I can see your network is peered.
Please downgrade your bor node to v1.1.0 and heimdall to v1.0.3.
Try to restart the clients.
If the issue persists. Please attach a log dump ( or copy last 200 lines of log ) and configuration used to start the nodes.

Thank you! 💜

GeassV commented 9 months ago

Hey @eldimious @VSGic @maoueh @GeassV ,

We can ignore unable to handle whitelist milestone logs. We are working on suppressing these logs to DEBUG.

I can see your network is peered.

Please downgrade your bor node to v1.1.0 and heimdall to v1.0.3.

Try to restart the clients.

If the issue persists. Please attach a log dump ( or copy last 200 lines of log ) and configuration used to start the nodes.

Thank you! 💜

well, stuck at 52755409 and then moved to 52756404 and stuck again when trying to dump the log and config files bor version 1.1.0 and heimdall v1.0.3 attached are the log and config: output_24_1_26.log bor_config.txt

VSGic commented 9 months ago

Hey @eldimious @VSGic @maoueh @GeassV ,

We can ignore unable to handle whitelist milestone logs. We are working on suppressing these logs to DEBUG.

I can see your network is peered.

Please downgrade your bor node to v1.1.0 and heimdall to v1.0.3.

Try to restart the clients.

If the issue persists. Please attach a log dump ( or copy last 200 lines of log ) and configuration used to start the nodes.

Thank you! 💜

Hello, the same problem after downgrade. Regular restart needed attached log and config config_bor.txt out_bor.log

RyanWang0811 commented 9 months ago

Hello,

I have the same issue. The bor node is stuck at block number 52962568.

bor v1.1.0 
heimdall  v1.0.3

I tried to restart the bor node, but it took a long time to try to stop.

Finally, it was killed by systemd for 'stop-sigterm' timed out.

After starting, the block number rolls back to 52921882, it far away from the stuck block number 52962568.

inapeace0 commented 9 months ago

Same here.

VAIBHAVJINDAL3012 commented 9 months ago

@CaCaBlocker You can ignore these logs for now as your node is not completely synced.

VAIBHAVJINDAL3012 commented 9 months ago

@RyanWang0811 Is it working now?

RyanWang0811 commented 9 months ago

It is working now. thx.

VSGic commented 9 months ago

Hi, still have this problem, I restart bor 3-5 times per day

RyanWang0811 commented 9 months ago

Still have this problem, too.

This issue is like what I posted previously and the issue seems not to have been repaired or still has any issue. https://github.com/maticnetwork/bor/issues/939

Is it a node bug? or any issue on the chain?

VSGic commented 8 months ago

Problem still actual, two nodes with different bor versions struggle

Raneet10 commented 8 months ago

Hey @RyanWang0811 @VSGic what specific errors are you facing currently ? Can you share some logs ? Also have you upgraded to bor v1.2.3 ?

VSGic commented 8 months ago

Hello @Raneet10 I have posted logs above here. I have two nodes, one with bor v1.2.3 , and it have the same problem

Excalibur-1 commented 8 months ago

I encountered this problem using the latest version on the testnet, and there is no solution yet。heimdall:v1.0.4-beta,bor:v1.2.6-beta

bgiegel commented 8 months ago

Hello !

Just wanted to mention that we are experiencing the same issues with our 2 polygon bor nodes. I have setup a liveness probe (k8s) to restart the node if it get stuck for more than 15 min. It kinda work but it’s really annoying and we still manage to have small interruptions when both nodes get stuck at the same moment. It happens multiple times per day. It’s really bad.

Anything planned to fix those issues ?

By the way I compared the errors I got in Heimdall and bor logs while it was stuck on a block to the logs I had on the other node that was working. And I found exactly the same error in both. So the issue for sure is not being logged...

VSGic commented 8 months ago

Hello, still have this trouble, we cannot send transactions with such node. They are get lost, when node out of sync. We work with polygon in manual regime

VSGic commented 8 months ago

Hi, still actual, and become worse, one node even cannot get synced after reboot and stucks on the way again

VladStarr commented 8 months ago

Also faced this issue when bootstraping node from official snapshot. Seems that removing nodekey file fixed that problem and sync is now progressing.

VSGic commented 8 months ago

After update to 1.2.7 problem still exists, bor loses sync from blockchain at accidental moment and only reboot pushes it ti start sync from the stuck block, but then it repeats. Removing nodekey did not help

manav2401 commented 7 months ago

Hey, it would be really helpful if we can get the stack trace to see where the bor process is stuck and navigate the root cause. You can get it by the below 2 ways.

Fetch the stack trace directly via IPC

bor attach <path-to-bor.ipc> --exec "debug.stacks()" > stacktrace.txt

Kill the bor process using kill -QUIT <pid of bor> and the logs should have the stack trace

Could you help us with the same? Thanks!

VSGic commented 7 months ago

Hi, @manav2401 See attached files. stacktrace-4 is from bor 1.2.7 stacktrace-8 is from bor 1.2.3

stacktrace-4.txt stacktrace-8.txt

zyx0355 commented 7 months ago

Hello, my current bor client is also stuck in a certain block. The block is 54875999. I rolled back the bor client blocks to 500, 2000, and 15000 and still the problem has not been solved. My client situation is as follows: bor version 1.1.0 and heimdall v1.0.3 The situation of heimdall is as follows: The situation of bor client is as follows: When the bor client just started, it kept looking for peer nodes. After that, the following situation would always occur, and the data could not be synchronized forward. I tried rolling back the bor client: $ bor attach ~/.bor/data/bor.ipc`

debug.setHead("0x10250E8") I tried going back 500, 1000, and 15000 blocks, but the final data block was stuck at block 54875999, and the data could not be synchronized forward. @manav2401 @VAIBHAVJINDAL3012

leoenes commented 6 months ago

Hey guys! bor is stuck in this state now:

Head state missing, repairing number=55,665,375 hash=3510a2..495a1a snaproot=20d014..eff4a3

Hey I think had the same issue, but after trying to fix it, I'm pretty sure I broke my database (it is not starting anymore). I need to download the bor snapshot, but the last image is from February. Is there another link, more recent than this one, to download polygon snapshots?

james-turner commented 6 months ago

I can confirm I’m facing the same issue as described here. I’m running bor v1.3.0-beta-2. Restarting bor seems to solve the problem for a little while but then it stops getting blocks after some time. I’ve yet to rule out potential networking problems. I’m currently based in a home network with port forwarding enabled for p2p ports 30301 , 30303, 26656. it has a very low peer count and that doesn’t seem to improve.

If anyone has any more recent updates could they post them here with things they’ve tried.

zyx0355 commented 6 months ago

Hi,The number of my peer nodes has always been at a very low level, and the quality of the peer nodes found is very poor, so the data cannot be synchronized. Is there any solution?

james-turner commented 6 months ago

I turned the verbosity up on bor to 4 and restarted. It sync'd to head, and then stopped sync'ing. I saw this in the logs:

Apr 12 21:34:46 polygon bor[201293]: INFO [04-12|21:34:46.257] Imported new chain segment               number=55,751,326 hash=638ef5..090cf3 blocks=4    txs=731     mgas=82.068   elapsed=505.778ms   mgasps=162.260 dirty=1022.42MiB

Apr 12 21:34:46 polygon bor[201293]: DEBUG[04-12|21:34:46.257] Inserted new block                       number=55,751,326 hash=638ef5..090cf3 uncles=0 txs=100     gas=8,560,433  elapsed=127.348ms   root=d1f976..f6b459

Apr 12 21:34:46 polygon bor[201293]: DEBUG[04-12|21:34:46.257] Synchronisation terminated               elapsed=1m50.056s

Apr 12 21:34:46 polygon bor[201293]: DEBUG[04-12|21:34:46.263] Unindexed transactions                   blocks=4    txs=357     tail=53,401,327 elapsed=3.014ms

Apr 12 21:34:46 polygon bor[201293]: DEBUG[04-12|21:34:46.268] Reinjecting stale transactions           count=0

Apr 12 21:34:47 polygon bor[201293]: DEBUG[04-12|21:34:47.847] Replaced dead node                       b=8  id=5758c487f99db11c ip=18.171.122.44   checks=0 r=573e27515b6173a4 rip=131.153.232.46

Apr 12 21:34:49 polygon bor[201293]: DEBUG[04-12|21:34:49.935] Deep froze chain segment                 blocks=384  elapsed=241.398ms   number=55,661,326 hash=bfd3ef..6cc77f

Apr 12 21:34:50 polygon bor[201293]: INFO [04-12|21:34:50.660] Got new milestone from heimdall          start=55,750,814 end=55,750,889 hash=0xe317b3273f7b5ba3db2435a4ae7ec3f56a93fb56c5af36d63d6d8142fbf9b736

Apr 12 21:34:53 polygon bor[201293]: DEBUG[04-12|21:34:53.082] Revalidated node                         b=11 id=52a251811399e9f1 checks=1

Apr 12 21:34:54 polygon bor[201293]: DEBUG[04-12|21:34:54.984] Revalidated node                         b=16 id=f4edb64c1c31a642 checks=1

Apr 12 21:34:56 polygon bor[201293]: DEBUG[04-12|21:34:56.409] Revalidated node                         b=8  id=573e27515b6173a4 checks=1

Apr 12 21:34:59 polygon bor[201293]: DEBUG[04-12|21:34:59.355] Served eth_getBlockByNumber              conn=[127.0.0.1:39276](http://127.0.0.1:39276/) reqid=357 duration="92.069µs"

Apr 12 21:35:02 polygon bor[201293]: INFO [04-12|21:35:02.660] Got new milestone from heimdall          start=55,750,814 end=55,750,889 hash=0xe317b3273f7b5ba3db2435a4ae7ec3f56a93fb56c5af36d63d6d8142fbf9b736

Apr 12 21:35:03 polygon bor[201293]: DEBUG[04-12|21:35:03.367] Revalidated node                         b=6  id=579eab95009792f1 checks=2

Apr 12 21:35:05 polygon bor[201293]: DEBUG[04-12|21:35:05.202] Revalidated node                         b=5  id=57b3055bdd011323 checks=4

Apr 12 21:35:11 polygon bor[201293]: DEBUG[04-12|21:35:11.304] RPC connection read error                err=EOF

After another restart i got a bunch of problems with ip table limit:

Apr 12 21:56:01 polygon bor[201683]: DEBUG[04-12|21:56:01.585] IP exceeds table limit                   ip=65.21.164.117
Apr 12 21:56:01 polygon bor[201683]: DEBUG[04-12|21:56:01.645] IP exceeds table limit                   ip=148.251.142.58
Apr 12 21:56:01 polygon bor[201683]: DEBUG[04-12|21:56:01.645] IP exceeds table limit                   ip=148.251.142.59
Apr 12 21:56:01 polygon bor[201683]: DEBUG[04-12|21:56:01.645] IP exceeds table limit                   ip=65.21.164.126
Apr 12 21:56:01 polygon bor[201683]: DEBUG[04-12|21:56:01.680] IP exceeds table limit                   ip=148.251.142.68
Apr 12 21:56:01 polygon bor[201683]: DEBUG[04-12|21:56:01.680] IP exceeds table limit                   ip=65.21.164.113
Apr 12 21:56:01 polygon bor[201683]: DEBUG[04-12|21:56:01.680] IP exceeds table limit                   ip=65.21.164.117
Apr 12 21:56:02 polygon bor[201683]: DEBUG[04-12|21:56:02.403] IP exceeds table limit                   ip=148.251.142.58

Additionally found this post on geth https://github.com/ethereum/go-ethereum/issues/1563 where peer count is low and getting accepted by other peers don't seem to occur easily.

james-turner commented 6 months ago

Interesting even though my blocks aren't coming through I do get the occassional transaction appear. So obviously some connectivity is still occurring. Where do blocks come from? I thought bor was the recipient of blocks, but is it actually heimdall?

james-turner commented 6 months ago

Ok randomly last night after yes another restart I was able to get blocks through steadily and it hasn't stopped since. The last change I made was to increase the maxpeers=200 in config.toml for Bor. I don't know exactly why this might have been the desired change, but it appears to have worked. I heavily suspect the actual problem is related to getting a steady and useful peer count. I still only have 37 peers which seems extremely low and the number doesn't appear to be going up.

GeassV commented 6 months ago

Ok randomly last night after yes another restart I was able to get blocks through steadily and it hasn't stopped since. The last change I made was to increase the maxpeers=200 in config.toml for Bor. I don't know exactly why this might have been the desired change, but it appears to have worked. I heavily suspect the actual problem is related to getting a steady and useful peer count. I still only have 37 peers which seems extremely low and the number doesn't appear to be going up.

Really thanks, nice try; Mine stuck and not sync even with manual restart. Followed your setting, seems fine now

VSGic commented 6 months ago

Hi everyone. updating setting to maxpeers=200 did not help for me. Still stucks.

james-turner commented 6 months ago

I'm reasonably confident it's a problem with (a) having decent peers (b) having connectivity to those peers and discoverability

Firstly check some things: 1) you have correct ports exposed to the internet (and if you're behind a NAT make sure to port forward 26656 and 30303) 2) once bor is started, do sudo bor attach bor.ipc 3) then at prompt do admin.peers.forEach(p => console.log(p.enode)) 4) take all these nodes that you have a connection with and edit your config.toml to include those are bootnodes and static-nodes 5) restart bor sudo service bor restart

I found that once I had set up correct port exposure and enough peers it seems to continue to work ok.

james-turner commented 6 months ago

Alternatively you can just utilise my list of peers:

enode://f422a5032462e3f1e77a2e5943fb5c1dfa1cf43504d7256a51a8db1fdfb1761abbf78031938b3ba0f4791772fc31107b0230cd8cd6dc9a72b181cf11b337f8e2@131.153.232.37:30329
enode://b6b12c778b02da773ab6e53ff3ed801deceee77d55359234250da276b363f59e28e24b7e1c445e0b58e07aa974e380b6666524c51e09bd45aa5da288459ef518@141.95.35.53:30303?discport=30395
enode://8a4f4600aa8bde4419a250d2137bf779d5fc445fd3e541fa340d267410109e8e70e9b620efcf8f69e0249a5644c07645d52e16e8573cb635638dbe25ccf63329@131.153.238.130:30310
enode://60bcf1e0b04bfa9dfa7223959705eca089fa62b8634c30b77de7210da0b4b3c94a83213cd0fac8db05995d0e614f89b0a768198f134e5cd2157652cba4985962@79.127.216.33:30303?discport=30416
enode://8d00e8e93bd02839850e081e97200587caa3b5b485716bc78115035aece4bad36fd7eeace9501a4a5259a03cca1137319dd772fd7e3dc8c0c7d7c79220e0b4b2@52.4.6.33:30306
enode://94af7827b09d190594db5971bb691b060d3078907be2a3dfb35db066ef921260adf1f510c0e799f2c2e1011790e23e35f77a801a0a5aef8c4ed4896b88b54725@144.76.143.137:30311
enode://854bf9f4520f42eace23d3b12a18194a52303e678f6f01773448b3fad8276dcf78ffbbee39b63a4def0141b0aea8852053536c6fb7b300ea15ed1956a6d840f6@178.63.197.235:30322
enode://b80a035437ff33fe6e0c44b6e5ff130fce595dc0065a5908cf1de392ddf6ff1689b0feb6eeed74bd1cb0f20622e22b0094c76ffaa4d2a4c9bd99abc76ca3d4d5@167.235.187.6:30306
enode://d04a33070702a98e9084f44d5188609c8b2bef61fd47a3d8bd80da3a7f256384af5298a7d4aec844598c39c4865a9938fdc2f3a6871076891791c4cc1856e425@47.128.191.46:30309
enode://41565801b34abe50e5085a7267248684148f1d4b41d9aef38edfeef3fb2703434bc5b8d44ca98263914f840fe8566806c044bec21c1c408f69fbb2a5314a692e@34.205.208.100:30304
enode://274d63a16720bec5a13bb4d5115f64ea2806104f3b77634d2b91d9298e4c3a432168b4e2270cf8bdeb4305c0fe78af91530ac2a9117e0ab3053bace53ce8c8d6@178.63.148.11:30315
enode://735a8029b30f839315d38ff01d5d02ddd5f772d59ec7a0c70b127a8d49791094b1204d01bef9f733c51568499018df647c25e620133b75dbb3bc1412e5b5255b@95.216.127.78:30303
enode://56a40d77ac767dfe9613034530879f60c13b20925ca6b1e384ac7166cb6b7a5def5960b854439875f5f14267530811bcb51b5ee93b38e38f9537642bc9e85cfe@44.207.75.52:30303
enode://a09fb28e22e7d9132359662c9e340dc98682eb98a7658e3aac65be901bea966ca4435d05167d9b065bf8db79a9b220160ef9f241a1e4a0a709853c268082cba4@44.242.14.73:30313
enode://0ff046742976c2afe0aa8bf4c1956eeccb8cb220bda687aab7393a634ee4681318d4ac23a376520cf62b29cfdd3f8de3e8ed274637b645f111ce8a5dd21bb78d@52.195.35.130:30310
enode://2ad3521be3c7528080e614827eb46ffc75d25e1ead07f40539c58bf534a195c9f3d3b493c29316d9b6efdfc4fa3a1d424ba54e3e89df9ae3c3f2f56626ef026c@131.153.238.118:30504
enode://376a0c86dd56c012199402d8636174fe9f565567fb6cb45c55dd50f9af147e3c77cfadc41325b940094a583aede5578cd91bd7fbb2d9516bb1e47f33a709f3fa@88.198.99.100:30304
enode://ac1ad52dca196112693ff95ff1c550a4402e6a0d0117661ef0dc2c49e15e4046b9a3c5388a5e506bbfbbcd49cbb384de7018f6d982f01b3cc1511353d77340bf@18.219.134.46:30318
enode://bedbd7199f27fa184f9b791c6800725069d21ec76e95760877810c440c0252a1f0234345feb6d9fccbe9c61da3dd92e143d72ace0c4fff931a29bfdb85f68037@52.49.125.50:30303
enode://49c9f3a6f4a1c0cf526b1cb95e61cb389f6dd63a39f70cdb53e1072d290e9aedfda1370b69a13f0266ce2ea9c76f93e66d2e504b30c72f1432ee7d3ee07fc67f@3.120.215.154:30318
enode://15421dfe508c8bcaf4df0d5cb7b70424c283aa66875b778d482c938efd796314cb018246cf5cfe473f745f49b985afc5dcafcdaad333c788ba4b97e357c51853@135.181.236.61:30303?discport=30498
enode://e83278b3f8c8f0ed6b7c0d6311549e29e0b674b87cbf60544fba244ec3248360b3f3b0e087ad41527151fd255b75a16ef3e2444145d0e9e4377796149029c50c@141.95.31.164:30303
enode://50219d6315852398feaf9762d292daa9367d281ef6b48fe0b9ecc7075466a23cc36dadb58765fbd127d04b5bc2325a41437934db22110c14b557e77a0094a4a7@52.195.48.20:30319
enode://2dd198c54ed9bb191fed772a391e5b777aaa7145ced7e6dd0e2de816bd3527ba711bbb4d02fd43515e652209ef1dbb7c8cd577a5f486adda09aadbc1291b6cb3@3.9.224.239:30324
enode://455186fa23714d4a40977784f044f146d8062dd9c65849c9bda0040fdd700a7655ddaf892ca81e8602dd1a08e19f7e3c14dbe80af16823f867577a937591de96@148.251.142.70:30314
enode://b75bce4d1a3e3460790e13a46253674314a3a5d9161ebd1cf3ef5212a5d4cf332bd6973e5d7856c24e15e17aa7adafafb89c303c7f8b7813b37718471f03a099@3.8.45.159:30317
enode://d1e05cd722d5c5734a55049065d67c81545e0e5b46f2338734737d3817c5a9b1c0c0baac401ef169398ea50e68c798b7438bc622eb2d2c58b8f2398342e8be07@35.169.31.203:30303?discport=30371
enode://1825021e6229a8acddaaa28c16b5a33dcadd6e3153c53ce8dc1f0d7dc8abc5422e98ffe56bfcae6d1e2ce8a0325ce1666ec743aebd8aff3b20799a87e6c467eb@54.38.76.225:30303?discport=30313
enode://185771dc3a43b298b2da6d9ac26801d80d754111e63689b549b002dc43b096455ff6a7ccb7483ffd112bdbe746b37d49716c2787919bda1bfe111aef36bb9a96@148.251.194.252:30303?discport=30548
enode://1c3a5773431e82a1f5a2325b5475ee2ddff84a489480c977280898ffe3a6fdee85b480e5f7cb0a4bf88881cddd73a2abac1542c191ca51fd2bbd271a69eedf28@87.249.137.89:30303?discport=30330
enode://f7e003951d9eee96bf460065ef0018a9e758e1dd9a14e13add07b0fb6a22369779e112e6c2b4ebd30f5f4a37a2013976b2bd75a54ad6176269011d2dd9589785@84.17.38.164:30303?discport=30523
enode://ad0e380cb6d23f6126350c7c03f3a35177d6e428436f605e633b96f5437f910b1ea93e4462e612a5c9af116d09eda1ceae1a3429f48f269173262d9d828ed924@107.6.141.6:30309
enode://9d79af7012353cf5c0f1f48549327ee3e9548ecc658beccb79ec595e1e0dde2d2540de11ca19236a047813aaebd74181646c3039e086348492f0e46801562489@51.21.125.132:30313
enode://188c1273f9f25cad8dc26040e252c2e8b92c1ddf79f7f659fcb59b65f1b222d1c14a116853c30fc66cc894b7125c4915d002209067d1a7d86d64f495b265e293@13.125.0.203:30307
enode://a341923aa8d22f2018106cabcdb22b14a271701ea80b06cd7737c1bc299ac42e06387f5381eeebe67f2517b4f11a1b464adb2465bf9eebc1abf52d0e3eb8b9dd@13.57.125.97:30320

Put these in your bootnodes/static-nodes and hopefully you will get running. Periodically i check the set of connected nodes I have and log them, then utilise those as bootnodes in the hope i can find more.

VSGic commented 6 months ago

@james-turner Thank you for explanations, I have followed your instruction, it did not help finnally, but behavior of node have changed. Now there is no stucking, but irregular rising and going down lag from the blockchain blocks.

GeassV commented 6 months ago

Ok randomly last night after yes another restart I was able to get blocks through steadily and it hasn't stopped since. The last change I made was to increase the maxpeers=200 in config.toml for Bor. I don't know exactly why this might have been the desired change, but it appears to have worked. I heavily suspect the actual problem is related to getting a steady and useful peer count. I still only have 37 peers which seems extremely low and the number doesn't appear to be going up.

Really thanks, nice try; Mine stuck and not sync even with manual restart. Followed your setting, seems fine now

Well, it survived for about two weeks and then stuck again, syncing for a while and stuck repeatly. Tried the above solution of listing peers(plus some from site: https://polygonscan.com/nodetracker/nodes) in the config file, not work. Also tried to re-download snapdata for bor, which falls pretty far behind to Feb, still stucks So now, my two full nodes of polygon are dying and I cannot help

0xsajal commented 6 months ago

Hey all,

We are no longer providing snapshots for the community. Instead, we have transitioned to a community-driven model where snapshots are provided by some of the most active members. These include the following validators: Vault Staking (Mainnet/Mumbai), Stakepool (Mainnet/Amoy), StakeCraft (Mainnet/Mumbai/Erigon Archive) and Girnaar Nodes (Amoy).

Also, StakeCraft has introduced a new service for the Polygon community - All4nodes.io aggregator service where snapshots from all different providers can be found. More details here: https://forum.polygon.technology/t/stakecraft-introduces-a-new-service-for-polygon-community-all4nodes-io-aggregator-service/13694/1

This decision has been made to foster greater community involvement and to distribute responsibilities more equitably among our dedicated community members. Empowering the community to generate snapshots will not only ensure their timely availability but also promote collaboration and engagement within our community.

For inquiries, contact community services directly.

Regards, Team Polygon Labs

VSGic commented 6 months ago

So the problem to download fast snapshot for bor is secondary now, more serious is bor regular stuck, because it is not possible to provide transactions with such nodes and this is affects business

github-actions[bot] commented 5 months ago

This issue is stale because it has been open 14 days with no activity. Remove stale label or comment or this will be closed in 14 days.

VSGic commented 5 months ago

There is other issues about this problem, so I think this is still actual

maticnetwork / bor