Finality Lagging issues in Kusama archive & full nodes

Imod7 commented 1 year ago

Is there an existing issue?

[X] I have searched the existing issues

Experiencing problems? Have you tried our Stack Exchange first?

[X] This is not a support question.

Description of bug

The Integrations Team at Parity was contacted by a team who is running KSM nodes and they reported sync isssues as soon as they upgraded to v.0.9.37.

A small summary of the issues they faced & the solution they found :

While running their KSM nodes in v.0.9.36 everything was very stable and no sync issues.
As soon as they upgraded to v.0.9.37 their peer count dropped significantly.
They also faced finality lagging and the number of best block was always much larger than the finalized block.
Also, the finalized block number was being updated every few minutes.
In previous versions these two numbers were updated almost synchronously.
They added -l afg=trace,sync=debug flags to get more detailed logs and restarted the node
After the restart, the number of peers increased (before the restart only < 3, after > 10).
Restart helped for some period of time but the number of peers dropped again, finally only 0 or 1.
Again finalized blocks were falling behind.
They downgraded to v.0.9.36 and now it looks like there is no lagging.

Their Setup

They run Archive & Full Nodes
Regarding their Infrastructure, Storage and Memory/CPU resources are dynamically provisioned and scale on-demand therefore system resources do not seem to be a reason.

Flags

Two examples of the flags they are using when running their nodes :

/usr/bin/polkadot --name $NAME_TEST --chain=kusama --database=RocksDb --rpc-cors all --pruning=100000 --prometheus-external --ws-external --rpc-external --wasm-execution Compiled -d /home/kusama/data

and :

--name kusamanetwork --chain=kusama --ws-external --rpc-external --rpc-cors=all --pruning archive

Lagging from Log

One output they shared from their logs that shows the sync issue / lagging :

Idle (4 peers), best: #16372008 (0x17d2…107b), finalized #16371948 (0x6855…620d)

Errors from Logs

After adding -l afg=trace,sync=debug and restarting, they shared their logs so I paste here some of the errors found there :

2023-01-30 10:28:01.569 ERROR tokio-runtime-worker sync: 💔 Called on_block_justification with a bad peer ID
2023-01-30 10:28:02.518 DEBUG tokio-runtime-worker sync: New peer with unknown best hash 0xc94a…812f (16415165), searching for common ancestor.
2023-01-30 10:28:02.518 DEBUG tokio-runtime-worker sync: Connected 12D3KooWHFkuQi1...
2023-01-30 10:28:03.166 DEBUG tokio-runtime-worker sync: Propagating transactions
2023-01-30 10:28:03.813  INFO tokio-runtime-worker substrate: 💤 Idle (1 peers), best: #16415164 (0xd460…70e1), finalized #16415054 (0x3714…0c4d), ⬇ 38.1kiB/s ⬆ 2.0kiB/s
2023-01-30 10:28:04.517 DEBUG tokio-runtime-worker sync: Reannouncing block 0xc94a9a2018e27e320683c6c1bf... is_best: true
2023-01-30 10:28:04.517  INFO tokio-runtime-worker substrate: ✨ Imported #16415165 (0xc94a…812f)
2023-01-30 10:28:04.517 DEBUG tokio-runtime-worker sync: New best block imported 0xc94a9a2018e27e320683c6...
2023-01-30 10:28:04.517 DEBUG tokio-runtime-worker sync: 12D3KooWHF... disconnected
2023-01-30 10:28:04.975 ERROR tokio-runtime-worker sync: 💔 Called on_block_justification with a bad peer ID
2023-01-30 10:28:06.066 DEBUG tokio-runtime-worker sync: Propagating transactions

...

2023-01-30 10:07:25.064  WARN tokio-runtime-worker sync: 💔 Error importing block 0x74bdb741923407487154...: block has an unknown parent
2023-01-30 10:07:25.064 DEBUG tokio-runtime-worker sync: Restarted with 16414958 (0xe9c8…2882)
2023-01-30 10:07:25.064 DEBUG tokio-runtime-worker sync: New peer with known best hash 0x87f2…c82e (16414957).
2023-01-30 10:07:25.064 DEBUG tokio-runtime-worker sync: New peer with unknown best hash 0x2cf8…74d3 (16414958), searching for common ancestor.

If needed, we can share internally the full list of logs cc @IkerAlus @CurlyBracketEffect @Imod7

Similar Past Issues ?

I am adding here some past issues I found that look similar :

Related Current Issue ?

There is one issue that was brought to my attention by @gabreal (devops) and it looks like it is related polkadot-sync-0 + kusama-sync-0 lagging behind paritytech/substrate#26

Steps to reproduce

Upgrade a KSM node from v.0.9.36 to v.0.9.37

melekes commented 1 year ago

New peer with unknown best hash 0xc94a…812f (16415165), searching for common ancestor.

There's likely a bug in ancestry search as indicated by https://github.com/paritytech/polkadot-sdk/issues/528

error importing block 0x74bdb741923407487154...: block has an unknown parent

this is interesting because it's not clear to me why the remote would choose to provide us a block with an unknown parent. would be good to see debug logs for this case.

bkchr commented 1 year ago

this is interesting because it's not clear to me why the remote would choose to provide us a block with an unknown parent. would be good to see debug logs for this case.

The remote is not deciding which blocks it is providing to us. We are actively requesting blocks from the remote. It could either be a bug in sync where we drain the received blocks incorrectly and did not imported the parent before or the parent failed to import. Or we requested something for that we don't have the parent.

altonen commented 1 year ago

This is probably related to all the other syncing-related issues we've seen recently. We're working on a couple of fixes

BulatSaif commented 1 year ago

After the restart, the number of peers increased (before the restart only < 3, after > 10).

Restart helped for some period of time but the number of peers dropped again, finally only 0 or 1.

I have a similar observation on our node.

Old behavior (v.0.9.36): If p2p port is closed by the firewall or you running behind NAT, the node is working properly. The sync peer number is equal to the value set in the --out-peers=N. New behavior(>v.0.9.37): If p2p port is closed the peer number is jumping between 0 and 10. A node can stale for 10-15 min and then keep syncing.

For example, the screenshot of the node which had an incorrect --listen-addr flag (wrong port) and "--in-peers=150" "--out-peers=150" Old behavior v.0.9.36: sync peers = 150-160

New behavior:(v.0.9.37) sync peers = jumping between 0-10 Screenshot from 2023-03-06 19-12-22

Fix: After flag --listen-addr was fixed the sync peers grew to 300. Zoom of fix moment:

altonen commented 1 year ago

@BulatSaif thank you this is very valuable information

I wonder if this has something to do with libp2p because it was update for v0.9.37

bkchr commented 1 year ago

@BulatSaif thank you this is very valuable information

I wonder if this has something to do with libp2p because it was update for v0.9.37

@altonen maybe we should test if downgrading fixes it and then find out what the difference is.

altonen commented 1 year ago

@BulatSaif thank you this is very valuable information I wonder if this has something to do with libp2p because it was update for v0.9.37

@altonen maybe we should test if downgrading fixes it and then find out what the difference is.

I've been prioritizing the CPU usage issue since adjusting peer ratios should've bought us some time with this "node keeps losing peers" issue and because I believe we can deploy a fix the CPU issue for relatively soon that would take some of the load off of NetworkWorker but I will get back to this issue soon. I hope it's something we did in Substrate and not something that changed in libp2p.

Imod7 commented 1 year ago

The same team reported low peer issues this morning after upgrading their KSM nodes to v0.9.41. Here are a few logs that they shared :

2023-04-19 04:04:30 💤 Idle (1 peers), best: #17545073 (0x1f8f…fa20), finalized #17544946 (0x4388…d267), ⬇ 7.3kiB/s ⬆ 2.3kiB/s
2023-04-19 04:04:35 💤 Idle (1 peers), best: #17545073 (0x1f8f…fa20), finalized #17544946 (0x4388…d267), ⬇ 1.3kiB/s ⬆ 1.7kiB/s
2023-04-19 04:04:37 ✨ Imported #17545074 (0x0455…6689)
2023-04-19 04:04:40 💤 Idle (1 peers), best: #17545074 (0x0455…6689), finalized #17544946 (0x4388…d267), ⬇ 21.8kiB/s ⬆ 1.5kiB/s
2023-04-19 04:04:40 ✨ Imported #17545075 (0x0b0d…4427)
2023-04-19 04:04:45 💤 Idle (0 peers), best: #17545075 (0x0b0d…4427), finalized #17544946 (0x4388…d267), ⬇ 6.2kiB/s ⬆ 1.4kiB/s
2023-04-19 04:04:50 💤 Idle (0 peers), best: #17545075 (0x0b0d…4427), finalized #17544946 (0x4388…d267), ⬇ 0.9kiB/s ⬆ 3.7kiB/s
2023-04-19 04:04:52 ✨ Imported #17545076 (0x6821…5488)
2023-04-19 04:04:52 ✨ Imported #17545077 (0xb928…4823)

I will share more detailed logs as soon as/if they are provided.

One interesting fact was that these issues appear only in the US and the Tokyo region. However, their nodes running in the EU (Ireland) are stable enough and they have at least 9-15 peers connected at any given time.

altonen commented 1 year ago

There will be a PSA related to these low peer counts with the next release. Basically if the team is experiencing this and they're running the nodes behind a NAT, it would be a good idea to test opening the P2P port for inbound connections. The reason why the node is only able to have 0-1 peers is because the public nodes reachable to it are fully occupied, that is not an issue with the node implementation but that the network has too many nodes. If they have the P2P port open for inbound connections and they're still experiencing this, then the issue is something else. In that case, -lsync,sub-libp2p=trace logs are needed

Imod7 commented 1 year ago

Thank you @altonen for the quick feedback! I forwarded your suggestions to the three teams who are experiencing the same/similar issue (low peer count/syn issue in v0.9.41).

First team cannot make any of the changes you suggest due to strict internal security restrictions. We are waiting for the detailed logs after adding trace flags when running their node.
I did not have any feedback from the second team on whether they will implement your suggestions. However, earlier they shared their flags and logs so I am adding them here in case it is helpful for you.
```
"--chain=kusama", "--no-prometheus", "--no-telemetry", "--unsafe-rpc-external", "--unsafe-ws-external", "--rpc-methods=unsafe", "--rpc-cors=all", "--pruning=archive", "--in-peers=200", "--out-peers=50", "--max-parallel-downloads=30", "-lsync,sub-libp2p=trace"
```
Their sync will still get stuck at a certain height and continue to synchronize for a long time.

Their logs after adding trace flags ksm1.log
The third team will try your suggestion now so I will keep you updated on the impact that this may have.

altonen commented 1 year ago

If they can't open the inbound port, even temporarily to rule out that it's not an issue with inability to accept inbound connections, there's very little we can do about this. If they can give at least trace logs, there might be something there.

I looked at the trace logs and they seem to have a same problem: --out-peers 50 (which btw please don't do, the default values should be kept and --in-peers should be increased) but they still have very often between 0-5 peers. After v0.9.42 is released, it may drop even further because SyncingEngine will keep evicting idle peers.

I await for a response from the third team.

Imod7 commented 1 year ago

First team also shared logs after adding -l afg=trace,sync=debug flags : ksm-v0.9.41-low-peers.log.tar.gz

There is this error : ERROR tokio-runtime-worker sync: 💔 Called on_block_justification with a peer ID of an unknown peer \n","stream":"stderr","time":"2023-04-19T..."} which I think you are already aware of since it was also mentioned in this issue #6228
and the same warning mentioned in the description : WARN tokio-runtime-worker sync: 💔 Error importing block 0x7026bac89f13cf68074ef15......: block has an unknown parent \n","stream":"stderr","time":"2023-04-19...."}

altonen commented 1 year ago

The first issue should be fixed by this and it will be included in the next release.

In the logs there is this pattern that repeats constantly:

2023-04-20 03:01:43.409  INFO substrate: 💤 Idle (4 peers), best: #17558784 (0xca4e…dae8), finalized #17558755 (0xe754…58bb), ⬇ 2.0kiB/s ⬆ 2.0kiB/s
2023-04-20 03:01:44.393 DEBUG sync: New peer with unknown best hash 0x1641…7b9b (17558785), searching for common ancestor.
2023-04-20 03:01:44.393 DEBUG sync: Connected 12D3KooWGBnyipqbontcHFyXP4Z991T1iG6fPiDrPZ9UrcWmwusY
2023-04-20 03:01:45.288 DEBUG sync: New peer with unknown best hash 0x1641…7b9b (17558785), searching for common ancestor.
2023-04-20 03:01:45.288 DEBUG sync: Connected 12D3KooWQnHNy9p4Ydrd9BkE7oHfVgHf1taAWwWLPMkNgUvq2fdc
2023-04-20 03:01:45.688  INFO substrate: ✨ Imported #17558785 (0x1641…7b9b)
2023-04-20 03:01:45.688 DEBUG sync: Reannouncing block 0x16417327dc9dfa207a035124dd6190f851f005e2f43a7080addb12745a937b9b is_best: true
2023-04-20 03:01:45.688 DEBUG sync: New best block imported 0x16417327dc9dfa207a035124dd6190f851f005e2f43a7080addb12745a937b9b/#17558785
2023-04-20 03:01:45.689 DEBUG sync: 12D3KooWDUJZ7TpuPuCTxGfp2Y5gxApnJQS88p2pvZmcZQPd6NiH disconnected
2023-04-20 03:01:45.689 DEBUG sync: 12D3KooWBsJKGJFuv83ixryzMsUS53A8JzEVeTA8PGi4U6T2dnif disconnected
2023-04-20 03:01:45.689 DEBUG sync: 12D3KooWQnHNy9p4Ydrd9BkE7oHfVgHf1taAWwWLPMkNgUvq2fdc disconnected
2023-04-20 03:01:45.689 DEBUG sync: 12D3KooWEpotdFJjvs5tLopnJoM6YpyeKJgSPodJykS1DFHKhU6X disconnected
2023-04-20 03:01:45.689 DEBUG sync: 12D3KooWGBnyipqbontcHFyXP4Z991T1iG6fPiDrPZ9UrcWmwusY disconnected
2023-04-20 03:01:45.689 DEBUG sync: 12D3KooWFj2ndawdYyk2spc42Y2arYwb2TUoHLHFAsKuHRzWXwoJ disconnected

The node imports a block, announces it and soon after peers start disconnecting. This is because right now the SyncingEngine falsely believes it has paired with these nodes when in fact it has been rejected, most likely because the full node slots of the remote peers are already occupied and the outbound substream from local node's PoV is closed. It is not detected until the node actually sends something so it then tries to announce this new best block to its peers and notices that the substream is closed and removes the peers.

I believe the underlying problem to be with fully occupied slots. The fact that it's observable in Japan and the US but not in Europe is interesting and I'll ask DevOps if we could have access to a VM in these problem regions.

johnsonco commented 1 year ago

ksm-low-peers.log parity/polkadot:v0.9.42, deployed in AWS ap-northeast-1 Tokyo.

Hey team, we're still having the same sync issue after we opened the 30333 port. The logs with -l afg=trace,sync=debug flags are above.

Full command: --chain kusama --unsafe-rpc-external --unsafe-ws-external --rpc-cors=all --pruning=100000 -l afg=trace,sync=debug

altonen commented 1 year ago

@johnsonco

can you run the node with -lsync,sub-libp2p=trace? You're running v0.9.42?

Imod7 commented 1 year ago

@johnsonco reported that their node is syncing properly now. It required approximately 12 hours after they opened the p2p port to sync correctly. @altonen I shared the logs (with -lsync,sub-libp2p=trace flags) of the team in this google drive folder

altonen commented 1 year ago

It is quite interesting, the node is getting block announcements and its best block is increasing but the finalized block is not. There is another bug report opened today that is very similar to this: https://github.com/paritytech/polkadot-sdk/issues/632

There's a suspicious pattern in the logs, the nodes keeps exchanging some message(s) over the GRANDPA substream hundreds of times a second:

{10:54:22.835 sub-libp2p: Handler(PeerId(\"12D3KooWSxi7MybyBPiU5MUx4vmJFv3d6iEdh2mw6Byj9Ha1Xww2\")) \u003c= Sync notification
{10:54:22.835 sub-libp2p: External API =\u003e Notification(PeerId(\"12D3KooWSxi7MybyBPiU5MUx4vmJFv3d6iEdh2mw6Byj9Ha1Xww2\"), OnHeap(\"/b0a8d493285c2df73290dfb7e61f870f17b41801197a149ca93654499ea3dafe/grandpa/1\"), 150 bytes)
{10:54:22.835 sub-libp2p: Handler(PeerId(\"12D3KooWSxi7MybyBPiU5MUx4vmJFv3d6iEdh2mw6Byj9Ha1Xww2\")) \u003c= Sync notification
{10:54:22.835 sub-libp2p: External API =\u003e Notification(PeerId(\"12D3KooWSxi7MybyBPiU5MUx4vmJFv3d6iEdh2mw6Byj9Ha1Xww2\"), OnHeap(\"/b0a8d493285c2df73290dfb7e61f870f17b41801197a149ca93654499ea3dafe/grandpa/1\"), 150 bytes)
{10:54:22.835 sub-libp2p: Handler(PeerId(\"12D3KooWSxi7MybyBPiU5MUx4vmJFv3d6iEdh2mw6Byj9Ha1Xww2\")) \u003c= Sync notification
{10:54:22.835 sub-libp2p: External API =\u003e Notification(PeerId(\"12D3KooWSxi7MybyBPiU5MUx4vmJFv3d6iEdh2mw6Byj9Ha1Xww2\"), OnHeap(\"/b0a8d493285c2df73290dfb7e61f870f17b41801197a149ca93654499ea3dafe/grandpa/1\"), 150 bytes)
{10:54:22.835 sub-libp2p: Handler(PeerId(\"12D3KooWSxi7MybyBPiU5MUx4vmJFv3d6iEdh2mw6Byj9Ha1Xww2\")) \u003c= Sync notification
{10:54:22.835 sub-libp2p: External API =\u003e Notification(PeerId(\"12D3KooWSxi7MybyBPiU5MUx4vmJFv3d6iEdh2mw6Byj9Ha1Xww2\"), OnHeap(\"/b0a8d493285c2df73290dfb7e61f870f17b41801197a149ca93654499ea3dafe/grandpa/1\"), 150 bytes)
{10:54:22.835 sub-libp2p: Handler(PeerId(\"12D3KooWSxi7MybyBPiU5MUx4vmJFv3d6iEdh2mw6Byj9Ha1Xww2\")) \u003c= Sync notification
{10:54:22.836 sub-libp2p: External API =\u003e Notification(PeerId(\"12D3KooWSxi7MybyBPiU5MUx4vmJFv3d6iEdh2mw6Byj9Ha1Xww2\"), OnHeap(\"/b0a8d493285c2df73290dfb7e61f870f17b41801197a149ca93654499ea3dafe/grandpa/1\"), 150 bytes)
{10:54:22.836 sub-libp2p: Handler(PeerId(\"12D3KooWSxi7MybyBPiU5MUx4vmJFv3d6iEdh2mw6Byj9Ha1Xww2\")) \u003c= Sync notification
{10:54:22.845 sub-libp2p: Handler(ConnectionId(180)) =\u003e Notification(12D3KooWSxi7MybyBPiU5MUx4vmJFv3d6iEdh2mw6Byj9Ha1Xww2, SetId(2), 150 bytes)
{10:54:22.845 sub-libp2p: External API \u003c= Message(12D3KooWSxi7MybyBPiU5MUx4vmJFv3d6iEdh2mw6Byj9Ha1Xww2, SetId(2))
{10:54:22.845 sub-libp2p: Handler(ConnectionId(180)) =\u003e Notification(12D3KooWSxi7MybyBPiU5MUx4vmJFv3d6iEdh2mw6Byj9Ha1Xww2, SetId(2), 150 bytes)
{10:54:22.845 sub-libp2p: External API \u003c= Message(12D3KooWSxi7MybyBPiU5MUx4vmJFv3d6iEdh2mw6Byj9Ha1Xww2, SetId(2))
{10:54:22.845 sub-libp2p: Handler(ConnectionId(180)) =\u003e Notification(12D3KooWSxi7MybyBPiU5MUx4vmJFv3d6iEdh2mw6Byj9Ha1Xww2, SetId(2), 150 bytes)
{10:54:22.845 sub-libp2p: External API \u003c= Message(12D3KooWSxi7MybyBPiU5MUx4vmJFv3d6iEdh2mw6Byj9Ha1Xww2, SetId(2))
{10:54:22.845 sub-libp2p: Handler(ConnectionId(180)) =\u003e Notification(12D3KooWSxi7MybyBPiU5MUx4vmJFv3d6iEdh2mw6Byj9Ha1Xww2, SetId(2), 150 bytes)
{10:54:22.845 sub-libp2p: External API \u003c= Message(12D3KooWSxi7MybyBPiU5MUx4vmJFv3d6iEdh2mw6Byj9Ha1Xww2, SetId(2))
{10:54:22.845 sub-libp2p: Handler(ConnectionId(180)) =\u003e Notification(12D3KooWSxi7MybyBPiU5MUx4vmJFv3d6iEdh2mw6Byj9Ha1Xww2, SetId(2), 150 bytes)
{10:54:22.845 sub-libp2p: External API \u003c= Message(12D3KooWSxi7MybyBPiU5MUx4vmJFv3d6iEdh2mw6Byj9Ha1Xww2, SetId(2)
{10:54:22.845 sub-libp2p: Handler(ConnectionId(180)) =\u003e Notification(12D3KooWSxi7MybyBPiU5MUx4vmJFv3d6iEdh2mw6Byj9Ha1Xww2, SetId(2), 150 bytes)
{10:54:22.845 sub-libp2p: External API \u003c= Message(12D3KooWSxi7MybyBPiU5MUx4vmJFv3d6iEdh2mw6Byj9Ha1Xww2, SetId(2))
{10:54:22.845 sub-libp2p: Handler(ConnectionId(180)) =\u003e Notification(12D3KooWSxi7MybyBPiU5MUx4vmJFv3d6iEdh2mw6Byj9Ha1Xww2, SetId(2), 150 bytes)
{10:54:22.845 sub-libp2p: External API \u003c= Message(12D3KooWSxi7MybyBPiU5MUx4vmJFv3d6iEdh2mw6Byj9Ha1Xww2, SetId(2))

So it could be that it's actually getting finality notifications but they're buried under whatever this flood of other messages is. Then at some point 12D3KooWSxi7MybyBPiU5MUx4vmJFv3d6iEdh2mw6Byj9Ha1Xww2 disconnects which could be either a normal disconnection or the the local node got banned if it advertised duplicate messages.

@andresilva do you have any thoughts on this?

andresilva commented 1 year ago

The grandpa log target was changed from afg, so I'd need some logs with -lgrandpa=trace to be able to figure out what's happening there.

Imod7 commented 1 year ago

The team shared new logs (with -lsync,sub-libp2p=trace and -lgrandpa=trace flags). The new file ksm-logs2.tar.gz is uploaded in the same google drive folder.

Please note that the new logs is after the issue was solved so the node no longer had the peer issue. However, I still requested the file in case it can show any interesting pattern/weird behaviour that is not causing any issues now but might affect the peer count again in the future.

andresilva commented 1 year ago

Indeed these logs don't show anything abnormal but that's expected since everything is working properly there. Could still be helpful to keep running with those log targets in case the issue happens again.

So it could be that it's actually getting finality notifications but they're buried under whatever this flood of other messages is. Then at some point 12D3KooWSxi7MybyBPiU5MUx4vmJFv3d6iEdh2mw6Byj9Ha1Xww2 disconnects which could be either a normal disconnection or the the local node got banned if it advertised duplicate messages.

@andresilva do you have any thoughts on this?

I have confirmed that these 150 byte messages are indeed GRANDPA votes. Since we didn't have grandpa=trace logs I can't tell whether those are duplicate / stale votes that some peer is spamming us or not, if they were then the disconnection could be legitimately triggered by us (due to peer reputation getting too low).

bhargavtheertham-cb commented 1 year ago

Team, we are facing the same issue, how do we open the p2p port?

Imod7 commented 1 year ago

The P2P port by default should be 30333. You have to manually open it in the firewall/NAT settings to allow inbound connections.

An example that I have at hand is from the ansible scripts I used to setup my validator : p2p_port: 30333 and: ufw allow {{ p2p_port }}/tcp However, this is because I use ufw firewall so in your case it depends on what setup you have.

altonen commented 1 year ago

Since the node is working fine, I'm not sure how much interesting information there is about a possible issue but sadly these logs don't help me debug this further. GRANDPA just tells it received a message but I don't know what that message is and whether it's expected behavior.

07:07:40.150 sub-libp2p: Handler(PeerId(\"12D3KooWKdovqTGfjXRdtoSG4QY5ctGQprUsuReYxyujgE16ZcQa\")) \u003c= Sync notification
07:07:40.150 sub-libp2p: Handler(ConnectionId(108)) =\u003e Notification(12D3KooWC4iEzYnBiQFjZ5GRY84Wc6s7BytAyYc3zky3mUf7yTAt, SetId(2), 150 bytes
07:07:40.150 sub-libp2p: External API \u003c= Message(12D3KooWC4iEzYnBiQFjZ5GRY84Wc6s7BytAyYc3zky3mUf7yTAt, SetId(2))
07:07:40.151 sub-libp2p: Handler(ConnectionId(108)) =\u003e Notification(12D3KooWC4iEzYnBiQFjZ5GRY84Wc6s7BytAyYc3zky3mUf7yTAt, SetId(2), 150 bytes)
07:07:40.151 sub-libp2p: External API =\u003e Notification(PeerId(\"12D3KooWKdovqTGfjXRdtoSG4QY5ctGQprUsuReYxyujgE16ZcQa\"), OnHeap(\"/b0a8d493285c2df73290dfb7e61f870f17b41801197a149ca93654499ea3dafe/grandpa/1\"), 150 bytes)
07:07:40.151 sub-libp2p: Handler(PeerId(\"12D3KooWKdovqTGfjXRdtoSG4QY5ctGQprUsuReYxyujgE16ZcQa\")) \u003c= Sync notification
07:07:40.151 sub-libp2p: External API =\u003e Notification(PeerId(\"12D3KooWKdovqTGfjXRdtoSG4QY5ctGQprUsuReYxyujgE16ZcQa\"), OnHeap(\"/b0a8d493285c2df73290dfb7e61f870f17b41801197a149ca93654499ea3dafe/grandpa/1\"), 150 bytes)
07:07:40.151 sub-libp2p: Handler(PeerId(\"12D3KooWKdovqTGfjXRdtoSG4QY5ctGQprUsuReYxyujgE16ZcQa\")) \u003c= Sync notification
07:07:40.151 sub-libp2p: External API =\u003e Notification(PeerId(\"12D3KooWKdovqTGfjXRdtoSG4QY5ctGQprUsuReYxyujgE16ZcQa\"), OnHeap(\"/b0a8d493285c2df73290dfb7e61f870f17b41801197a149ca93654499ea3dafe/grandpa/1\"), 150 bytes)
07:07:40.151 sub-libp2p: Handler(PeerId(\"12D3KooWKdovqTGfjXRdtoSG4QY5ctGQprUsuReYxyujgE16ZcQa\")) \u003c= Sync notification
07:07:40.151 grandpa: Polling round 559, state = State { prevote_ghost: None, finalized: None, estimate: None, completable: false }, step = Some(Prevoted)
07:07:40.151 grandpa: Round 559: Got incoming message
07:07:40.151 grandpa: Round 559: Got incoming message
07:07:40.151 grandpa: Round 559: Got incoming message
07:07:40.151 grandpa: Round 559: Got incoming message
07:07:40.151 grandpa: Round 559: Got incoming message
07:07:40.151 grandpa: Round 559: Got incoming message
07:07:40.151 grandpa: Round 559: Got incoming message
07:07:40.151 grandpa: Round 559: Got incoming message
07:07:40.151 grandpa: Round 559: Got incoming message
07:07:40.151 grandpa: Round 559: Got incoming message
07:07:40.151 grandpa: Round 559: Got incoming message
07:07:40.151 sub-libp2p: External API =\u003e Notification(PeerId(\"12D3KooWKdovqTGfjXRdtoSG4QY5ctGQprUsuReYxyujgE16ZcQa\"), OnHeap(\"/b0a8d493285c2df73290dfb7e61f870f17b41801197a149ca93654499ea3dafe/grandpa/1\"), 150 bytes)
07:07:40.151 sub-libp2p: Handler(PeerId(\"12D3KooWKdovqTGfjXRdtoSG4QY5ctGQprUsuReYxyujgE16ZcQa\")) \u003c= Sync notification
07:07:40.151 sub-libp2p: External API =\u003e Notification(PeerId(\"12D3KooWKdovqTGfjXRdtoSG4QY5ctGQprUsuReYxyujgE16ZcQa\"), OnHeap(\"/b0a8d493285c2df73290dfb7e61f870f17b41801197a149ca93654499ea3dafe/grandpa/1\"), 150 bytes)
07:07:40.151 sub-libp2p: Handler(PeerId(\"12D3KooWKdovqTGfjXRdtoSG4QY5ctGQprUsuReYxyujgE16ZcQa\")) \u003c= Sync notification
07:07:40.151 sub-libp2p: External API \u003c= Message(12D3KooWC4iEzYnBiQFjZ5GRY84Wc6s7BytAyYc3zky3mUf7yTAt, SetId(2))
07:07:40.151 sub-libp2p: External API =\u003e Notification(PeerId(\"12D3KooWKdovqTGfjXRdtoSG4QY5ctGQprUsuReYxyujgE16ZcQa\"), OnHeap(\"/b0a8d493285c2df73290dfb7e61f870f17b41801197a149ca93654499ea3dafe/grandpa/1\"), 150 bytes)
07:07:40.151 sub-libp2p: Handler(PeerId(\"12D3KooWKdovqTGfjXRdtoSG4QY5ctGQprUsuReYxyujgE16ZcQa\")) \u003c= Sync notification
07:07:40.152 sub-libp2p: Handler(ConnectionId(108)) =\u003e Notification(12D3KooWC4iEzYnBiQFjZ5GRY84Wc6s7BytAyYc3zky3mUf7yTAt, SetId(2), 150 bytes)
07:07:40.152 sub-libp2p: External API \u003c= Message(12D3KooWC4iEzYnBiQFjZ5GRY84Wc6s7BytAyYc3zky3mUf7yTAt, SetId(2))

Imod7 commented 1 year ago

The team of @bhargavtheerthamcb is also experiencing the peer/sync issue.

Summary of their Setup & Changes

They were setting up a new set of polkadot & kusama nodes in Kubernetes and they noticed low peers. Nodes location : US.
They have other polkadot nodes not running in K8s and those nodes do not have the peer problem. Configuration between the two setups are same except the nodes not running in K8s have more CPU/memory.
In K8s they are starting with 5 cores and 25Gb of RAM but neither is being fully utilized as per our last communication.

They are running the nodes with the commands


--rpc-port 19933 --rpc-external --ws-external --rpc-cors all --no-telemetry --pruning archive --chain=polkadot --in-peers 50 --out-peers 50

--rpc-port 19933 --rpc-external --ws-external --rpc-cors all --no-telemetry --pruning archive --chain=kusama --in-peers 50 --out-peers 50



- I indicated to change the values of `in` and `out` peers as per the suggestion [here](https://github.com/paritytech/substrate/issues/13295#issuecomment-1514547660) and open the p2p port.
- They reported that they opened the p2p port, `30333`.
- The issue of low peers is persisting.
- Their logs (with `-lgrandpa=trace`and `-lsync,sub-libp2p=trace`) were uploaded 
    - in the same [google drive](https://drive.google.com/drive/folders/1ESrvispxCpoFDFqEZYkSeuRsOsGudInt?usp=share_link) 
    - filename : `cb-debug_logs`

altonen commented 1 year ago

I looked at the logs and the node is only establishing outbound connections, i.e., the role is always Dialer meaning there are seemingly no inbound connections to the node. The problem is probably in the Kubernetes configuration, please ensure that the nodes are able to accept inbound connections. You can test this relatively easily by spinning up the cluster and then from the outside, try and connect to one of the nodes running inside of it.

BulatSaif commented 1 year ago

@Imod7 which helm chart are they using to set up a node in Kubernetes? if it is this paritytech/helm-charts try to set up:

node: 
  perNodeServices:
    relayP2pService:
        enabled: true
    setPublicAddressToExternalIp:
      enabled: true

This will proxy port 30333 as Kubernetes NodePort and set --public-addr to correct value:

--public-addr=/ip4/${EXTERNAL_IP}/tcp/${RELAY_CHAIN_P2P_PORT}

bhargavtheertham-cb commented 1 year ago

@altonen , upon deeper investigation, we are noticing the same finality lag and syncing issues in our non - k8s env also. We noticed that our deploy times ( restore a latest snapshot and catch up to tip) is not taking several days whereas a month or two ago it consistently completed in 30 mins or so. In the non k8s env, we didnt have 30333 open and it had worked fine for couple yrs. Is opening the port 30333 a new requirement? Also curious on how opening the port helps keep stable peers.

altonen commented 1 year ago

Are you able to pinpoint a release when the deploy times started to degrade?

I'm not sure if we're talking about the same thing but opening the P2P port is not a requirement but if you have a low number of peers and that negatively affects the node, then opening the port should help with that.

SimonKraus commented 1 year ago

As discussed in https://github.com/paritytech/polkadot/issues/7259 I'm providing the startup logs and first few minutes of grandpa trace.

The node is generating approx 100k lines of log per minute with -lgrandpa=trace.

The following is a graylog export (just in case you wonder about the format) https://drive.google.com/file/d/1-Cx-lKDQsuuLrapIKdIilOS4OxO-9DBm/view?usp=sharing

bhargavtheertham-cb commented 1 year ago

@altonen the issue seems to have started with 0.9.41. We are still not seeing any improvement in our nodes' stability / availablity. We are using our own helm chart and not using the above one. We have opened up the P2P port but it doesnt seem to help. Validated that port is open

telnet xx.xx.xx.xx 30333 Trying xx.xx.xx.xx .. Connected to ipxx.xx.xx.xx ec2.internal. Escape character is '^]'.

We had zero issues with peering or node stability before 0.9.41 and the nodes were very stable. Are other customers having the same issue with nodes ? This is becoming very critiical for us.

altonen commented 1 year ago

Can you give -lsync,sub-libp2p=trace logs? Something doesn't make sense. As of right now your report is the only open one about peering issues.

bhargavtheertham-cb commented 1 year ago

Hi @altonen The logs were already uploaded. Its in one of the threads above. Please let me know if you are not able to access it. google drive

altonen commented 1 year ago

I've looked at these logs once already and the node was only able to establish outbound connections. I believe this is further corroborated by the fact that the node is only showing the four following listen addresses:

2023-05-20 15:55:43.716 TRACE tokio-runtime-worker sub-libp2p: Libp2p => NewListenAddr(/ip4/127.0.0.1/tcp/30333/ws)    
2023-05-20 15:55:43.716 TRACE tokio-runtime-worker sub-libp2p: Libp2p => NewListenAddr(/ip6/::1/tcp/30333/ws)    
2023-05-20 15:55:43.716 TRACE tokio-runtime-worker sub-libp2p: Libp2p => NewListenAddr(/ip4/10.86.61.87/tcp/30333/ws)    
2023-05-20 15:55:43.716 TRACE tokio-runtime-worker sub-libp2p: Libp2p => NewListenAddr(/ip6/fe80::80b5:a3ff:feac:d83c/tcp/30333/ws)

which I believe are all local addresses. Since the last time you provided logs, you have made no changes to your network configurations? Where is the node located? Instead of telnet, can you try connection to the node using another Polkadot node by setting the node that is having trouble with peers as a reserved peer with --reserved-peers <addr> and checking whether the connection is established?

bhargavtheertham-cb commented 1 year ago

Hi @altonen the one change we made after uploading logs was to open up the P2P port. The nodes are located in the US East region. I can try using the --reserved-peers parameter from a different node pointing to the prod pods. Only challenge is in K8s the pod IPs can change if they reboot but can certainly try. Do we need to collect the traces while doing the above?

altonen commented 1 year ago

-lsync,sub-libp2p=trace logs for both nodes would be very useful

bhargavtheertham-cb commented 1 year ago

@altonen Would you be available for a quick call on monday US time so we can discuss this? thanks

altonen commented 1 year ago

Sure, I'm available until 9PM EEST

bhargavtheertham-cb commented 1 year ago

Thanks @altonen would you please share your email, So I can send an invite. Thank you.

bhargavtheertham-cb commented 1 year ago

Hi @altonen @Imod7 , Over the last month we were working to set up a new k8s cluster with nodes that have public IP assigned so other nodes can talk to it. I deployed a polkadot node in the new cluster and enabled tracing (attached). Last time you reviewed the logs, you mentioned that you do not see any incoming connections. Can you please review this one and let me know if you see what you expect to? thanks logs-after-change.log

altonen commented 1 year ago

@bhargavtheertham-cb

The node didn't get any inbound connections, only dialed other nodes which is not itself suspicious since the node was running for only 2 minutes and if it was run with a fresh PeerId, it takes a while before it'll be in the DHT.

But these are the listen addresses the node discovered for itself:

2023-08-14 02:20:54.083 sub-libp2p: Libp2p => NewListenAddr(/ip6/::1/tcp/30333/ws)    
2023-08-14 02:20:54.083 sub-libp2p: Libp2p => NewListenAddr(/ip4/127.0.0.1/tcp/30333/ws)    
2023-08-14 02:20:54.083 sub-libp2p: Libp2p => NewListenAddr(/ip6/fe80::94db:f6ff:fe47:189f/tcp/30333/ws)    
2023-08-14 02:20:54.083 sub-libp2p: Libp2p => NewListenAddr(/ip4/10.230.79.235/tcp/30333/ws)

which are all local addresses and will be ignored by other nodes so your node won't be able to accept inbound connections.

bhargavtheertham-cb commented 1 year ago

@altonen , Thanks . how do we configure the listen addresses on the public IP? I saw some messages in the log that said

2023-08-14 02:43:05 Accepting new connection 3/100
2023-08-14 02:43:05 Accepting new connection 4/100
2023-08-14 02:43:05 Accepting new connection 5/100
2023-08-14 02:43:05 Accepting new connection 6/100

Is the outbound connection?

Imod7 commented 1 year ago

@bhargavtheertham-cb Since your environment is k8s, can you please first check your helm chart/configuration files and see if the values are the same as the ones mentioned by @BulatSaif in the previous comment here ?

bhargavtheertham-cb commented 1 year ago

we are not using the same helm charts as them but I am looking at the code to see how we set the public-addr to the command startup.

bhargavtheertham-cb commented 1 year ago

@altonen I have set the public addr correctly now. Please review the logs .

2023-08-14 13:58:49.430  INFO main sub-libp2p: 🏷   Local node identity is: 12D3KooWBY9x989zmcfe7SjYRKTdkAoxiW3Xwaof135syZoMArTL
2023-08-14 13:58:49.431  INFO main sub-libp2p: 🔍 Discovered new external address for our node: /ip4/54.172.189.34/tcp/30333/p2p/12D3KooWBY9x989zmcfe7SjYRKTdkAoxiW3Xwaof135syZoMArTL

latest.log

altonen commented 1 year ago

@bhargavtheertham-cb In case the node does have a public address, please try running it longer with the same PeerId (as you're doing, at least according to two latest logs) so other nodes have a change to discover it through DHT and can attempt establishing a connection to it. Right now it only showed inbound connections.

The logs start with this line: Executing: /app/polkadot/target/release/polkadot --rpc-port 19933 --rpc-external --rpc-cors all --no-telemetry --pruning archive --chain=polkadot --in-peers 50 --out-peers 50 -lsync,sub-libp2p=trace

You may need to provide --public-addr, as instructed in https://github.com/paritytech/substrate/issues/13295#issuecomment-1556138764

bhargavtheertham-cb commented 1 year ago

Hi @altonen I did add the public-addr flag to the end of the command which is why it is able to listen on it . I will fix the echo command which was not updated to include the latest flag. I will try running it for longer

bhargavtheertham-cb commented 1 year ago

Hi @altonen, @Imod7 . We make some tweaks to our setup so the public IP is now reachable externally. Would you please review this fresh set of logs to see if you are seeing inbound connections?.. final.log

altonen commented 1 year ago

@bhargavtheertham-cb

it looks to be working now, there are few lines like this: 2023-08-14 23:53:52.598 TRACE tokio-runtime-worker sub-libp2p: Libp2p => Connected(12D3KooWMJB9X1hBaFoP9ZTQfB4ZaXfg9iifckFGC2Vvyx3NPiwt, SetId(0), Listener { local_addr: "/ip4/10.230.76.47/tcp/30333/ws", send_back_addr: "/ip4/209.195.12.133/tcp/39508/ws" }, ConnectionId(812)): Not requested by PSM, disabling.

meaning 12D3KooWMJB9X1hBaFoP9ZTQfB4ZaXfg9iifckFGC2Vvyx3NPiwt dialed and reached your node

altonen commented 1 year ago

Closing as resolved since the finality hasn't been lagging after https://github.com/paritytech/substrate/pull/13829

paritytech / substrate