paritytech / polkadot-sdk

The Parity Polkadot Blockchain SDK
https://polkadot.network/

Kusama parachains block production slowed down #3314

Closed · KarimJedda closed this 5 months ago

KarimJedda commented 7 months ago

Is there an existing issue?

Experiencing problems? Have you tried our Stack Exchange first?

Description of bug

We're investigating, let me know how we can help. @eskimor would you mind adding the peer count screenshot here as well for reference please?

cc @lovelaced

Steps to reproduce

JelliedOwl commented 7 months ago

However you implement this (if you ever do), please don't forget about us collators, who are potentially running pretty old versions on the Kusama side... :-). How old a version does the CI process test to make sure it can still sync?

burdges commented 7 months ago

the node can just report whatever the runtime expects it to report

Yes, but then if they change the code like this, it's their fault if they get slashed somehow.

We could easily design this so they must look at the code before reporting the false version, at which point they're doing more work than not upgrading. Imho that's overkill, since we're really trying to help them, and an unnecessary development burden (each version has an associated random u64).
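Purely to illustrate the idea in the previous paragraph (nothing like this exists in the codebase), here is a toy sketch in which each hypothetical release carries a random u64 token that a node must echo when reporting its version; all names and values are made up:

```python
# Toy illustration of the idea above, not anything in polkadot-sdk.
# Each hypothetical release ships with its own random u64 token; a node must
# echo the token of the version it claims to run, so spoofing a newer version
# means digging the token out of that release's source first.
import secrets

# Made-up tokens for made-up releases.
RELEASE_TOKENS = {
    "1.7.0": 0x9F3C2D418A0755B2,
    "1.8.0": 0x1B64E0C973AD0F58,
}

def accept_version_report(claimed_version: str, token: int) -> bool:
    """Accept a version report only if the caller knows that release's token."""
    return RELEASE_TOKENS.get(claimed_version) == token

# A node genuinely running 1.8.0 knows its own token...
assert accept_version_report("1.8.0", RELEASE_TOKENS["1.8.0"])
# ...while a guess is wrong except with negligible probability.
assert not accept_version_report("1.8.0", secrets.randbits(64))
```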

Arguably, we should do this on Polkadot, but not immediately on Kusama, since we want some misbehavior on Kusama.

please don't forget about us collators

We're discussing validators mostly, but yes, we do need to take host upgrades into consideration for collators too. It'll be off-chain messaging (XCMP) that really brings the chaos here.

tunkas commented 6 months ago

Not sure if this will be helpful on this issue, but it might be, so I'm posting my logs. It happened on my Kusama validator in the first session I para-validated after getting into the active set. I was missing all the votes in that session, so 45 minutes into the active set a restart was performed. I'm not sure whether this had any influence on resolving the matter, since I still missed all the votes within that session. The next session went by without missing a significant amount of votes. Here are the logs for that session and the subsequent one. I'm running on the latest version, of course. Kusama error report.txt

dcolley commented 6 months ago

https://apps.turboflakes.io/?chain=kusama#/validator/JKhBBSWkr8BJKh5eFBtRux4hsDq4sAxvvmMU426qUA9aqEQ

  1. What version are you running? 1.8.0

  2. Logs around that time, the more logs you can provide the better.

2024-03-11 18:56:54 💤 Idle (12 peers), best: #22250622 (0x70e8…48f8), finalized #22250619 (0x0168…d6c3), ⬇ 2.9MiB/s ⬆ 2.6MiB/s
2024-03-11 18:56:59 💤 Idle (12 peers), best: #22250622 (0x70e8…48f8), finalized #22250620 (0x1248…9c9b), ⬇ 3.7MiB/s ⬆ 3.9MiB/s
2024-03-11 18:57:00 ✨ Imported #22250623 (0x44c9…5815)
2024-03-11 18:57:04 💤 Idle (12 peers), best: #22250623 (0x44c9…5815), finalized #22250620 (0x1248…9c9b), ⬇ 2.8MiB/s ⬆ 3.5MiB/s
2024-03-11 18:57:06 ✨ Imported #22250624 (0xf82d…5620)
2024-03-11 18:57:06 ✨ Imported #22250624 (0x6339…6a9e)
2024-03-11 18:57:09 💤 Idle (12 peers), best: #22250624 (0xf82d…5620), finalized #22250621 (0x1429…942a), ⬇ 2.2MiB/s ⬆ 3.0MiB/s
2024-03-11 18:57:12 ✨ Imported #22250625 (0xd906…805a)
2024-03-11 18:57:14 💤 Idle (12 peers), best: #22250625 (0xd906…805a), finalized #22250622 (0x70e8…48f8), ⬇ 1.6MiB/s ⬆ 2.1MiB/s
2024-03-11 18:57:18 ✨ Imported #22250626 (0x3d86…c5d6)
2024-03-11 18:57:18 ♻️ Reorg on #22250626,0x3d86…c5d6 to #22250626,0xa7af…d809, common ancestor #22250625,0xd906…805a
2024-03-11 18:57:18 ✨ Imported #22250626 (0xa7af…d809)
2024-03-11 18:57:19 💤 Idle (10 peers), best: #22250626 (0xa7af…d809), finalized #22250623 (0x44c9…5815), ⬇ 2.3MiB/s ⬆ 2.9MiB/s
2024-03-11 18:57:24 ✨ Imported #22250627 (0xffd6…0d62)
2024-03-11 18:57:24 ✨ Imported #22250627 (0x6aaf…cbca)
2024-03-11 18:57:24 ✨ Imported #22250627 (0x129f…5e18)
2024-03-11 18:57:24 💤 Idle (10 peers), best: #22250627 (0xffd6…0d62), finalized #22250624 (0xf82d…5620), ⬇ 2.1MiB/s ⬆ 2.8MiB/s
2024-03-11 18:57:29 💤 Idle (11 peers), best: #22250627 (0xffd6…0d62), finalized #22250625 (0xd906…805a), ⬇ 1.6MiB/s ⬆ 2.1MiB/s
2024-03-11 18:57:30 ✨ Imported #22250628 (0x41d1…dd49)
2024-03-11 18:57:30 ♻️ Reorg on #22250628,0x41d1…dd49 to #22250628,0xeb99…6bb5, common ancestor #22250627,0xffd6…0d62
2024-03-11 18:57:30 ✨ Imported #22250628 (0xeb99…6bb5)
2024-03-11 18:57:30 ✨ Imported #22250628 (0x1091…17dc)
2024-03-11 18:57:30 ✨ Imported #22250628 (0xd5e2…0382)
2024-03-11 18:57:34 💤 Idle (12 peers), best: #22250628 (0xeb99…6bb5), finalized #22250625 (0xd906…805a), ⬇ 2.1MiB/s ⬆ 3.5MiB/s
...
2024-03-11 18:57:48 ✨ Imported #22250631 (0xf38f…2bce)
2024-03-11 18:57:48 Cluster has too many pending statements, something wrong with our connection to our group peers
Restart might be needed if validator gets 0 backing rewards for more than 3-4 consecutive sessions pending_statements={ValidatorIndex(91): {(ValidatorIndex(91), CompactStatement::Seconded(0x5e3efe877a0496964281fe2fb92125aed001360d96d6521e0e0c9abc66ecdb6c))}, ValidatorIndex(92): {(ValidatorIndex(91), CompactStatement::Seconded(0x5e3efe877a0496964281fe2fb92125aed001360d96d6521e0e0c9abc66ecdb6c))}, ValidatorIndex(90): {(ValidatorIndex(91), CompactStatement::Seconded(0x5e3efe877a0496964281fe2fb92125aed001360d96d6521e0e0c9abc66ecdb6c))}} parent_hash=0xeb997bbb60ca75e90935d59c1b142fa57aac5402cd28b577fa2211cd7a026bb5
2024-03-11 18:57:48 Cluster has too many pending statements, something wrong with our connection to our group peers
Restart might be needed if validator gets 0 backing rewards for more than 3-4 consecutive sessions pending_statements={ValidatorIndex(90): {(ValidatorIndex(91), CompactStatement::Seconded(0x54ec5323d263a93d6fc973d7b5d7a426649ce7d3e87d33e447e09089f5e9b2bc))}, ValidatorIndex(91): {(ValidatorIndex(91), CompactStatement::Seconded(0x54ec5323d263a93d6fc973d7b5d7a426649ce7d3e87d33e447e09089f5e9b2bc))}, ValidatorIndex(92): {(ValidatorIndex(91), CompactStatement::Seconded(0x54ec5323d263a93d6fc973d7b5d7a426649ce7d3e87d33e447e09089f5e9b2bc))}} parent_hash=0x1091044e68e1e359c1518de03a776d0f14b81f7aed347fbc0fca7a49dfed17dc
2024-03-11 18:57:48 Cluster has too many pending statements, something wrong with our connection to our group peers
Restart might be needed if validator gets 0 backing rewards for more than 3-4 consecutive sessions pending_statements={ValidatorIndex(91): {(ValidatorIndex(91), CompactStatement::Seconded(0x43bd5e82bdc8ce77c77d3e68f983719bb94dd4e2e6b98af5d6e42523d9a1de5a))}, ValidatorIndex(92): {(ValidatorIndex(91), CompactStatement::Seconded(0x43bd5e82bdc8ce77c77d3e68f983719bb94dd4e2e6b98af5d6e42523d9a1de5a))}, ValidatorIndex(90): {(ValidatorIndex(91), CompactStatement::Seconded(0x43bd5e82bdc8ce77c77d3e68f983719bb94dd4e2e6b98af5d6e42523d9a1de5a))}} parent_hash=0xd5e2b78489e5eda3fabe2845417bce90215b94a7685d1419af1a75b6eacb0382
2024-03-11 18:57:48 Cluster has too many pending statements, something wrong with our connection to our group peers
Restart might be needed if validator gets 0 backing rewards for more than 3-4 consecutive sessions pending_statements={ValidatorIndex(90): {(ValidatorIndex(91), CompactStatement::Seconded(0x927a5f537c174e8d4790b7e2708a242bebba6572db71424dbb4ace9511b8d1c3))}, ValidatorIndex(91): {(ValidatorIndex(91), CompactStatement::Seconded(0x927a5f537c174e8d4790b7e2708a242bebba6572db71424dbb4ace9511b8d1c3))}, ValidatorIndex(92): {(ValidatorIndex(91), CompactStatement::Seconded(0x927a5f537c174e8d4790b7e2708a242bebba6572db71424dbb4ace9511b8d1c3))}} parent_hash=0x41d1cf34c40cdda18adf75b22ebf37bad22b623fad674d0d050d3531865ddd49
2024-03-11 18:57:49 💤 Idle (12 peers), best: #22250631 (0xf38f…2bce), finalized #22250628 (0x1091…17dc), ⬇ 3.4MiB/s ⬆ 3.4MiB/s
2024-03-11 18:57:54 ✨ Imported #222506…



3. Any event around that time, e.g. we upgraded/restarted the validator, we observed the validator entered the active set, anything that you think might be relevant.
No intervention

4. Anything relevant/special about your network setup
Running in docker

5. OS, does the validator pass the hardware requirement checks?
yes

6. Any action taken to get the validator out of this state, or did it recover by itself?
Recovered by itself; the next session was A+.

eskimor commented 6 months ago

Running in docker

I remember we had issues with Docker in the past. Are other affected validators also running on Docker?

dcolley commented 6 months ago

These are the startup logs:

2024-03-04 20:09:38 This chain is not in any way
2024-03-04 20:09:38       endorsed by the
2024-03-04 20:09:38      KUSAMA FOUNDATION
2024-03-04 20:09:38 ----------------------------
2024-03-04 20:09:38 Parity Polkadot
2024-03-04 20:09:38 ✌️  version 1.8.0-ec7817e5adc
2024-03-04 20:09:38 ❤️  by Parity Technologies <admin@parity.io>, 2017-2024
2024-03-04 20:09:38 📋 Chain specification: Kusama
2024-03-04 20:09:38 🏷  Node name: METASPAN3 (ALSO TRY POOL #50)
2024-03-04 20:09:38 👤 Role: AUTHORITY
2024-03-04 20:09:38 💾 Database: ParityDb at /data/chains/ksmcc3/paritydb/full
2024-03-04 20:09:40 🚀 Using prepare-worker binary at: "/usr/lib/polkadot/polkadot-prepare-worker"
2024-03-04 20:09:40 🚀 Using execute-worker binary at: "/usr/lib/polkadot/polkadot-execute-worker"
2024-03-04 20:09:40 Can't use warp sync mode with a partially synced database. Reverting to full sync mode.
2024-03-04 20:09:40 🏷  Local node identity is: 12D3KooWPVSft536jRzKSGdiYLx4FQ2XekXrjkNjaSfxpxDCM1LW
2024-03-04 20:09:40 Warp sync failed. Continuing with full sync.
2024-03-04 20:09:40 💻 Operating system: linux
2024-03-04 20:09:40 💻 CPU architecture: x86_64
2024-03-04 20:09:40 💻 Target environment: gnu
2024-03-04 20:09:40 💻 CPU: 12th Gen Intel(R) Core(TM) i9-12900K
2024-03-04 20:09:40 💻 CPU cores: 16
2024-03-04 20:09:40 💻 Memory: 128584MB
2024-03-04 20:09:40 💻 Kernel: 5.15.0-89-generic
2024-03-04 20:09:40 💻 Linux distribution: Ubuntu 22.04.3 LTS
2024-03-04 20:09:40 💻 Virtual machine: no
2024-03-04 20:09:40 📦 Highest known block at #22150878
2024-03-04 20:09:40 〽️ Prometheus exporter started at 0.0.0.0:9615
2024-03-04 20:09:40 Running JSON-RPC server: addr=127.0.0.1:9944, allowed origins=["http://localhost:*", "http://127.0.0.1:*", "https://localhost…
2024-03-04 20:09:40 👶 Starting BABE Authorship worker
2024-03-04 20:09:40 🥩 BEEFY gadget waiting for BEEFY pallet to become available...
2024-03-04 20:09:40 🚨 Some security issues have been detected.
Running validation of malicious PVF code has a higher risk of compromising this machine.
  - Optional: Cannot unshare user namespace and change root, which are Linux-specific kernel security features: not available: unshare user and m…
  - Optional: Cannot call clone with all sandboxing flags, a Linux-specific kernel security features: not available: could not clone, errno: EPER…
2024-03-04 20:09:40 👮‍♀️ Running in Secure Validator Mode. It is highly recommended that you operate according to our security guidelines.
More information: https://wiki.polkadot.network/docs/maintain-guides-secure-validator#secure-validator-mode
2024-03-04 20:09:40 Sending fatal alert BadCertificate
2024-03-04 20:09:40 🔍 Discovered new external address for our node: /ip4/195.144.22.130/tcp/30333/p2p/12D3KooWPVSft536jRzKSGdiYLx4FQ2XekXrjkNjaS…
2024-03-04 20:09:41 ✨ Imported #22150879 (0xa7de…1dc5)

tunkas commented 6 months ago

I remember we had issues with Docker in the past. Are other affected validators also running on Docker?

I'm not running on Docker 🤷🏻‍♂️

eskimor commented 6 months ago

One interesting metric would be: "polkadot_parachain_peer_count" for the validation peerset. To be queried like this:

polkadot_parachain_peer_count{protocol=~"validation/.*"}

How do these numbers behave with regard to era points? Once you become a para validator this should be around 1k.
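For operators without a Prometheus server handy, a minimal sketch that scrapes the node's own metrics endpoint and filters for this metric; the host and port are assumptions for your setup (9615 matches the "Prometheus exporter started at 0.0.0.0:9615" line in the startup logs above):

```python
# Minimal sketch: scrape the node's Prometheus endpoint directly and print the
# polkadot_parachain_peer_count samples whose protocol label belongs to the
# validation peerset. Assumes the metrics endpoint is reachable on localhost:9615.
import urllib.request

METRICS_URL = "http://localhost:9615/metrics"

def validation_peer_counts(url: str = METRICS_URL) -> dict[str, float]:
    text = urllib.request.urlopen(url, timeout=5).read().decode()
    counts = {}
    for line in text.splitlines():
        # Exposition format: metric_name{labels} value
        if line.startswith("polkadot_parachain_peer_count") and "validation" in line:
            series, value = line.rsplit(" ", 1)
            counts[series] = float(value)
    return counts

if __name__ == "__main__":
    for series, value in validation_peer_counts().items():
        print(series, "=", value)
```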

xxbbxb commented 6 months ago

Yes, I can confirm we do experience a significant drop in connected peers after updating/restarting the node, until the moment we rotate session keys (see attached image).

eskimor commented 6 months ago

We found the cause and are working on a fix. What operators can do right away: make sure the node can persist all of its data, otherwise it will generate a new PeerId on each restart, and this is causing issues.
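A minimal sketch of the kind of pre-restart check this implies, assuming the default Kusama base-path layout (base_path/chains/ksmcc3/network/secret_ed25519, consistent with the ParityDb path in the startup logs above); adjust the path for your own setup, or pin the key explicitly with the `--node-key-file` option if your binary supports it:

```python
# Minimal sketch: before restarting, verify the libp2p network secret sits on a
# persisted volume, so the node keeps the same PeerId across restarts.
# Path assumption: default Kusama layout under the node's base path.
from pathlib import Path

BASE_PATH = Path("/data")  # assumption: your --base-path / mounted Docker volume
NODE_KEY = BASE_PATH / "chains" / "ksmcc3" / "network" / "secret_ed25519"

if NODE_KEY.is_file():
    print(f"Network key found at {NODE_KEY}; the PeerId should survive a restart.")
else:
    print("No persisted network key found; the node will mint a new PeerId on restart.")
```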

Polkadot-Forum commented 6 months ago

This issue has been mentioned on Polkadot Forum. There might be relevant details there:

https://forum.polkadot.network/t/async-backing-development-updates/6176/5

alexggh commented 6 months ago

Conclusions

There are 3 failure modes going on on Kusama, and all of them create validators with 0 backing rewards (see the sketch after this list for one way to check a validator's era points):

  1. Validators not updated to versions supporting async backing: they will get 0 points until they upgrade -> nothing to be done here.

  2. Validators that change their PeerId during a restart because it is not persisted. The fix should come with https://github.com/paritytech/polkadot-sdk/issues/3673; until then the problem can be worked around by persisting the PeerId. They will get 0 points after a restart until the workaround is applied.

  3. Validators reporting that they get 0 backing points in the first session after they enter the active set, reported several times in this ticket and in https://github.com/paritytech/polkadot-sdk/issues/3613. This is not as problematic as the others, because validators recover in the next session, but it is a problem nonetheless.
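As referenced above, a rough sketch of checking a validator's era points from chain state, using the third-party py-substrate-interface package; the RPC endpoint, the placeholder stash address, and the exact shape of the decoded storage value are assumptions to verify against your own environment:

```python
# Rough sketch: read Staking::ErasRewardPoints for the active era and report the
# points credited to one validator stash; zero points across several sessions
# matches the failure modes listed above.
# Requires the third-party package: pip install substrate-interface
from substrateinterface import SubstrateInterface

RPC_URL = "wss://kusama-rpc.polkadot.io"  # assumption: any Kusama RPC node works
STASH = "YOUR_VALIDATOR_STASH_ADDRESS"    # placeholder

substrate = SubstrateInterface(url=RPC_URL)
active_era = substrate.query("Staking", "ActiveEra").value["index"]
points = substrate.query("Staking", "ErasRewardPoints", [active_era]).value

individual = dict(points["individual"])   # decoded as (stash, points) pairs
print(f"era {active_era}: {individual.get(STASH, 0)} era points for {STASH}")
```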

Also thank you @xxbbxb for providing logs and helping me root cause item number 2.

Next priorities in order:

alexggh commented 6 months ago

Root-cause what is wrong with validators entering the active set.

Ok, I think I figured out why this happens for new nodes entering the active set. Long story short: the node will start advertising its AuthorityId on the DHT only after it becomes active, so the other nodes won't be able to detect it at the beginning of the session. More details about this in: https://github.com/paritytech/polkadot-sdk/pull/3722/files
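A toy model of the timing problem described above (not the real authority-discovery code; see the linked PR for the actual change): if the AuthorityId record is only published once the node is already active, peers that look it up at the start of that same session find nothing and only succeed a session later.

```python
# Toy model of the discovery race described above, not the real
# authority-discovery implementation.
dht: dict[str, str] = {}  # AuthorityId -> advertised network address

def publish_if_active(session: int, active_from: int, authority: str, addr: str) -> None:
    # Behaviour as described: the record is only published once the node is active.
    if session >= active_from:
        dht[authority] = addr

def lookup(authority: str) -> str | None:
    return dht.get(authority)

ACTIVE_FROM = 10  # session in which the new validator enters the active set

# Session 10 begins: peers look the new validator up before its record exists.
assert lookup("new-validator") is None          # -> 0 backing points that session
publish_if_active(10, ACTIVE_FROM, "new-validator", "/ip4/1.2.3.4/tcp/30333")

# By the next session the record is on the DHT and the validator is found.
assert lookup("new-validator") == "/ip4/1.2.3.4/tcp/30333"
```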

alexggh commented 5 months ago

Closing: this is not happening anymore, and the remaining open action item is being tracked in https://github.com/paritytech/polkadot-sdk/issues/3673