paritytech / polkadot-sdk

The Parity Polkadot Blockchain SDK
https://polkadot.network/

Kusama parachains block production slowed down #3314

Closed KarimJedda closed 2 months ago

KarimJedda commented 4 months ago

Description of bug

We're investigating, let me know how we can help. @eskimor would you mind adding the peer count screenshot here as well for reference please?

cc @lovelaced

KarimJedda commented 4 months ago

Below is an extract from our data for the Kusama relay chain:

--- number of blocks that took more than 6002 ms to produce (limited to start of February 2024 until today)
SELECT 
  CAST(block_time as DATE) as day, 
  count(number) as slow_blocks
FROM `parity-data-infra-evaluation.dotlake_playground.kusama_feb_24_blocktimes` 
WHERE time_to_produce_ms > 6002 
GROUP BY day
ORDER BY day ASC;
day slow_blocks
2024-02-01 148
2024-02-02 142
2024-02-03 106
2024-02-04 136
2024-02-05 137
2024-02-06 118
2024-02-07 162
2024-02-08 139
2024-02-09 109
2024-02-10 112
2024-02-11 106
2024-02-12 106
2024-02-13 (17:30 CET) 75

For reference, here's the same thing but by day and since January 2024:

Screenshot from 2024-02-13 21-02-38

Here are the top 15, ordered by the time it took for them to be produced:

SELECT 
  number,
  block_time,
  time_to_produce_ms
FROM `parity-data-infra-evaluation.dotlake_playground.kusama_feb_24_blocktimes` 
ORDER BY time_to_produce_ms DESC
LIMIT 15;

(Subscan linked for convenience)

number block_time time_to_produce_ms
21806210 2024-02-09 19:31:54.000000 UTC 42000
21824720 2024-02-11 02:30:06.000000 UTC 41999
21758223 2024-02-06 11:12:48.001000 UTC 18001
21768615 2024-02-07 04:36:00.001000 UTC 18001
21818664 2024-02-10 16:20:54.000000 UTC 18000
21754427 2024-02-06 04:52:30.000000 UTC 18000
21727213 2024-02-04 07:21:00.000000 UTC 18000
21851077 2024-02-12 22:32:18.000000 UTC 17999
21766166 2024-02-07 00:29:30.000000 UTC 17999
21686302 2024-02-01 10:58:12.000000 UTC 17999
21778337 2024-02-07 20:53:48.000000 UTC 17992
21745818 2024-02-05 14:28:43.863000 UTC 13862
21711425 2024-02-03 04:58:12.481000 UTC 12480
21811860 2024-02-10 04:58:06.168000 UTC 12167
21700641 2024-02-02 10:58:06.137000 UTC 12137

And the list of slow blocks (that took longer than 6002 ms, February 2024) for reference: https://storage.googleapis.com/dashboards.data.paritytech.io/debug/slow_blocks.csv

If there is anything else we can provide on the parachain side, please let us know.

alexggh commented 4 months ago

Investigation

We did some initial analysis. Looking at the dashboard, it seems that around 2024-02-12 18:00 UTC the average block times jumped slightly, see the graph below.

Screenshot 2024-02-14 at 13 18 04

Looking at the dashboards around the same time, we also see some errors in the pending-connection metrics on our Kusama validators. Not sure if it is related to the degradation, but I'm putting it here in case it rings a bell.

Screenshot 2024-02-14 at 09 40 51

Potential root cause (behaviour observed after 2024-02-12 18:00 UTC)

Looking at a few parachains, we observed that there are 1-minute periods where no blocks get backed for them; after further investigation we observed that it is always during the same backing group of the session that the parachains don't produce blocks:

For example, for session 36926 the AssetHub and BridgeHub parachains don't produce blocks while they are allocated to group 20. This can be double-checked by looking at the CandidateIncluded events: the parachains produce blocks while on group 19, both have a gap while they are on group 20, and they start producing blocks again once their assignment moves to group 21.

Parachain 1000: https://polkadot.js.org/apps/?rpc=wss%3A%2F%2Frpc.dotters.network%2Fkusama#/explorer/query/21870979

Parachain 1002: https://polkadot.js.org/apps/?rpc=wss%3A%2F%2Frpc.dotters.network%2Fkusama#/explorer/query/21870956

The same thing happens during session 36927, and this time group 44 is the bad one.

The potential root cause is that we have a bunch of validators that are either in a bad state or running a version that does not support the async-backing logic, so they can't participate in the backing process. If 2 out of the 3 validators in a backing group are bad, the parachain won't produce blocks for the entire duration (1 min) it is allocated to that backing group.

Looking at https://apps.turboflakes.io/?chain=kusama#/insights at session 36926, there seem to be around 17 out of 300 validators with 0 backing votes, so those might be the culprits for why we have a bad backing group every session.

The list of validators for session 36927 is here:

Screenshot 2024-02-14 at 13 37 15 Screenshot 2024-02-14 at 13 37 27

Note! This is just a snapshot for session 36927, so it is not an exhaustive list.

Next steps

ggwpez commented 4 months ago

Ping validators on Kusama lounge to see what versions the ones with 0 backing votes are running.

You can try to query their version RPC.
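For reference, a node's version is exposed through the standard system_version JSON-RPC method, so a query along these lines would do it; the endpoint below is only a placeholder, and this of course only works for nodes that actually expose their RPC:

# returns the node's version string, e.g. "1.7.0-2fe3145ab9b"
curl -s -H "Content-Type: application/json" \
  -d '{"id":1,"jsonrpc":"2.0","method":"system_version","params":[]}' \
  https://some-validator.example.com:9944/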

bkchr commented 4 months ago

Ping validators on Kusama lounge to see what versions the ones with 0 backing votes are running.

You can try to query their version RPC.

I hope that you don't find anyone that has opened their rpc port to the outside 🙃

bkchr commented 4 months ago

What we should do is write some tool that connects to all validators on the p2p layer. We can query all validator addresses using the DHT, and the initial message contains the version information about the node. Using this we could generate a list of validators that run on an old version.

ordian commented 4 months ago

What we should do is write some tool that connects to all validators on the p2p layer. We can query all validator addresses using the DHT, and the initial message contains the version information about the node. Using this we could generate a list of validators that run on an old version.

I've heard someone already built this tool: https://forum.polkadot.network/t/experiment-p2p-network-visualization/4583/8

bkchr commented 4 months ago

Perfect! This is exactly what we need.

sandreim commented 4 months ago

Would it be possible to implement a better incentive for validators to upgrade? One idea is to remove any validator from the active set if their version is lagging 2+ releases behind, but I am not sure how feasible that is to implement.

bkchr commented 4 months ago

Sounds like you are proposing to do hard forks :P Maybe a little bit softer than ETH hard forks, but yeah :P

People should already be getting fewer rewards, and apparently they don't care.

BulatSaif commented 4 months ago

Here is what I was able to extract from the DHT with nebula:

    802 "Parity Polkadot/v1.7.0-97df9dd6554 
    405 "Parity Polkadot/v1.6.0-680adc78ab3 
    106 "Parity Polkadot/v1.7.0-2fe3145ab9b 
     53 "Parity Polkadot/v1.5.0-a3dc2f15f23 
     53 "Parity Polkadot/v1.4.0-a3e4fce9d3f 
     37 "Parity Polkadot/v1.6.0-481165d9229 
     36 "Parity Polkadot/v0.9.41-e203bfb396e 
     34 "Parity Polkadot/v1.3.0-7c9fd83805c 
     19 "Parity Polkadot/v1.0.0-1ed6e2e50a4 
     13 "Parity Polkadot/v1.6.0-85191e94b5e 
      9 "Parity Polkadot/v1.4.0-00395a23d12 
      7 "Parity Polkadot/v1.1.0-f60318f6868 
      7 "Parity Polkadot/v0.9.42-9b1fc27cec4 
      3 "Parity Polkadot/v1.0.0-c9ec8c5a159 
      3 "Parity Polkadot/v0.9.43-ba42b9ce51d 
      2 "Parity Polkadot/v1.7.0-unknown 
      2 "Parity Polkadot/v1.6.0-unknown 
      2 "Parity Polkadot/v1.6.0-ff2e7dbd355 
      2 "Parity Polkadot/v1.6.0-c415f8f0bd6 
      2 "Parity Polkadot/v0.9.40-a2b62fb872b 
      2 "Parity Polkadot/v0.9.32-c71e872afed 
      1 "Parity Polkadot/v1.6.0-b8f55d1b769 
      1 "Parity Polkadot/v1.6.0-481165d 
      1 "Parity Polkadot/v1.5.0-e5b2adac7ad 
      1 "Parity Polkadot/v1.5.0-64d52f225e1 
      1 "Parity Polkadot/v1.2.0-930cabbe271 
      1 "Parity Polkadot/v1.2.0-72c45356393 
      1 "Parity Polkadot/v0.9.43-52209dcfe54 
      1 "Parity Polkadot/v0.9.38-909567fe6da 
      1 "Parity Polkadot/v0.9.31-32dd0c9cfcd 
      1 "Parity Polkadot/v0.9.30-064536093f5 
      1 "Parity Polkadot/v0.9.25-5174e9ae75b 

Here is the script I used:

# crawl the Kusama DHT and capture the crawl log (it contains each peer's agent string)
nebula crawl --network KUSAMA --json-out /tmp/out > kusama.txt
# keep only the agent-version strings, count duplicates and sort by count, descending
cat kusama.txt | grep Agent | cut -d = -f3- | grep "Parity Polkadot" | cut -d'(' -f1 | sort | uniq -c | sort -bgr
alexggh commented 4 months ago

Here is what I was able to extract from the DHT with nebula:

Thank you. The numbers seem quite different from what I see on telemetry, any idea why?

Screenshot 2024-02-14 at 15 28 26

I see a lot more 1.3.0, 0.9.43 and 0.9.42 there, any idea where the difference is coming from?

sandreim commented 4 months ago

Sounds like you are proposing to do hard forks :P Maybe a little bit softer than ETH hard forks, but yeah :P

There shouldn't be any forks :P

People should already be getting fewer rewards, and apparently they don't care.

I think they do care more if they get kicked out of the active set. Also, we won't waste valuable time investigating issues caused by ancient validators, which has happened in a few instances.

BulatSaif commented 4 months ago

I see a lot more 1.3.0, 0.9.43 and 0.9.42 there, any idea where the difference is coming from?

I guess those are parachain nodes with an embedded relay-chain node; they are usually not configured for validation and have a random name. Also, their p2p port is not open, so Nebula is not able to connect and check the version. The nodes I listed all have their p2p port open to the internet.

I forgot to add the report summary:

crawlDuration=6m57.330136797s crawledPeers=9500 dialablePeers=2261 undialablePeers=7239

7239 nodes have their p2p port closed and may be running any version.

FYI, there are multiple telemetry instances: https://telemetry.w3f.community, https://telemetry.polkadot.io/

alexggh commented 4 months ago

Parachain block production is starting to return to its pre-2024-02-12 18:00 UTC values:

The parachain block production average

Screenshot 2024-02-14 at 16 10 23

AssetHub

Screenshot 2024-02-14 at 16 06 44
alexggh commented 4 months ago

FYI, there are multiple telemetry instances: https://telemetry.w3f.community/, https://telemetry.polkadot.io/

Any idea which one gives us the right picture?

lovelaced commented 4 months ago

FYI, there are multiple telemetry instances: https://telemetry.w3f.community/, https://telemetry.polkadot.io/

Any idea which one gives us the right picture?

A combination of them and also none of them; the w3f telemetry is for the 1kv program, and the polkadot.io one is for validators who are connected to telemetry. Telemetry is on by default, but plenty of validators do not run with public telemetry enabled, so it's not possible to get a complete picture this way.

bkchr commented 4 months ago

I think they do care more if they get kicked out of the active set. Also, we won't waste valuable time investigating issues caused by ancient validators, which has happened in a few instances.

As already discussed in Element, we should change the rewards: decrease the rewards per block built to a minimum and put most of the rewards into doing useful work for the parachain consensus.

The tool for finding out what version a validator is running could also be used as a data source for informing nominators about what kind of validators to nominate.

JelliedOwl commented 4 months ago

I see a lot more 1.3.0, 0.9.43 and 0.9.42 there, any idea where the difference is coming from?

I guess those are parachain nodes with an embedded relay-chain node; they are usually not configured for validation and have a random name. Also, their p2p port is not open, so Nebula is not able to connect and check the version. The nodes I listed all have their p2p port open to the internet.

There's definitely some of this going on. For example one of my parachain collators shows up on telemetry as "Parity Polkadot v0.9.28-something". That one definitely isn't validating.

(And, yes, an upgrade is long overdue, and I worry about the chain stalling every time Kusama upgrades...)

bkchr commented 4 months ago

(And, yes, an upgrade is long overdue, and I worry about the chain stalling every time Kusama upgrades...)

Generally we are binary compatible back to Kusama launch. Also most of the networking protocols are still in place, so you don't need to worry that much ;)

JelliedOwl commented 4 months ago

Generally we are binary compatible back to Kusama launch. Also most of the networking protocols are still in place, so you don't need to worry that much ;)

It's more a nagging concern at the back of my mind than a full-on worry. I do hope to find the time / funding to upgrade it though. :-)

[Edit: sorry BulatSaif already said this.] Oh, and I think @alexggh might have been asking why there are far fewer nodes on telemetry than in the P2P-derived stats. Being on telemetry is optional so many nodes aren't on it (including 3 of mine that I thought were). There's also the 1KV program telemetry that probably has a list of nodes, only some of which are on the main telemetry. https://telemetry.w3f.community/#list/0xb0a8d493285c2df73290dfb7e61f870f17b41801197a149ca93654499ea3dafe

alexggh commented 4 months ago

Unfortunately, the situation seems to have drastically deteriorated overnight around 2024-02-15 22:00 UTC; the dashboards look even worse now.

Screenshot 2024-02-15 at 07 27 45

Runtime API requests jumped by 20%.

Screenshot 2024-02-15 at 07 28 13 Screenshot 2024-02-15 at 07 28 39

The only thing I see in the logs is that around that time we got some disputes, but they seem to have concluded. Around the same time, this error was output during the dispute, then stopped.

2024-02-14  cannot query the runtime API version: Api called for an unknown Block: State already discarded for 0xda8c113bdaf76c305020386f430bc49ff6edcd28bd38c84ff1f04d7b7e67c1cb

Looking here: https://apps.turboflakes.io/?chain=kusama#/insights?mode=history, it seems that around that time a bunch of validators started getting very low reward rates (0) and they keep getting them for follow-up sessions, so it feels like something is getting them into a bad state.

bkchr commented 4 months ago

but they seem to have concluded. Around the same time, this error was output during the dispute, then stopped.

Do you know in which release we included the fix to prevent disputes being raised for candidates in finalized blocks?

alexggh commented 4 months ago

but they seem to have concluded. Around the same time, this error was output during the dispute, then stopped.

Do you know in which release we included the fix to prevent disputes being raised for candidates in finalized blocks?

Don't know which fix you are referring to.

tdimitrov commented 4 months ago

@bkchr https://github.com/paritytech/polkadot-sdk/pull/1358 -> this one? If I'm not mistaken it's included since 1.0.0

bkchr commented 4 months ago

Yeah this one and yeah it got included in 1.1.

The log lines of state being discarded for the dispute reminded me of this. We should probably still figure out why the state wasn't available anymore as this should not happen.

Polkadot-Forum commented 4 months ago

This issue has been mentioned on Polkadot Forum. There might be relevant details there:

https://forum.polkadot.network/t/polkadot-digest-15-feb-2024/6218/1

eskimor commented 4 months ago

@BulatSaif What is nebula and where can I find it?

BulatSaif commented 4 months ago

@BulatSaif What is nebula and where can I find it?

It is a DHT crawler: https://github.com/dennis-tra/nebula

alexggh commented 4 months ago

Dumping here the info I gathered from other channels:

2024-02-14 22:09:18.566  INFO tokio-runtime-worker substrate: ✨ Imported #21879566 (0xbac6…48c7)
2024-02-14 22:09:19.502  INFO tokio-runtime-worker substrate: 💤 Idle (45 peers), best: #21879566 (0xbac6…48c7), finalized #21879485 (0xe1ee…cfb5), ⬇ 2.2MiB/s ⬆ 3.8MiB/s
2024-02-14 22:09:20.262  WARN tokio-runtime-worker peerset: Report 12D3KooWEPzR9AE8jAueVuh9NpTAQ4B7cvmudAAxBM1UQKpnmqHL: -977470000 to -2085764279. Reason: Aggregated reputation change. Banned, disconnecting.
2024-02-14 22:09:20.262  WARN tokio-runtime-worker peerset: Report 12D3KooWSVKg24GWMK4abZjf9145z6Tm7rYwjxUbW3FihvmvgCEo: -1044045000 to -2147483648. Reason: Aggregated reputation change. Banned, disconnecting.
2024-02-14 22:09:20.262  WARN tokio-runtime-worker peerset: Report 12D3KooWDBXkSCoCkqkytX239n6WEdCmHnDdx573Yhp4pJPXjWKr: -858180000 to -1843670626. Reason: Aggregated reputation change. Banned, disconnecting.
2024-02-14 22:09:20.262  WARN tokio-runtime-worker peerset: Report 12D3KooWS6TEWKr75a1iAYGBQinaCJ4zZf8EAPc1kK5Q2QW5ccpN: -1289735000 to -2147483648. Reason: Aggregated reputation change. Banned, disconnecting.
2024-02-14 22:09:20.360  WARN tokio-runtime-worker peerset: Report 12D3KooWDBXkSCoCkqkytX239n6WEdCmHnDdx573Yhp4pJPXjWKr: -200000 to -1843870626. Reason: Aggregated reputation change. Banned, disconnecting.
2024-02-14 22:09:20.360  WARN tokio-runtime-worker peerset: Report 12D3KooWS6TEWKr75a1iAYGBQinaCJ4zZf8EAPc1kK5Q2QW5ccpN: +1200000 to -2146283648. Reason: Aggregated reputation change. Banned, disconnecting.
2024-02-14 22:09:20.360  WARN tokio-runtime-worker peerset: Report 12D3KooWSVKg24GWMK4abZjf9145z6Tm7rYwjxUbW3FihvmvgCEo: -300000 to -2147483648. Reason: Aggregated reputation change. Banned, disconnecting.
2024-02-14 22:09:20.360  WARN tokio-runtime-worker peerset: Report 12D3KooWEPzR9AE8jAueVuh9NpTAQ4B7cvmudAAxBM1UQKpnmqHL: +800000 to -2084964279. Reason: Aggregated reputation change. Banned, disconnecting.

One of the nodes is reporting disconnects around the time the dispute started and we have a finality lag. @bkchr @altonen, is there a chance this would make the node be segregated from those peers forever, or shouldn't the disconnect persist across multiple sessions?

altonen commented 4 months ago

If it keeps offending and stays banned, then potentially yes. Banning shouldn't affect reserved peers though

cc @dmitry-markin

alexggh commented 4 months ago

If it keeps offending and stays banned, then potentially yes.

It doesn't keep offending.

Banning shouldn't affect reserved peers though

What if it wasn't in the active set at the time of the offence?

altonen commented 4 months ago

It can still keep offending over other protocols. We've seen this on Westend/Rococo before where the nodes kept sending duplicate block requests and each duplicate request decreased the peer's reputation by BANNED_THRESHOLD. Eventually the reputation will reach i32::MIN and it takes ~18 minutes to reach 0 if the peer does nothing else in the meantime.

If the peer is banned when you call set_reserved_peers(), ProtocolController won't open a connection to them until the reputation has had time to decay below the ban threshold (source). It will also reject any inbound substream from a reserved peer if the peer is banned. This could be modified to ignore reputation for reserved peers.

alexggh commented 4 months ago

It can still keep offending over other protocols. We've seen this on Westend/Rococo before where the nodes kept sending duplicate block requests and each duplicate request decreased the peer's reputation by BANNED_THRESHOLD. Eventually the reputation will reach i32::MIN and it takes ~18 minutes to reach 0 if the peer does nothing else in the meantime.

Ok, so I guess that's a red herring then, it should have recovered by itself if that was the trigger.

altonen commented 4 months ago

Would be interesting to know why it's getting banned but "Aggregated reputation change" is opaque. The banning mechanism can prevent a node from opening a substream to a reserved peer so that could have something to do with this issue. Is there any indication that there are fewer connections than there should be?
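(For anyone who wants to sanity-check the connection count on their own node: the peer count is exposed over the node's local JSON-RPC, roughly as sketched below, assuming the default port; system_peers is an unsafe method, so it is only reachable where unsafe RPCs are enabled.)

# overall health: peer count and sync status
curl -s -H "Content-Type: application/json" \
  -d '{"id":1,"jsonrpc":"2.0","method":"system_health","params":[]}' \
  http://localhost:9944/

# per-peer details (peer id, roles, best block); unsafe, local use only
curl -s -H "Content-Type: application/json" \
  -d '{"id":2,"jsonrpc":"2.0","method":"system_peers","params":[]}' \
  http://localhost:9944/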

alexggh commented 4 months ago

Would be interesting to know why it's getting banned but "Aggregated reputation change" is opaque.

Around that time there is a dispute which takes a lot of blocks to resolve, so we have a finality lag; nodes start resending their messages and end up with duplicates, hence probably why we got some bans.

The banning mechanism can prevent a node from opening a substream to a reserved peer so that could have something to do with this issue. Is there any indication that there are fewer connections than there should be?

This is what our validator connections look like over 7 days; I would say nothing obvious.

Screenshot 2024-02-15 at 12 32 22
alexggh commented 4 months ago

On our validators I also see, a lot of the time, the connected ratio sitting between 90% and 95%; not sure if it is really relevant.

Connectivity Report connected_ratio=94 absolute_connected=992 absolute_resolved=1050 unconnected_authorities=

alexggh commented 4 months ago

Looking at this validator https://apps.turboflakes.io/?chain=kusama#/validator/DGiLC1uvohfgHiirDtsr5HJRYXUFy9NdXj6YC6azcaRpvJZ?mode=live during session 36492, where the group manages to back just 3-4 candidates over an entire session, what we can see in the logs is:

The node receives candidates from the collator and issues a lot of Seconded statements, but manages to receive Valid statements only 3 times from one of its peers, so only 4 blocks get backed by this group. Logs from our node (ValidatorIndex(145)): https://grafana.teleport.parity.io/goto/5c76gwhSR?orgId=1.

This leads me to think there are two possible problems:

  1. Either we have a timing issue and we are not managing to gather 2 votes out of the 3 validators fast enough.
  2. There is something broken/off with the statement distribution v2.

Neither of these two options explains why this degraded at around 2024-02-15 22:00 UTC, but they might stack together with a few bad validators.

JelliedOwl commented 4 months ago

Related to #3345 and this issue, there was another stall:

Screenshot 2024-02-16 at 00 11 38

In case it's useful, I grabbed the RPC grandpa_roundState while it was stalled. We seem to have enough prevotes, but I don't seem to be able to get anything other than 100% missing for precommits. I'm not sure what the precommits state looks like when it's not stalled - sorry if this is useless data. [EDIT: Looking at Polkadot, I guess that's all normal - though it raises the (stupid?) question why we appear to have an over-threshold set of prevotes and yet no finality...]
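(For reference, the round state below comes from the grandpa_roundState JSON-RPC method; a raw call along these lines against a node's RPC endpoint returns the same data, the endpoint here being just a placeholder:)

# local view of the current GRANDPA round: prevote/precommit weights and missing voters
curl -s -H "Content-Type: application/json" \
  -d '{"id":1,"jsonrpc":"2.0","method":"grandpa_roundState","params":[]}' \
  https://kusama-rpc.example.com/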

{
  setId: 8,516
  best: {
    round: 4,974
    totalWeight: 999
    thresholdWeight: 667
    prevotes: {
      currentWeight: 970
      missing: [
        D2s6M2YDiDTkTEYxbBCELNqWdx9wbMD1xm3G6bD9vhNUAtu
        DB1YPb5FdVbD54ERmdDFvqGguk4SHrkXPkyQNBKdK9oCE2Y
        DMCBMHJyPx2N3ETxfBiWEqHiSmeLKmCxvJ4tLjqfz2szCGb
        DNusKMU2gEc5KMDtkKyDJ9Khds6zjWbYdyHx9QLXztWyo1j
        DPtfJfrQRuZ3qwLpjpunbXREC7kpcBKT1TzHJLnuYo54TZD
        DYirgQtNScc4MR3Ws1yB2WkpuPE43gJLT7JkeQPUNi7Jzen
        E1gB9FmaEXsNcsww1Uk4nVcqeUsodF2UYuaoS7stEatFgyR
        E8GRYNYmn5h62zXLCSL2PpUKac8wqsHHk9fAR4yMtVDY45h
        EAzhp5opDMKd4soCLUCNR6D8nzNvZ6JhVnZqJEJxgLkZ89s
        EQzNL6xpCmqNng2JZLe4mWftw5Z738dQVi82fxufSPgRYso
        EaZgnkE7GTTkc21LB39nT6t8qmRwNHzLPQBWz9jSKq8Ljyi
        EcDskGTfZBh5WgevKHEKkwLA1gn56Tu2q9XnMPfYwRkakiL
        FnZxwunfj7pF2xFCCtw32U5fG6U3AR3pUs8LYqVYrYK3rCG
        FoZqSiTVDxHY2BGt6jysDbr7Gc1z2D5McYiYY3CucUJBzdK
        FubJibGUkB3pWA73rS3dZ5u93PgZFFrzU1HKtpVH6z312BV
        GC7J5SmRBPChrv31gMpSjPTcPW9R8o6zqdAiEJaANrkvy9a
        GTzLcrLiQuWXSx6ourSXebxzzRymT9JAmLpMAHigSLQ7Est
        GYij6Xj2nuVurXM7ZZkdyU4PaxHdL5aP2jR9ZaDYdjjkrpG
        Gg51DXmpkcGm1MxCtub4QXtcppuRc276MStbyR2y9CT6G4Q
        Gwq2q1QLPoZ3ZioZj9mFcqHn9xpJSb4x2oaWb3HAQNVdPiT
        H7ZgQWGHV8n8NTjTvKqL4A6mdUzi7J2HiHXecWZZHWz2ERU
        H89W7zWmXQchJCrKMgE2BNSC8WEKbxJJDu6TBA1jtWAZ5dH
        HFvPuWWqDMQdUwGb9FU35c4oapL1rXBCw6WYG1AJjgTmjcU
        HU9uk6BMMfWv5fTpaNvguB4ekBgDrpt1ZYKfETC6DY4MET9
        HeSDGCQsfbe7fEbyQi8QRQdJwXW2WpV5z4siSP8KUKizHat
        HeWr4Xe6cY4QSrYj88jagJ7ZhS7mbSDqeD3JLzZQW1h5mrE
        HrhbuEw136yEE5MdAcqJiRfnY8JQXsASzoR4taaSUiAB1zS
        Hv82GzBTtZMpv3UUGqKWp1tm4eV4LsuYmeZaECUBAzxmSY3
        J2HjZ2oJmUC5HCX5JMyKuAXPrqabfDSVdutZNeTwWnxue7F
      ]
    }
    precommits: {
      currentWeight: 0
      missing: [
        CasJibxfsmQt5h3MzGyASUyjEw2LS9nDAYgBLDgG3vp4vR9
        Cb2E88hAt5Sphzh9BduDfoip23D78c44eyhgjyGtSoTJRig
        <snip list of all authorities >
ordian commented 4 months ago

Looking at the recent disputes, they seem to have been initiated by the validator "Activator": https://apps.turboflakes.io/?chain=kusama#/validator/E5hWebLmPSFSUq3nSfGuFYopAK65fooqiXV91gQwuZGuMad.

According to https://telemetry.w3f.community/#list/0xb0a8d493285c2df73290dfb7e61f870f17b41801197a149ca93654499ea3dafe, it is running 1.7.0 (although it could have upgraded during the night).

The candidate being disputed seems very heavy:

> ./target/release/kuddelmuddel validate-candidate --candidate-hash 0x64fe5fd71e4eaea9a560dff99e7fd948b3814e00d006076d8c09201ae96e27fe --network kusama --rpc-url wss://kusama-rpc.polkadot.io:443
Successfully fetched PoV for 0x64fe5fd71e4eaea9a560dff99e7fd948b3814e00d006076d8c09201ae96e27fe, para_id=2024
Fetching Pvf 0xaa3fe248257d50a7372520a6854a905309578b2b29e8d2cad4cafb7d70bb5cdf
PoV size: 7554kb
Pvf prechecking...
Pvf preparation took 3574ms
Pvf execution...
Execution took 1736ms

7.5 MB of uncompressed PoV and 1.7 s of execution time on an M1 Max, from the Genshiro parachain.

Polkadot-Forum commented 4 months ago

This issue has been mentioned on Polkadot Forum. There might be relevant details there:

https://forum.polkadot.network/t/polkadot-digest-16-feb-2024/6259/1

Overkillus commented 4 months ago

Issue resulted in a finality stall on Kusama further described here: https://github.com/paritytech/polkadot-sdk/issues/3345#issuecomment-1948811835

alexggh commented 4 months ago

Some updates: the dispute and finality lag were solved in https://github.com/paritytech/polkadot-sdk/issues/3345. In the absence of disputes we can still see some bad backing groups that aren't able to back any candidates, so each parachain scheduled on such a group is not going to produce any blocks for the entire group rotation (1 min).

Reasons for getting bad backing groups (in order of impact)

  1. There are still validators that do not support the async backing protocol. This is visible from the fact that they do not get along with the other validators in the set, and only a few blocks get produced when they are in the same backing group. The biggest entities I identified in this category are Jaco and ZKValidator.

  2. Around the time this incident started (2024-02-12 18:00 UTC) and subsequently on 2024-02-15 22:00 UTC, there are also a few validators that were producing candidates at full speed and then went to 0, and they keep getting 0 backing points across multiple sessions. These validators can be identified in https://apps.turboflakes.io/?chain=kusama#/ when you look at the history: they were green, then went red and kept the red status. It is not clear to me what got these validators into this state, but a restart should probably help to restore them to the good state. Coincidentally, we should probably see them back online once people upgrade to the 1.7.1 version that was released to fix https://github.com/paritytech/polkadot-sdk/issues/3345.

  3. Backing group size is 3 and you need 2 out of 3 votes to produce candidates, so just 2 bad validators in a backing group (or validators that can't talk to each other) are enough. Compare this with Polkadot, where the group size is 5, so you would need 4 bad validators in the same group; a rough estimate of the difference is sketched below.
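As a rough back-of-the-envelope illustration of point 3 (my own numbers, assuming the non-backing validators are spread uniformly at random and taking p ≈ 5%, in line with the ~17/300 estimate above):

P(at least 2 bad in a group of 3) = 3*p^2*(1-p) + p^3 ≈ 0.7%
P(at least 4 bad in a group of 5) = 5*p^4*(1-p) + p^5 ≈ 0.003%

So for the same share of bad validators, a 3-validator group is orders of magnitude more likely to end up unable to back anything than a 5-validator group.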

Next steps

alexggh commented 4 months ago

Thanks to @lexnv in https://github.com/paritytech/data/issues/30 I managed to get a few more validators that are running incompatible versions, around 4% (with 895 out of 1000 validators discovered).

Parity Polkadot/v0.9.41-e203bfb396e (teufel-00-03)
Parity Polkadot/v0.9.41-e203bfb396e (teufel-00-04)
Parity Polkadot/v0.9.41-e203bfb396e (teufel-00-06)
Parity Polkadot/v0.9.41-e203bfb396e (teufel-00-08)
Parity Polkadot/v0.9.41-e203bfb396e (teufel-00-09)
Parity Polkadot/v0.9.41-e203bfb396e (teufel-00-12)
Parity Polkadot/v0.9.41-e203bfb396e (teufel-00-13)
Parity Polkadot/v0.9.41-e203bfb396e (teufel-00-14)
Parity Polkadot/v0.9.41-e203bfb396e (teufel-00-22)
Parity Polkadot/v0.9.41-e203bfb396e (teufel-00-23)
Parity Polkadot/v0.9.41-e203bfb396e (teufel-00-25)
Parity Polkadot/v0.9.41-e203bfb396e (teufel-00-27)
Parity Polkadot/v0.9.41-e203bfb396e (teufel-00-29)
Parity Polkadot/v0.9.41-e203bfb396e (teufel-09-33)
Parity Polkadot/v0.9.43-52209dcfe54 (Web3Italy)
Parity Polkadot/v1.0.0-1ed6e2e50a4 (Anonstake)
Parity Polkadot/v1.1.0-f60318f6868 (ZKValidator-3)
Parity Polkadot/v1.1.0-f60318f6868 (ZKValidator-4)
Parity Polkadot/v1.2.0-72c45356393 (kusama-web3-val54)
Parity Polkadot/v1.3.0-7c9fd83805c (kusama-web3-val41)
Parity Polkadot/v1.3.0-7c9fd83805c (kusama-web3-val42)
Parity Polkadot/v1.3.0-7c9fd83805c (kusama-web3-val43)
Parity Polkadot/v1.3.0-7c9fd83805c (kusama-web3-val44)
Parity Polkadot/v1.3.0-7c9fd83805c (kusama-web3-val45)
Parity Polkadot/v1.3.0-7c9fd83805c (kusama-web3-val46)
Parity Polkadot/v1.3.0-7c9fd83805c (kusama-web3-val47)
Parity Polkadot/v1.3.0-7c9fd83805c (kusama-web3-val48)
Parity Polkadot/v1.3.0-7c9fd83805c (kusama-web3-val49)
Parity Polkadot/v1.3.0-7c9fd83805c (kusama-web3-val50)
Parity Polkadot/v1.3.0-7c9fd83805c (kusama-web3-val51)
Parity Polkadot/v1.3.0-7c9fd83805c (kusama-web3-val52)
Parity Polkadot/v1.3.0-7c9fd83805c (kusama-web3-val53)
Parity Polkadot/v1.3.0-7c9fd83805c (kusama-web3-val55)
Parity Polkadot/v1.3.0-7c9fd83805c (kusama-web3-val56)
Parity Polkadot/v1.3.0-7c9fd83805c (kusama-web3-val57)
Parity Polkadot/v1.3.0-7c9fd83805c (kusama-web3-val58)
Parity Polkadot/v1.3.0-7c9fd83805c (kusama-web3-val59)
Parity Polkadot/v1.3.0-7c9fd83805c (kusama-web3-val60)

Besides these nodes not being able to back anything, when they are block authors they won't be able to include any backed candidates, because they wouldn't have received the statements.

alexggh commented 4 months ago

Another round of updates. I did not have much luck finding the root cause for point 2) above (https://github.com/paritytech/polkadot-sdk/issues/3314#issuecomment-1953542045). I ran a series of test rounds on Versi with a network of 60 validators, 20 parachains and a validator group size of 3, with the network in an ideal state (all validators running the latest polkadot version, all nodes discoverable by their peers). With that, the results show parachains producing blocks at a constant rate of around 12s. Kusama's average parachain block time is right now around 14.5s, and before async backing was enabled it was around 13.5s, so I think we will get close to that once validators with incompatible versions upgrade.

Tested/simulated scenarios (60 validators, 20 parachains, validator group size of 3) all produce good results.

The next steps for me here are:

eskimor commented 4 months ago

Thank you @alexggh for this again very thorough investigation!

Polkadot-Forum commented 4 months ago

This issue has been mentioned on Polkadot Forum. There might be relevant details there:

https://forum.polkadot.network/t/finality-stall-on-kusama-15-02-2024-post-mortem/6398/1

alexggh commented 4 months ago

Some information from kusama-lounge, not sure if it is relevant, but it looks like the behaviour from point 2) above: https://matrix.to/#/!LhjZccBOqFNYKLdmbb:polkadot.builders/$170853974921ffrhb:substrate.systems?via=parity.io&via=matrix.org&via=corepaper.org.

https://matrix.to/#/!LhjZccBOqFNYKLdmbb:polkadot.builders/$170854084124BfAhA:substrate.systems?via=parity.io&via=matrix.org&via=corepaper.org

A few people are reporting that, after updating, some validators get this error and only a session key rotation seems to make the validators participate in backing again.

Logs:

2024-02-19 00:27:55 👶 New epoch 37039 launching at block 0xd378…da99 (block slot 284717079 >= start slot 284717079).    
2024-02-19 00:27:55 👶 Next epoch starts at slot 284717679    
2024-02-19 00:27:55 Trying to remove unknown reserved node 12D3KooWRUoMPdANg2WwhspbAs7KuGYevZ3KQ4mTFwAcXDEogzMH from SetId(3).    
2024-02-19 00:28:12 👴 Applying authority set change scheduled at block #21938194    
2024-02-19 00:28:12 👴 Applying GRANDPA set change to new set with 999 authorities    
2024-02-19 00:36:19 cannot query the runtime API version: Api called for an unknown Block: State already discarded for 0x3ada7d3741bc67bd2be011f5d2d821c3913a115a65eac198f016ffecee40864a
2024-02-19 00:44:10 Report 12D3KooWCaXyaqB2bUbNthAoSZYhHY1Nxw4sqpbrrqBhuK4xqznE: -2147483648 to -2147483648. Reason: Same block request multiple times. Banned, disconnecting.    
2024-02-19 00:44:42 Report 12D3KooWCaXyaqB2bUbNthAoSZYhHY1Nxw4sqpbrrqBhuK4xqznE: -2147483648 to -2147483648. Reason: Same block request multiple times. Banned, disconnecting.    
2024-02-19 00:45:15 Report 12D3KooWCaXyaqB2bUbNthAoSZYhHY1Nxw4sqpbrrqBhuK4xqznE: -2147483648 to -2147483648. Reason: Same block request multiple times. Banned, disconnecting.    
2024-02-19 00:49:18 🙌 Starting consensus session on top of parent 0x600de1448b5f2cbca23dfb0821f89db3c6a6c4799936a3d71fcca7284f67c47d    
2024-02-19 00:49:18 🎁 Prepared block for proposing at 21938407 (52 ms) [hash: 0x734b4e844d5c2d62e4c6825bd51e8ca0b86f4c2c2e3fa524c327f789b8327215; parent_hash: 0x600d…c47d; extrinsics (7): [0x2d4a…8d4f, 0xa7d8…5cd7, 0x3010…cb40, 0xdd78…c1ce, 0x87d4…ca5e, 0x19c7…3e86, 0xcfc6…cfef]    
2024-02-19 00:49:18 🔖 Pre-sealed block for proposal at 21938407. Hash now 0x67df60736f61d38936a85d03857e74f6142c57b7a5d5cb42f0726be92c2945e3, previously 0x734b4e844d5c2d62e4c6825bd51e8ca0b86f4c2c2e3fa524c327f789b8327215.    
2024-02-19 00:52:05 cannot query the runtime API version: Api called for an unknown Block: State already discarded for 0xe271f51a95e6ac1d15e1a62d72aefdb988659f39b9e950b43fe23a3af105a5e8
2024-02-19 01:03:27 cannot query the runtime API version: Api called for an unknown Block: State already discarded for 0x880e2f445439041dd0c65593bcbdf2791013c7dd871cbfbf338ce4e011e67e84
2024-02-19 01:03:27 cannot query the runtime API version: Api called for an unknown Block: State already discarded for 0x880e2f445439041dd0c65593bcbdf2791013c7dd871cbfbf338ce4e011e67e84
2024-02-19 01:05:24 cannot query the runtime API version: Api called for an unknown Block: State already discarded for 0x0a55be09a8fef7506feb4bd3aaa5f84de7d84c501d7bb23c27ced60b83fa67aa
2024-02-19 01:07:41 cannot query the runtime API version: Api called for an unknown Block: State already discarded for 0xffa2f86b19edee4df0732f272dbb6c4fe14049919872be6123088c662071d70c
2024-02-19 01:13:01 cannot query the runtime API version: Api called for an unknown Block: State already discarded for 0xf3607cd751bbda2db7d167e9d80123a3ee42da61b4a275e1d2f174cb47bc9760
2024-02-19 01:14:25 cannot query the runtime API version: Api called for an unknown Block: State already discarded for 0x2327477965ecedd661972befbcace1b9154159e0e6956d51eea55acecf366c05
2024-02-19 01:27:10 Report 12D3KooWCaXyaqB2bUbNthAoSZYhHY1Nxw4sqpbrrqBhuK4xqznE: -2147483648 to -2147483648. Reason: Same block request multiple times. Banned, disconnecting.

Quotes from the Validator Lounge on Element.

Please pay attention to these log messages cannot query the runtime API version: Api called for an unknown Block: State already discarded for <hash>, 
yesterday I've updated some validators from 1.7.0 to 1.7.1 and faced this issue again. 
Such messages appear in logs when validator does not work correctly, and after 
rotate_keys, set_keys they disappear from logs and validator starts to participate 
in candidate backing normally. 

I assume that misbehaving validator is trying to pick these blocks as relay_parent for candidate backing.
I've checked and these discarded blocks existed at some point of time, but they were not included to blockchain. 

Some RPC endpoints even return data on these blocks, for example
curl -XPOST -H "content-type: application/json" -d '{"id":178,"jsonrpc":"2.0","method":"chain_getBlock","params":["0x297fb7b7aacf389098f2618ca5387f1c5a30e7726ad5e00b4e1c056baa9d1ec0"]}' https://rpc-kusama.luckyfriday.io/
{"jsonrpc":"2.0","result":{"block":{"header":{"parentHash":"0xb8abc6060377cb8e46daa72291363df8f045633043df7e07de262a0c2fde962f","number":"0x14f97c4"...

curl -XPOST -H "content-type: application/json" -d '{"id":178,"jsonrpc":"2.0","method":"chain_getBlockHash","params":[21993412]}' https://rpc-kusama.luckyfriday.io/
{"jsonrpc":"2.0","result":"0x967480ca771c6ef2b5f3a9c7aabb9fa623996f5c204fb815a2fbe0afe27abef1","id":178}

but block with the same height has another hash. 
No idea why validator picks wrong block as relay_parent and moreover, no idea why issue fixed after key rotation
Happened since 1.5.0. Something like 4/5 times over 12 validators.
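For reference, the rotate_keys / set_keys sequence the operators mention boils down to roughly the following; this is only a sketch, assuming the node's default local RPC port and unsafe RPC methods reachable from localhost:

# generate a fresh set of session keys in the node's keystore; returns them as one hex blob
# (author_rotateKeys is an unsafe RPC, so it has to be called on the validator machine itself)
curl -s -H "Content-Type: application/json" \
  -d '{"id":1,"jsonrpc":"2.0","method":"author_rotateKeys","params":[]}' \
  http://localhost:9944/
# the returned hex string is then registered on-chain with a session.setKeys extrinsic
# (e.g. via polkadot-js apps), signed by the validator's staking account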
burdges commented 4 months ago

We should really drop outdated hosts before updates that require host updates. In principle, we could just record the minimum required and recommended host versions on-chain, make validators warn when behind the recommended version, and make them shut down when behind the required one. This doesn't solve the immediate problem however, so..

We could add host version into their NPoS candidacy, so then other validators would drop them in NPoS. I doubt anything more complex would be warranted.

It's perfectly reasonable Byzantine behavior to run an old host and lie about the version number though, so this probably does not solve the whole problem.

sandreim commented 4 months ago

This is fairly similar to what I had in mind and have been discussing offline, but @bkchr didn't fully support it at the time. These ancient hosts are clearly not getting rewards in the current situation, so they still have an incentive to upgrade. The only downside in this specific case is that they can degrade the network performance, which sounds to me like we need some form of punishment to prevent it rather than relying only on positive rewards.

bkchr commented 4 months ago

Yeah, generally I'm not fully on board yet with reporting arbitrary versions to the runtime. I mean, the node can just report whatever the runtime expects it to report. We would also need to standardize these versions, as other node implementors would need to report the same version when they have implemented the same features, etc.

I opened an issue for reworking the reward points distribution: https://github.com/polkadot-fellows/runtimes/issues/214

If this isn't working, I'm open to what you both proposed here.