paritytech / polkadot-sdk

The Parity Polkadot Blockchain SDK
https://polkadot.network/

Node massively misses paravalidation votes. #3613

Closed anignatev closed 6 months ago

anignatev commented 7 months ago

Is there an existing issue?

Experiencing problems? Have you tried our Stack Exchange first?

Description of bug

Starting from version 1.6.0, the node massively misses paravalidation votes. The problem is present in versions 1.6.0, 1.7.0, 1.7.1 and 1.8.0. We returned to version 1.5.0 and everything works correctly.

Attached are the logs of the validator node: 18:36 - the node became active, version 1.8.0 was running; 18:44 - we returned to version 1.5.0.


rustup show

Default host: x86_64-unknown-freebsd
rustup home: /root/.rustup

installed toolchains

stable-x86_64-unknown-freebsd (default)
nightly-x86_64-unknown-freebsd

installed targets for active toolchain

wasm32-unknown-unknown
x86_64-unknown-freebsd

active toolchain

stable-x86_64-unknown-freebsd (default)
rustc 1.76.0 (07dca489a 2024-02-04)

P.S. Other Linux-based members have also complained about a similar problem.

polkadot.log.gz

IMG_7557

Steps to reproduce

  1. Compile the version containing the bug from the sources.
  2. Launch the version and wait for your node to become active.
  3. Analyze the results of the paravalidation sessions with a statistics system (for example: apps.turboflakes.io).
sandreim commented 7 months ago

CC @alexggh

alexggh commented 7 months ago

Looking at the logs, it seems block-sync times are off: I see plenty of occasions where blocks are imported 1-2 s after their creation, which will make you miss statements because your node won't participate.

To give you a few examples, block 19797352 gets created at 07/03/2024, 18:38:48 and your node imports it at 18:38:58 (✨ Imported #19797353), so you definitely missed statements there. Same with 19797358: it gets created at 39:24 and your node imports it 3 seconds later at 39:27, so clearly you will miss statements in the next block.

After the restart and downgrade to 1.5 things seem to have improved (you can spot blocks being imported at seconds that divide by 6, which is close to block-creation time). I'm not sure if it is 1.5 itself that fixed it or simply the fact that you restarted.

The problem is in versions 1.6.0, 1.7.0, 1.7.1, 1.8.0

How did you reach this conclusion? Maybe it was just the restart that improved the situation? ~What version were you running in the logs, before downgrading to 1.5.0?~ 1.8.0

Logs

  2810,65: Mar  7 18:37:07 dt2 polkadot[88598]: 2024-03-07 18:37:06 ✨ Imported #19797335 (0x0186…ee4b)
  2845,71: Mar  7 18:38:07 dt2 polkadot[88598]: 2024-03-07 18:38:07 ♻️  Reorg on #19797335,0x0186…ee4b to #19797335,0x1b13…153e, common ancestor #19797334,0xd98d…a2b7
  2846,65: Mar  7 18:38:07 dt2 polkadot[88598]: 2024-03-07 18:38:07 ✨ Imported #19797335 (0x1b13…153e)
  2859,65: Mar  7 18:38:07 dt2 polkadot[88598]: 2024-03-07 18:38:07 ✨ Imported #19797336 (0xeccd…80a9)
  2874,65: Mar  7 18:38:08 dt2 polkadot[88598]: 2024-03-07 18:38:08 ✨ Imported #19797337 (0x378d…6357)
  2887,65: Mar  7 18:38:08 dt2 polkadot[88598]: 2024-03-07 18:38:08 ✨ Imported #19797337 (0x18ff…4c74)
  2900,65: Mar  7 18:38:08 dt2 polkadot[88598]: 2024-03-07 18:38:08 ✨ Imported #19797338 (0x5f95…59a1)
  2913,65: Mar  7 18:38:08 dt2 polkadot[88598]: 2024-03-07 18:38:08 ✨ Imported #19797338 (0x95bc…9caf)
  2926,65: Mar  7 18:38:09 dt2 polkadot[88598]: 2024-03-07 18:38:08 ✨ Imported #19797339 (0x4a80…f99a)
  2940,65: Mar  7 18:38:09 dt2 polkadot[88598]: 2024-03-07 18:38:09 ✨ Imported #19797340 (0xbe6c…3600)
  2952,65: Mar  7 18:38:09 dt2 polkadot[88598]: 2024-03-07 18:38:09 ✨ Imported #19797341 (0x5cb9…8039)
  2968,65: Mar  7 18:38:13 dt2 polkadot[88598]: 2024-03-07 18:38:13 ✨ Imported #19797346 (0x323f…2114)
  2979,71: Mar  7 18:38:13 dt2 polkadot[88598]: 2024-03-07 18:38:13 ♻️  Reorg on #19797346,0x323f…2114 to #19797346,0x47ac…8d52, common ancestor #19797345,0x25b3…d1a5
  2980,65: Mar  7 18:38:13 dt2 polkadot[88598]: 2024-03-07 18:38:13 ✨ Imported #19797346 (0x47ac…8d52)
  2981,65: Mar  7 18:38:13 dt2 polkadot[88598]: 2024-03-07 18:38:13 ✨ Imported #19797346 (0xd213…279d)
  2993,71: Mar  7 18:38:18 dt2 polkadot[88598]: 2024-03-07 18:38:18 ♻️  Reorg on #19797346,0x47ac…8d52 to #19797347,0xe363…4ca6, common ancestor #19797345,0x25b3…d1a5
  2994,65: Mar  7 18:38:18 dt2 polkadot[88598]: 2024-03-07 18:38:18 ✨ Imported #19797347 (0xe363…4ca6)

 .... snipped ....

  3051,65: Mar  7 18:38:42 dt2 polkadot[88598]: 2024-03-07 18:38:42 ✨ Imported #19797351 (0x70fa…6010)
  3059,65: Mar  7 18:38:58 dt2 polkadot[88598]: 2024-03-07 18:38:58 ✨ Imported #19797352 (0x8b53…c651)
  3062,65: Mar  7 18:38:58 dt2 polkadot[88598]: 2024-03-07 18:38:58 ✨ Imported #19797353 (0x5f6e…2d15)
  3065,65: Mar  7 18:39:00 dt2 polkadot[88598]: 2024-03-07 18:39:00 ✨ Imported #19797354 (0x7eaf…04db)
  3069,65: Mar  7 18:39:06 dt2 polkadot[88598]: 2024-03-07 18:39:06 ✨ Imported #19797355 (0x69a4…8428)
  3076,65: Mar  7 18:39:12 dt2 polkadot[88598]: 2024-03-07 18:39:12 ✨ Imported #19797356 (0xd45e…ae38)
  3083,65: Mar  7 18:39:19 dt2 polkadot[88598]: 2024-03-07 18:39:19 ✨ Imported #19797357 (0xfa3f…926e)
  3090,65: Mar  7 18:39:27 dt2 polkadot[88598]: 2024-03-07 18:39:27 ✨ Imported #19797358 (0x478a…c9a5)
  3095,65: Mar  7 18:39:30 dt2 polkadot[88598]: 2024-03-07 18:39:30 ✨ Imported #19797359 (0xa414…fed8)
  3100,65: Mar  7 18:39:36 dt2 polkadot[88598]: 2024-03-07 18:39:36 ✨ Imported #19797360 (0xe055…aeb9)
  3108,65: Mar  7 18:39:42 dt2 polkadot[88598]: 2024-03-07 18:39:42 ✨ Imported #19797361 (0xade9…1855)
  3113,65: Mar  7 18:39:42 dt2 polkadot[88598]: 2024-03-07 18:39:42 ✨ Imported #19797361 (0xe9ff…dcc2)
  3119,65: Mar  7 18:39:49 dt2 polkadot[88598]: 2024-03-07 18:39:48 ✨ Imported #19797362 (0x33be…a219)
  3129,65: Mar  7 18:40:05 dt2 polkadot[88598]: 2024-03-07 18:40:05 ✨ Imported #19797363 (0xe732…99c4)
  3134,71: Mar  7 18:40:05 dt2 polkadot[88598]: 2024-03-07 18:40:05 ♻️  Reorg on #19797363,0xe732…99c4 to #19797363,0x231e…3c60, common ancestor #19797362,0x33be…a219
  3135,65: Mar  7 18:40:05 dt2 polkadot[88598]: 2024-03-07 18:40:05 ✨ Imported #19797363 (0x231e…3c60)
  3140,65: Mar  7 18:40:05 dt2 polkadot[88598]: 2024-03-07 18:40:05 ✨ Imported #19797364 (0xfefa…794d)
  3146,65: Mar  7 18:40:14 dt2 polkadot[88598]: 2024-03-07 18:40:14 ✨ Imported #19797365 (0xa17b…2e81)
  3152,65: Mar  7 18:40:14 dt2 polkadot[88598]: 2024-03-07 18:40:14 ✨ Imported #19797366 (0x1092…344e)
  3158,65: Mar  7 18:40:20 dt2 polkadot[88598]: 2024-03-07 18:40:20 ✨ Imported #19797367 (0x7a44…7fef)
  3170,65: Mar  7 18:41:50 dt2 polkadot[88598]: 2024-03-07 18:41:50 ✨ Imported #19797368 (0xa70e…7b08)
  3173,71: Mar  7 18:41:51 dt2 polkadot[88598]: 2024-03-07 18:41:51 ♻️  Reorg on #19797368,0xa70e…7b08 to #19797368,0x1068…b2c9, common ancestor #19797367,0x7a44…7fef
  3174,65: Mar  7 18:41:51 dt2 polkadot[88598]: 2024-03-07 18:41:51 ✨ Imported #19797368 (0x1068…b2c9)
  3197,65: Mar  7 18:42:02 dt2 polkadot[88598]: 2024-03-07 18:42:02 ✨ Imported #19797383 (0xc816…d7d4)
  3198,65: Mar  7 18:42:02 dt2 polkadot[88598]: 2024-03-07 18:42:02 ✨ Imported #19797384 (0x8d4c…56df)
  3201,65: Mar  7 18:42:09 dt2 polkadot[88598]: 2024-03-07 18:42:09 ✨ Imported #19797385 (0x3835…8a04)
  3202,58: Mar  7 18:44:22 dt2 polkadot[59947]: 2024-03-07 18:44:22 Parity Polkadot
  3267,65: Mar  7 18:48:37 dt2 polkadot[59947]: 2024-03-07 18:48:36 ✨ Imported #19797443 (0x0a81…0a24)
  5450,65: Mar  7 18:48:42 dt2 polkadot[59947]: 2024-03-07 18:48:42 ✨ Imported #19797451 (0x0981…308e)
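For anyone who wants to run the same check on their own logs, here is a minimal, hedged sketch. It assumes raw syslog-style lines like the excerpt above (without the line-number prefixes), i.e. the node's own HH:MM:SS timestamp in field 7 and the block number in field 10; since slots are 6 seconds long, a consistently non-zero third column (seconds past the nearest 6-second boundary) is a rough hint that imports are running late:

    # print block number, import time and seconds past the slot boundary for every import
    grep 'Imported #' polkadot.log | awk '{ split($7, t, ":"); print $10, $7, (t[3] % 6) }'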
eskimor commented 7 months ago

Roping in Node SDK as well: @andresilva .

eskimor commented 7 months ago

Can we get more data on who is affected by this and what machines + operating systems + versions everybody affected is running? Also, what are the "similar" problems? How similar? It would be good if they could share their data in as much detail as possible as well.

eskimor commented 7 months ago

Also how does CPU/network usage compare between those versions for you?

anignatev commented 7 months ago

Looking at the logs, it seems block-sync times are off: I see plenty of occasions where blocks are imported 1-2 s after their creation, which will make you miss statements because your node won't participate.

To give you a few examples, block 19797352 gets created at 07/03/2024, 18:38:48 and your node imports it at 18:38:58 (✨ Imported #19797353), so you definitely missed statements there. Same with 19797358: it gets created at 39:24 and your node imports it 3 seconds later at 39:27, so clearly you will miss statements in the next block.

After the restart and downgrade to 1.5 things seem to have improved (you can spot blocks being imported at seconds that divide by 6, which is close to block-creation time). I'm not sure if it is 1.5 itself that fixed it or simply the fact that you restarted.

The problem is in versions 1.6.0, 1.7.0, 1.7.1, 1.8.0

How did you reach this conclusion? Maybe it was just the restart that improved the situation? ~What version were you running in the logs, before downgrading to 1.5.0?~ 1.8.0

Thanks for the quick reply; I apologize for not being able to reply right away.

I don't think the restart helped, because when this problem was first discovered (it was on February 2 on our Kusama node), I restarted version 1.6.0, but the node kept skipping votes. The problem only disappeared when I reverted to version 1.5.0. Logs and screenshots of statistics are attached.

Events on February 2: 00:04:59 - node updated to version 1.6.0, but not yet active. 15:28:00 - node moved to the active state; it missed a lot of votes and received an F rating. 18:47:59 - node was rebooted, version still 1.6.0. 18:52:36 - returned to version 1.5.0; starting with session 36645 it worked normally. kusama.log.gz IMG_7145 2

anignatev commented 7 months ago

Can we get more data on who is affected by this and what machines + operating systems + versions everybody affected is running? Also, what are the "similar" problems? How similar? It would be good if they could share their data in as much detail as possible as well.

Hi, I am attaching an archive with hardware and operating system data. Please note that from communicating with other 1KV program members it has already become clear that the problem happened on different operating systems and on powerful servers. Unfortunately, I don't have data on the difference in CPU/network usage between these versions, but I have requested more data from other 1KV members.

system_info.zip

bkchr commented 7 months ago

and your node imports at 18:38:58

Is the node definitely running with a synchronized clock?

Otherwise, we will need sync=trace logs (be aware that this generates a lot of logs)
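For reference, a hedged sketch of how to check both points on a typical systemd-based host (timedatectl is assumed to be available; the log target is the one named above and is simply appended to the node's existing flags):

    # clock: should report "System clock synchronized: yes"
    timedatectl | grep -i synchronized
    # logs: add the extra target next to the flags you already pass, e.g.
    # /usr/local/bin/polkadot --validator <existing flags> -lsync=trace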

bkchr commented 7 months ago

Also, do you run any special configuration? Any CLI arguments?

alexggh commented 7 months ago

I see you first sent us an analysis on polkadot and then an analysis on kusama. The reason I ask is that on kusama we do have item 2) from https://github.com/paritytech/polkadot-sdk/issues/3314#issuecomment-1953542045, which is currently not root-caused and which looks like the issue you are having. On kusama you do see 0 backing rewards, which is consistent with the bug we are trying to track down, but on polkadot you are still getting 20% of the rewards, so these might be different issues.

alexggh commented 7 months ago

Ok, looked at the kusama.log.gz. The logs are really sparse, but what I noticed is that you restarted your node at 14:09:44, before it entered the active set, and at that point the node seems to have just a WebSocket external address and not a plain P2P one; see the ws/p2p in the address instead of just p2p.

Not clear to me how you got into that state, but I don't think your authority would be discovered by others with just that external address. @altonen @dmitry-markin, thoughts, or is it just a red herring?

Feb  2 00:04:59 ks1 polkadot[2100]: 2024-02-02 00:04:59 ✌️  version 1.6.0-5c79ed37dc6
Feb  2 00:05:07 ks1 polkadot[2100]: 2024-02-02 00:05:07 🔍 Discovered new external address for our node: /ip4/62.77.130.13/tcp/30333/p2p/12D3KooWJryJLfeM1eJGhGgLggxQebUaBKRTgmTooANwteC2YpZQ
Feb  2 14:09:44 ks1 polkadot[81405]: 2024-02-02 14:09:44 ✌️  version 1.6.0-5c79ed37dc6
Feb  2 14:09:47 ks1 polkadot[81405]: 2024-02-02 14:09:47 🔍 Discovered new external address for our node: /ip4/62.77.130.13/tcp/30333/ws/p2p/12D3KooWJryJLfeM1eJGhGgLggxQebUaBKRTgmTooANwteC2YpZQ
Feb  2 18:47:59 ks1 polkadot[6035]: 2024-02-02 18:47:59 ✌️  version 1.6.0-5c79ed37dc6
Feb  2 18:48:02 ks1 polkadot[6035]: 2024-02-02 18:48:02 🔍 Discovered new external address for our node: /ip4/62.77.130.13/tcp/30331/ws/p2p/12D3KooWJryJLfeM1eJGhGgLggxQebUaBKRTgmTooANwteC2YpZQ
Feb  2 18:52:36 ks1 polkadot[6543]: 2024-02-02 18:52:36 ✌️  version 1.5.0-e5b2adac7ad
Feb  2 18:52:42 ks1 polkadot[6543]: 2024-02-02 18:52:42 🔍 Discovered new external address for our node: /ip4/62.77.130.13/tcp/30331/p2p/12D3KooWJryJLfeM1eJGhGgLggxQebUaBKRTgmTooANwteC2YpZQ
Feb  2 18:54:28 ks1 polkadot[6734]: 2024-02-02 18:54:28 ✌️  version 1.5.0-e5b2adac7ad
Feb  2 18:54:32 ks1 polkadot[6734]: 2024-02-02 18:54:32 🔍 Discovered new external address for our node: /ip4/62.77.130.13/tcp/30331/p2p/12D3KooWJryJLfeM1eJGhGgLggxQebUaBKRTgmTooANwteC2YpZQ
altonen commented 7 months ago

It shouldn't have a WebSocket listening address if the node is run as a validator. Might be worth running the node with -lsub-libp2p=trace,libp2p=trace to figure out why that's happening.

tunkas commented 7 months ago

Not sure if this will be helpful on this issue, but it might be, so I'm posting my logs. It happened on my Kusama validator, in the first session I paravalidated after getting into the active set. I was missing all the votes in that session, so 45 minutes into the active set a restart was performed. Not sure if this had any influence on resolving the matter, since I still missed all the votes within that session. The next session went by without missing a significant amount of votes. Here are the logs for that session, and the subsequent one. I'm running the latest version, of course. Kusama error report.txt

alexggh commented 7 months ago

Ok, looked a bit more closely @anignatev, and it seems your node starts with different roles:

00:04:59 ks1 polkadot[2100]: 2024-02-02 00:04:59 👤 Role: AUTHORITY
14:09:44 ks1 polkadot[81405]: 2024-02-02 14:09:44 👤 Role: FULL
18:47:59 ks1 polkadot[6035]: 2024-02-02 18:47:59 👤 Role: FULL
18:52:36 ks1 polkadot[6543]: 2024-02-02 18:52:36 👤 Role: AUTHORITY
18:54:28 ks1 polkadot[6734]: 2024-02-02 18:54:28 👤 Role: AUTHORITY

When it is FULL it doesn't seem to work. Did you change the CLI arguments somehow? Why would it use a different role? I don't think it is detected dynamically, or is it @bkchr?

bkchr commented 7 months ago

I don't think it is detected dynamically, or is it @bkchr?

Yeah not detected dynamically. The role is based on passing --validator.
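A purely illustrative sketch of the difference (the node name is a placeholder; everything except --validator is whatever you already pass):

    polkadot --chain=kusama --name mynode                # logs "Role: FULL"
    polkadot --chain=kusama --name mynode --validator    # logs "Role: AUTHORITY"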

alexggh commented 7 months ago

Yeah, ignore my previous comment; it seems the log file contains logs from multiple nodes. I assume infsysgroup is the one of interest, @anignatev?


Feb  2 00:04:59 ks1 polkadot[2100]: 2024-02-02 00:04:59 🏷  Node name: infsysgroup
Feb  2 00:04:59 ks1 polkadot[2100]: 2024-02-02 00:04:59 👤 Role: AUTHORITY
Feb  2 14:09:44 ks1 polkadot[81405]: 2024-02-02 14:09:44 🏷  Node name: overrated-marble-6094
Feb  2 14:09:44 ks1 polkadot[81405]: 2024-02-02 14:09:44 👤 Role: FULL
Feb  2 18:47:59 ks1 polkadot[6035]: 2024-02-02 18:47:59 🏷  Node name: uptight-sister-2530
Feb  2 18:47:59 ks1 polkadot[6035]: 2024-02-02 18:47:59 👤 Role: FULL
Feb  2 18:52:36 ks1 polkadot[6543]: 2024-02-02 18:52:36 🏷  Node name: infsysgroup
Feb  2 18:52:36 ks1 polkadot[6543]: 2024-02-02 18:52:36 👤 Role: AUTHORITY
Feb  2 18:54:28 ks1 polkadot[6734]: 2024-02-02 18:54:28 🏷  Node name: infsysgroup
Feb  2 18:54:28 ks1 polkadot[6734]: 2024-02-02 18:54:28 👤 Role: AUTHORITY
alexggh commented 7 months ago

Not sure if this will be helpful on this issue, but it might be, so I'm posting my logs. It happened on my Kusama validator, in the first session I paravalidated after getting into the active set. I was missing all the votes in that session, so 45 minutes into the active set a restart was performed. Not sure if this had any influence on resolving the matter, since I still missed all the votes within that session. The next session went by without missing a significant amount of votes. Here are the logs for that session, and the subsequent one. I'm running the latest version, of course. Kusama error report.txt

@tunkas this seems fresh; have you got a link on apps.turboflakes.io to your validator?

tunkas commented 7 months ago

Yes, this is from 18 hours ago. I had another one 5 hours ago :( Here is the link DBfT2GUqHX89afMhTzGCCbAc44zX33d4XySWX2qAPxZ35KE Sending the latest logs soon also.

alexggh commented 7 months ago

Yes, this is from 18 hours ago. I had another one 5 hours ago :( Here is the link DBfT2GUqHX89afMhTzGCCbAc44zX33d4XySWX2qAPxZ35KE Sending the latest logs soon also.

@tunkas Since it seems this happens often to you, would you be able to run your node with:

-lparachain=debug

or even trace, if it is not too much to ask:

-lparachain=trace

And give me the logs for a time when that happens?

tunkas commented 7 months ago

Here are the logs from this morning; nothing dubious in them, imho. But that one was not as bad as the one last night. Kusama report 2.txt

tunkas commented 7 months ago

Yes, this is from 18 hours ago. I had another one 5 hours ago :( Here is the link DBfT2GUqHX89afMhTzGCCbAc44zX33d4XySWX2qAPxZ35KE Sending the latest logs soon also.

@tunkas Since it seems this happens often to you, would you be able to run your node with:

-lparachain=debug

or even trace, if it is not too much to ask:

-lparachain=trace

And give me the logs for a time when that happens?

Could you explain more about this flag? What does it do exactly? Where do I put it in my service file?

Thanks!

alexggh commented 7 months ago

Yes, this is from 18 hours ago. I had another one 5 hours ago :( Here is the link DBfT2GUqHX89afMhTzGCCbAc44zX33d4XySWX2qAPxZ35KE Sending the latest logs soon also.

@tunkas Since it seems this happens often to you, would you be able to run your node with:

-lparachain=debug

or even trace, if it is not too much to ask:

-lparachain=trace

And give me the logs for a time when that happens?

Could you explain more about this flag? What does it do exactly? Where do I put it in my service file?

Thanks!

These are arguments to the polkadot binary; they increase the accepted log levels. debug is the minimum level we need to be able to understand what is going on; trace is the most verbose.

Both will considerably increase the size of your logfiles.

tunkas commented 7 months ago

Ok, understood. What would be the amount of this 'significant increase'? Is it something that would require me to upgrade my disk (my current remaining space is around 70 GB)?

So I just add them in my service file alongside other arguments? For Ubuntu it would be '--lparachain=trace'? Thanks!

alexggh commented 7 months ago

Ok, understood. What would be the amount of this 'significant increase'? Is it something that would require me to upgrade my disk (my current remaining space is around 70 GB)?

So I just add them in my service file alongside other arguments? For Ubuntu it would be '--lparachain=trace'? Thanks!

70 GB might not be enough if it runs for a long time; let's use debug, not trace.

Yes, adding it next to the other arguments should be enough.
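For completeness, a hedged sketch of what that looks like with systemd (the unit name and all flags except the log target are placeholders, keep your own):

    # in the unit file, append the log target to the existing ExecStart line:
    ExecStart=/usr/local/bin/polkadot --validator --chain=kusama --name mynode -lparachain=debug
    # then reload the unit and restart the service:
    systemctl daemon-reload && systemctl restart polkadot.service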

tunkas commented 7 months ago

OK, I'll try to time it for my next entrance into the active set, and will report back here with logs if I get the same result again 👍

Thanks again for your assistance!

anignatev commented 7 months ago

Yeah, ignore my previous comment; it seems the log file contains logs from multiple nodes. I assume infsysgroup is the one of interest, @anignatev?


Feb  2 00:04:59 ks1 polkadot[2100]: 2024-02-02 00:04:59 🏷  Node name: infsysgroup
Feb  2 00:04:59 ks1 polkadot[2100]: 2024-02-02 00:04:59 👤 Role: AUTHORITY
Feb  2 14:09:44 ks1 polkadot[81405]: 2024-02-02 14:09:44 🏷  Node name: overrated-marble-6094
Feb  2 14:09:44 ks1 polkadot[81405]: 2024-02-02 14:09:44 👤 Role: FULL
Feb  2 18:47:59 ks1 polkadot[6035]: 2024-02-02 18:47:59 🏷  Node name: uptight-sister-2530
Feb  2 18:47:59 ks1 polkadot[6035]: 2024-02-02 18:47:59 👤 Role: FULL
Feb  2 18:52:36 ks1 polkadot[6543]: 2024-02-02 18:52:36 🏷  Node name: infsysgroup
Feb  2 18:52:36 ks1 polkadot[6543]: 2024-02-02 18:52:36 👤 Role: AUTHORITY
Feb  2 18:54:28 ks1 polkadot[6734]: 2024-02-02 18:54:28 🏷  Node name: infsysgroup
Feb  2 18:54:28 ks1 polkadot[6734]: 2024-02-02 18:54:28 👤 Role: AUTHORITY

Ignore the different startup modes. To ensure uninterrupted operation, we have a main and a backup server for each node. When we were experimenting, we transferred the validation to another server, and this one was switched to database-synchronization-only mode. I apologize for missing this detail; I am attaching the missing log (Feb 2 14:09:10 - Feb 2 18:52:18).

kusama_another_server.log.gz

CLI arguments KUSAMA: /usr/local/bin/polkadot --database paritydb --no-hardware-benchmarks --validator --name infsysgroup --chain=kusama --keystore-path ${keystore_dir} --base-path ${data_dir} --telemetry-url 'wss://telemetry-backend.w3f.community/submit 1' --insecure-validator-i-know-what-i-do --port=30331

CLI arguments POLKADOT: /usr/local/bin/polkadot --database paritydb --validator --name infsysgroup --keystore-path ${keystore_dir} --base-path ${data_dir} --telemetry-url 'wss://telemetry-backend.w3f.community/submit 1' --insecure-validator-i-know-what-i-do --port=30330

vilija19 commented 7 months ago

In my case, the node missed all votes and authored one block. It happens when a non-active validator becomes active and is assigned as a para-validator in the first session. The next session was normal - no missed votes. Node version 1.8.0. Anomaly in the logs during the first session:

Mar 07 18:40:13 ns547523 polkadot[3156224]: 2024-03-07 18:40:13 Cluster has too many pending statements, something wrong with our connection to our group peers
Mar 07 18:40:13 ns547523 polkadot[3156224]: Restart might be needed if validator gets 0 backing rewards for more than 3-4 consecutive sessions pending_statements={ValidatorIndex(60): {(ValidatorIndex(61), CompactStatement::Seconded(0x3f4504bad2ce6949089562e71ae0bfae9b59aa4fdb67b5b2f8b10e82a54273d7))}, ValidatorIndex(61): {(ValidatorIndex(61), CompactStatement::Seconded(0x3f4504bad2ce6949089562e71ae0bfae9b59aa4fdb67b5b2f8b10e82a54273d7))}, ValidatorIndex(62): {(ValidatorIndex(61), CompactStatement::Seconded(0x3f4504bad2ce6949089562e71ae0bfae9b59aa4fdb67b5b2f8b10e82a54273d7))}} parent_hash=0x53a67567c66031913d84f4a96ca7531abc6c510098bec9bc5216c50656fc026d

Mar 07 18:45:43 ns547523 polkadot[4035256]: 2024-03-07 18:45:43 received notification from 12D3KooWFqS1BZn3g8nc6NuMtLtrQMjoRmF3CzkjVduTTXBNaUPS who had been earlier refused by `SyncingEngine
eskimor commented 7 months ago

What does the node say about hardware specs?

2024-03-11 21:45:50 🏁 CPU score: 1.04 GiBs    
2024-03-11 21:45:50 🏁 Memory score: 10.77 GiBs    
2024-03-11 21:45:50 🏁 Disk score (seq. writes): 898.52 MiBs    
2024-03-11 21:45:50 🏁 Disk score (rand. writes): 265.97 MiBs    

In particular, what does it say about minimum requirements being met?

dcolley commented 7 months ago

I added my logs here: https://github.com/paritytech/polkadot-sdk/issues/3314#issuecomment-1990996796

validorange commented 7 months ago

I wanted to report a similar event. One of my validators missed 100% votes for Kusama session 37483. The validator was entering the active set on this session. Validator is DiSdjgV1fEu2i5wzxerBuuAWpypUQjAbt8WgLYWMZT4PGHj. Running version 1.8.0. It had been upgraded and restarted the day before. The next session it was active as a paravalidator was 37488 and it was normal.

The first hint I see in the logs of something abnormal is about 10 minutes after the session starts:

Mar  8 12:37:18 vo-ash1 polkadot[2066807]: 2024-03-08 12:37:18 Report 12D3KooWC8XXFBmKiDCTmdwmRAyXfCv8Qou9XLMZskZBdcCt8rv4: -2147483648 to -2147483648. Reason: A collator provided a collation for the wrong para. Banned, disconnecting.
Mar  8 12:37:22 vo-ash1 polkadot[2066807]: 2024-03-08 12:37:22 💤 Idle (16 peers), best: #22203794 (0x2bd2…cd65), finalized #22203790 (0x4f40…107c), ⬇ 2.8MiB/s ⬆ 2.7MiB/s
Mar  8 12:37:24 vo-ash1 polkadot[2066807]: 2024-03-08 12:37:24 ✨ Imported #22203795 (0x9d2e…6608)
Mar  8 12:37:27 vo-ash1 polkadot[2066807]: 2024-03-08 12:37:27 💤 Idle (16 peers), best: #22203795 (0x9d2e…6608), finalized #22203791 (0x98d4…da95), ⬇ 2.9MiB/s ⬆ 3.2MiB/s
Mar  8 12:37:30 vo-ash1 polkadot[2066807]: 2024-03-08 12:37:30 ✨ Imported #22203796 (0x4f33…d716)
Mar  8 12:37:30 vo-ash1 polkadot[2066807]: 2024-03-08 12:37:30 ✨ Imported #22203796 (0x6cb5…fe2c)
Mar  8 12:37:32 vo-ash1 polkadot[2066807]: 2024-03-08 12:37:32 💤 Idle (16 peers), best: #22203796 (0x4f33…d716), finalized #22203791 (0x98d4…da95), ⬇ 2.0MiB/s ⬆ 2.7MiB/s
Mar  8 12:37:34 vo-ash1 polkadot[2066807]: 2024-03-08 12:37:34 Cluster has too many pending statements, something wrong with our connection to our group peers
Mar  8 12:37:34 vo-ash1 polkadot[2066807]: Restart might be needed if validator gets 0 backing rewards for more than 3-4 consecutive sessions pending_statements={ValidatorIndex(164): {(ValidatorIndex(162), CompactStatement::Seconded(0xe8275dca90e0cf4525f69dd4d908d126f713b0730808889d05da773ba42fbadd))}, ValidatorIndex(162): {(ValidatorIndex(162), CompactStatement::Seconded(0xe8275dca90e0cf4525f69dd4d908d126f713b0730808889d05da773ba42fbadd))}, ValidatorIndex(163): {(ValidatorIndex(162), CompactStatement::Seconded(0xe8275dca90e0cf4525f69dd4d908d126f713b0730808889d05da773ba42fbadd))}} parent_hash=0x8a2be6b75e99443a2fed5828f119c2773ffa369c1449786f2afaf8002ab870b8
Mar  8 12:37:34 vo-ash1 polkadot[2066807]: 2024-03-08 12:37:34 Cluster has too many pending statements, something wrong with our connection to our group peers
Mar  8 12:37:34 vo-ash1 polkadot[2066807]: Restart might be needed if validator gets 0 backing rewards for more than 3-4 consecutive sessions pending_statements={ValidatorIndex(164): {(ValidatorIndex(162), CompactStatement::Seconded(0x6fafe0e985961474b4f03e22f60c5f2be0bddd0aec418196435b091a69632fe9))}, ValidatorIndex(162): {(ValidatorIndex(162), CompactStatement::Seconded(0x6fafe0e985961474b4f03e22f60c5f2be0bddd0aec418196435b091a69632fe9))}, ValidatorIndex(163): {(ValidatorIndex(162), CompactStatement::Seconded(0x6fafe0e985961474b4f03e22f60c5f2be0bddd0aec418196435b091a69632fe9))}} parent_hash=0x1b0a847ea3170bf3e03cc799038a78586e931d5bdcc51864da299b2911a032ef

Server is bare metal, Ubuntu 22.04. Not running in a container. There is also a Polkadot validator on this host (not in the active set recently).

Version is 1.8.0

Hardware passes the specs when I run the benchmark alone, but I see that it did not pass on startup, likely due to another validator process already running.

syslog.2066807.gz

eskimor commented 7 months ago

Thanks! We found one issue with changing PeerIds. If you restart the node, do you get the same PeerId every time?

Check for output like this:

2024-03-12 23:18:50 🔍 Discovered new external address for our node: /ip4/81.223.86.162/tcp/30333/ws/p2p/12D3KooWKTMGaLha3tA3KPtWnS43UQpJ6jNo4EHAn914LtK1t58q    

The last part after p2p is your PeerId and should not change with restarts. This should be the case if the node is able to persist all data on disk.
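A hedged way to verify this from existing logs (assumes syslog-style output like the excerpts in this thread; rotated .gz files would need zgrep instead):

    # the start-up line "Local node identity is: <PeerId>" should show a single unique value
    grep 'Local node identity' /var/log/syslog* | awk '{ print $NF }' | sort | uniq -c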

tunkas commented 7 months ago

What I can see from my logs is that every time I had an F session it was due to peer numbers under 20. As soon as it recovered to over 20 I had again an A+ session.

dcolley commented 7 months ago

this morning I got C+ and I see these in the logs:

2024-03-13T06:31:57.898828+00:00 ksm-validator-1 polkadot[1179]: 2024-03-13 06:31:57 fetch_pov_job err=FetchPoV(NetworkError(Network(Timeout))) para_id=Id(2023) pov_hash=0x3304c769525848e8256f313a40f63713ab42f5ad6f13d073f56ff4f31bd6f425 authority_id=Public(ba4b6b04ee7784303a6a9aa3cdeaba97219b5dbbe1891ce26f1846aeda82f459 (GnapzFMm...))
2024-03-13T06:32:00.351928+00:00 ksm-validator-1 polkadot[1179]: 2024-03-13 06:32:00 ✨ Imported #22271889 (0xfed4…2383)
2024-03-13T06:32:00.401813+00:00 ksm-validator-1 polkadot[1179]: 2024-03-13 06:32:00 💤 Idle (19 peers), best: #22271889 (0xfed4…2383), finalized #22271881 (0x66a3…2c82), ⬇ 2.8MiB/s ⬆ 2.9MiB/s
2024-03-13T06:32:02.705121+00:00 ksm-validator-1 polkadot[1179]: 2024-03-13 06:32:02 fetch_pov_job err=FetchPoV(NetworkError(Network(ConnectionClosed))) para_id=Id(2015) pov_hash=0x893df3623f005ffea58a0b516a30af401943e6d6cb4648c03e48071290e23f95 authority_id=Public(ba4b6b04ee7784303a6a9aa3cdeaba97219b5dbbe1891ce26f1846aeda82f459 (GnapzFMm...))
2024-03-13T06:32:05.408718+00:00 ksm-validator-1 polkadot[1179]: 2024-03-13 06:32:05 💤 Idle (19 peers), best: #22271889 (0xfed4…2383), finalized #22271885 (0x6303…23ba), ⬇ 3.9MiB/s ⬆ 3.2MiB/s
2024-03-13T06:32:06.510206+00:00 ksm-validator-1 polkadot[1179]: 2024-03-13 06:32:06 fetch_pov_job err=FetchPoV(NetworkError(Network(Timeout))) para_id=Id(2015) pov_hash=0x893df3623f005ffea58a0b516a30af401943e6d6cb4648c03e48071290e23f95 authority_id=Public(fabdc17e1df70f5816345ebf20436a7513a8358e67fc953e004dc45b56f35d58 (JF5rjJCZ...))
2024-03-13T06:32:06.910549+00:00 ksm-validator-1 polkadot[1179]: 2024-03-13 06:32:06 ✨ Imported #22271890 (0x46b7…95dd)
2024-03-13T06:32:10.415439+00:00 ksm-validator-1 polkadot[1179]: 2024-03-13 06:32:10 💤 Idle (19 peers), best: #22271890 (0x46b7…95dd), finalized #22271886 (0xb96e…1aad), ⬇ 3.2MiB/s ⬆ 5.6MiB/s
2024-03-13T06:32:12.368533+00:00 ksm-validator-1 polkadot[1179]: 2024-03-13 06:32:12 ✨ Imported #22271891 (0x6951…75e4)
2024-03-13T06:32:15.423256+00:00 ksm-validator-1 polkadot[1179]: 2024-03-13 06:32:15 💤 Idle (19 peers), best: #22271891 (0x6951…75e4), finalized #22271886 (0xb96e…1aad), ⬇ 3.2MiB/s ⬆ 3.1MiB/s
2024-03-13T06:32:15.423419+00:00 ksm-validator-1 polkadot[1179]: 2024-03-13 06:32:15 fetch_pov_job err=FetchPoV(NetworkError(Network(Timeout))) para_id=Id(2015) pov_hash=0xfd97055c3b7aa3173742a45e01c565f8b3c705d9704c85eba796a21baf3ca59e authority_id=Public(ba4b6b04ee7784303a6a9aa3cdeaba97219b5dbbe1891ce26f1846aeda82f459 (GnapzFMm...))

The next session I just got A+ https://apps.turboflakes.io/?chain=kusama#/validator/HyLisujX7Cr6D7xzb6qadFdedLt8hmArB6ZVGJ6xsCUHqmx?mode=history

eskimor commented 7 months ago

Have you checked that your PeerId stays stable across restarts? (Described above)

vilija19 commented 7 months ago

Have you checked that your PeerId stays stable across restarts? (Described above)

Why is it so important? In most cases the missed-vote ratio becomes normal in the next session without any restarts.

alexggh commented 7 months ago

this morning I got C+ and I see these in the logs:

If you look at the session, your peers got 1000 points and 880 points and you got 800, so I think that is just normal when there is higher latency between peers. I don't think this is a problem; the 0-point sessions when you enter the active set are the problem, and I'm trying to root-cause that.

Why is it so important? In most cases the missed-vote ratio becomes normal in the next session without any restarts.

I explained the failure modes we know so far here: https://github.com/paritytech/polkadot-sdk/issues/3314#issuecomment-1993692229; not persisting the peer-id will cause problems.

dcolley commented 7 months ago

I think the peerId is crucial, as we publish this in our service file:

--public-addr /ip4/31.22.13.147/tcp/28032/p2p/12D3KooWQerRy4Ccb1mMk1ZcsCHMJXhYj1tC5seoAPTMWx6cVMkN

I don't have the peerId in the command flags; I expect the node uses chains/ksmcc3/network/secret_ed25519 for the Id. After a restart the service prints the node id:

2024-03-13T10:11:33.954640+00:00 ksm-validator-1 polkadot[42634]: 2024-03-13 10:11:33 Can't use warp sync mode with a partially synced database. Reverting to full sync mode.
2024-03-13T10:11:33.955718+00:00 ksm-validator-1 polkadot[42634]: 2024-03-13 10:11:33 🏷  Local node identity is: 12D3KooWQerRy4Ccb1mMk1ZcsCHMJXhYj1tC5seoAPTMWx6cVMkN
2024-03-13T10:11:33.955780+00:00 ksm-validator-1 polkadot[42634]: 2024-03-13 10:11:33 🔍 Discovered new external address for our node: /ip4/31.22.13.147/tcp/28032/p2p/12D3KooWQerRy4Ccb1mMk1ZcsCHMJXhYj1tC5seoAPTMWx6cVMkN/p2p/12D3KooWQerRy4Ccb1mMk1ZcsCHMJXhYj1tC5seoAPTMWx6cVMkN
2024-03-13T10:11:33.955806+00:00 ksm-validator-1 polkadot[42634]: 2024-03-13 10:11:33 Warp sync failed. Continuing with full sync.
2024-03-13T10:11:33.975421+00:00 ksm-validator-1 polkadot[42634]: 2024-03-13 10:11:33 💻 Operating system: linux
2024-03-13T10:11:33.975540+00:00 ksm-validator-1 polkadot[42634]: 2024-03-13 10:11:33 💻 CPU architecture: x86_64
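If the worry is that the identity changes because that key file gets lost (for example after a base-path wipe), one hedged option is to pin the key explicitly with the node-key-file flag; the path below is only an example:

    # keep the libp2p secret key at a fixed location and point the node at it,
    # in addition to the flags already in the service file:
    /usr/local/bin/polkadot --validator --chain=kusama --node-key-file /var/lib/polkadot/node_key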
alexggh commented 7 months ago

What I can see from my logs is that every time I had an F session it was due to peer numbers under 20. As soon as it recovered to over 20 I had again an A+ session.

So, if I understand it correctly, every time you got an F session the following happened:

@tunkas How many times did this happen to you ?

@dcolley any chance you can look at this metric polkadot_parachain_peer_count{protocol=~"validation/.*"} around that time?
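For anyone who wants to spot-check the metric without Grafana, a hedged sketch (assumes the node's Prometheus endpoint is enabled on the default port 9615):

    curl -s http://localhost:9615/metrics | grep 'polkadot_parachain_peer_count'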

tunkas commented 7 months ago

All of the above is correct, except that the peer number was under 20, around 12-16. After it went offline twice during that session and was restarted, it got an A+ in the next session, when the peer number recovered to 20+ peers.

I think it happened 2 times before this one, on previous versions (1.6 and 1.7), and it was supposed to be fixed in recent versions.

alexggh commented 7 months ago

All of the above is correct, except that the peer number was under 20, around 12-16. After it went offline twice during that session and was restarted, it got an A+ in the next session, when the peer number recovered to 20+ peers.

I'm a bit confused: when you are in the active set it should jump up to 1000 if your node is correctly connected to the network, something like this: https://github.com/paritytech/polkadot-sdk/issues/3314#issuecomment-1991637420. Are you saying your metric is always around 20, no matter whether you are in the active set or not?

validorange commented 7 months ago

Thanks! We found one issue with changing PeerIds. If you restart the node, do you get the same PeerId every time?

Checking back through my logs, yes, the PeerId is the same on every restart. I don't know if it is important but I will mention that the validator is running on a non-standard port (40333 instead of 30333). There is another validator (Polkadot) running on 30333.

/var/log/syslog.1:Mar  6 20:00:58 vo-ash1 polkadot[1785142]: 2024-03-06 20:00:58 🔍 Discovered new external address for our node: /ip4/168.245.232.4/tcp/40333/p2p/12D3KooWDKVSeGjvigC5b9qUo3YrnDaBzDUBKtAFN4xNexTZnoNh
/var/log/syslog.1:Mar  7 19:04:21 vo-ash1 polkadot[2066807]: 2024-03-07 19:04:21 🔍 Discovered new external address for our node: /ip4/168.245.232.4/tcp/40333/p2p/12D3KooWDKVSeGjvigC5b9qUo3YrnDaBzDUBKtAFN4xNexTZnoNh
/var/log/syslog.2.gz:Feb 29 18:01:58 vo-ash1 polkadot[987]: 2024-02-29 18:01:58 🔍 Discovered new external address for our node: /ip4/168.245.232.4/tcp/40333/p2p/12D3KooWDKVSeGjvigC5b9qUo3YrnDaBzDUBKtAFN4xNexTZnoNh
/var/log/syslog.3.gz:Feb 19 16:54:39 vo-ash1 polkadot[3409746]: 2024-02-19 16:54:39 🔍 Discovered new external address for our node: /ip4/168.245.232.4/tcp/40333/p2p/12D3KooWDKVSeGjvigC5b9qUo3YrnDaBzDUBKtAFN4xNexTZnoNh

I only have one "F" session recently but it matches the pattern tunkas described: the "bad" session started with 17 peers, the next paravalidator session started with 23 peers and was good ("A+"). I did notice way back when 1.0.0 came out that maintaining 40 peers was no longer common. Before that the node consistently had 40 peers.

alexggh commented 7 months ago

20-40 connected peers is definitely not the number you guys should see when you are inside the active set; that should be close to 1000. I'm surprised you get any A+ with that low connectivity. So either we are not talking about the same metric (we use something like this to plot it in grafana: avg(polkadot_parachain_peer_count{chain="$chain", node=~".*validator.*"}) by(protocol)), or something really nefarious is going on here. Are you guys able to run with -lparachain=debug and provide all the logs up until the time you hit the issue?

tunkas commented 7 months ago

We're obviously not talking about the same metric. The peers I'm talking about are visible in almost every line of our logs, and I never saw the count higher than 40, as Orange mentioned above.

validorange commented 7 months ago

The polkadot binary defaults to 8 outgoing connections and 32 inbound, for a combined maximum of 40.

From polkadot --help:

      --out-peers <COUNT>
          Number of outgoing connections we're trying to maintain

          [default: 8]

      --in-peers <COUNT>
          Maximum number of inbound full nodes peers

          [default: 32]

This is the peer count I'm referring to, which I understand to be the peers I have a direct tcp connection with, which is likely a subset of the active validator set. Actually that is an interesting question -- if my validator is not directly talking to the 2 other members of the validator group to which it belongs, can it participate? Is it failing to connect to those specific validators for some reason?

alexggh commented 7 months ago

This is the peer count I'm referring to, which I understand to be the peers I have a direct tcp connection with, which is likely a subset of the active validator set. Actually that is an interesting question -- if my validator is not directly talking to the 2 other members of the validator group to which it belongs, can it participate? Is it failing to connect to those specific validators for some reason?

No, that's on top of the reserved_peer_set, and in the reserved peer-set for the validation protocol we add all authorities in the active set + the next ones and the ones from the past few sessions. So your node will always be connected to all authorities in the active set.

Peers I'm talking about are visible in almost every line on our logs and I never saw it higher than 40, as Orange mentioned above.

Which logs, and how are you counting that?

dcolley commented 7 months ago

@dcolley any chance you can look at this metric polkadot_parachain_peer_count{protocol=~"validation/.*"} around that time ?

I did not collect stats in prometheus on this new server. I am now collecting stats and I'll report back as soon as I have anything further.

validorange commented 7 months ago

OK, netstat -an | grep ESTABLISHED | wc confirms I have about 1700 connections on this host (with 1 KSM and 1 DOT validator running) so this checks out. This suggests that the peer count reported in syslog is not very meaningful.
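A hedged sketch for narrowing that count down to one node's p2p port instead of everything on the host (assumes iproute2's ss is available; replace 30333 with the value passed to --port):

    # established connections on the node's libp2p port, header line stripped
    ss -tn state established '( sport = :30333 or dport = :30333 )' | tail -n +2 | wc -l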

Every 5 seconds the polkadot binary logs a message like this one to syslog:

Mar 13 14:46:58 vo-ash1 polkadot[1011]: 2024-03-13 14:46:58 💤 Idle (17 peers), best: #19882624 (0x1457…0e8d), finalized #19882620 (0xa6c2…14f0), ⬇ 506.7kiB/s ⬆ 321.5kiB/s

I have restarted my node with -lparachain=debug, however I haven't seen the issue frequently so I'm not sure when it might pay off.

alexggh commented 7 months ago

I have restarted my node with -lparachain=debug, however I haven't seen the issue frequently so I'm not sure when it might pay off.

Thank you, let me know when/if it happens; I can't reproduce it on our test networks, so any data I get on this is good data.

Idle (17 peers)

Yeah, those are the connections on a different protocol, not the ones used for gossiping backed statements.

validorange commented 7 months ago

[image: chart of ESTABLISHED and TIME_WAIT TCP connection counts]

So I noticed my monitoring software is tracking TCP connection stats; this chart shows a spike starting right around the beginning of session 37483, the one my validator struggled with. Both ESTABLISHED and TIME_WAIT jump at the beginning of the session. The TIME_WAIT count stays high for one hour, then drops back down to normal levels. TCP connections stay in TIME_WAIT for 60 seconds after they are closed, so this steadily high number for a full hour suggests a lot of churn in TCP connections for some reason. Maybe my node was somehow triggering bans/disconnects from the nodes it was talking to?

The other detail I noticed is that the validator logged two different hashes for the starting block for this session. Looking through the logs, that does happen occasionally. I don't know if that is significant or not. (Note the lines that say "New epoch 37483 launching at...") Any chance that the paravalidator work would be using the old/wrong hash, and therefore unable to participate with peers?

Mar  8 12:27:54 vo-ash1 polkadot[2066807]: 2024-03-08 12:27:54 👶 New epoch 37483 launching at block 0xc12a…4a9b (block slot 284983479 >= start slot 284983479).
Mar  8 12:27:54 vo-ash1 polkadot[2066807]: 2024-03-08 12:27:54 👶 Next epoch starts at slot 284984079
Mar  8 12:27:54 vo-ash1 polkadot[2066807]: 2024-03-08 12:27:54 ✨ Imported #22203701 (0xc12a…4a9b)
Mar  8 12:27:54 vo-ash1 polkadot[2066807]: 2024-03-08 12:27:54 Node uses the following topology indices index_in_gossip_topology=Some(ValidatorIndex(162)) index_in_parachain_authorities=Some(Active(ValidatorIndex(162)))
Mar  8 12:27:55 vo-ash1 polkadot[2066807]: 2024-03-08 12:27:55 👶 New epoch 37483 launching at block 0x1366…0905 (block slot 284983479 >= start slot 284983479).
Mar  8 12:27:55 vo-ash1 polkadot[2066807]: 2024-03-08 12:27:55 👶 Next epoch starts at slot 284984079
Mar  8 12:27:55 vo-ash1 polkadot[2066807]: 2024-03-08 12:27:55 ♻️  Reorg on #22203701,0xc12a…4a9b to #22203701,0x1366…0905, common ancestor #22203700,0x089c…d11e
Mar  8 12:27:55 vo-ash1 polkadot[2066807]: 2024-03-08 12:27:55 ✨ Imported #22203701 (0x1366…0905)
validorange commented 7 months ago

I found a different kusama session where my validator participated (successfully) and one of the other nodes in the group scored 0. This was also a case where the 'bad' validator was just joining the active set. Take a look at https://apps.turboflakes.io/?chain=kusama&app=onet#/validator/5CHw7xnARj89p8xQAC86d5e7mjU6jFgsApbM8PWxp5NKbV1k, session 37531.

My node, which was in the same validator group as the bad node, has logs shown below. 1) This was another case where two different block hashes were logged as the start of the epoch and 2) There is an additional log entry here about "Trying to remove unknown reserved node..." which might be referencing the validator that got a 0 score for the session. 3) The new authority set isn't applied by my node until 15 seconds later. Perhaps it is rejecting the other member of the validator group because it does not yet recognize it as a member of the authority set?


Mar 10 12:27:55 vo-ash1 polkadot[2066807]: 2024-03-10 12:27:55 👶 New epoch 37531 launching at block 0xc712…ec70 (block slot 285012279 >= start slot 285012279).
Mar 10 12:27:55 vo-ash1 polkadot[2066807]: 2024-03-10 12:27:55 👶 Next epoch starts at slot 285012879
Mar 10 12:27:55 vo-ash1 polkadot[2066807]: 2024-03-10 12:27:55 ✨ Imported #22232385 (0xc712…ec70)
Mar 10 12:27:55 vo-ash1 polkadot[2066807]: 2024-03-10 12:27:55 Trying to remove unknown reserved node 12D3KooWMvwDw55sdVK5e5A8qFPthp3SQDk1RW4Bznd7PMf8D9Pd from SetId(3).
Mar 10 12:27:55 vo-ash1 polkadot[2066807]: 2024-03-10 12:27:55 Node uses the following topology indices index_in_gossip_topology=Some(ValidatorIndex(225)) index_in_parachain_authorities=Some(Active(ValidatorIndex(225)))
Mar 10 12:27:56 vo-ash1 polkadot[2066807]: 2024-03-10 12:27:56 👶 New epoch 37531 launching at block 0xc397…dfd5 (block slot 285012279 >= start slot 285012279).
Mar 10 12:27:56 vo-ash1 polkadot[2066807]: 2024-03-10 12:27:56 👶 Next epoch starts at slot 285012879
Mar 10 12:27:56 vo-ash1 polkadot[2066807]: 2024-03-10 12:27:56 ♻️  Reorg on #22232385,0xc712…ec70 to #22232385,0xc397…dfd5, common ancestor #22232384,0x4197…23f3
Mar 10 12:27:56 vo-ash1 polkadot[2066807]: 2024-03-10 12:27:56 ✨ Imported #22232385 (0xc397…dfd5)
Mar 10 12:27:56 vo-ash1 polkadot[2066807]: 2024-03-10 12:27:56 💤 Idle (25 peers), best: #22232385 (0xc397…dfd5), finalized #22232382 (0x4bb9…3a7a), ⬇ 3.5MiB/s ⬆ 5.4MiB/s
Mar 10 12:28:00 vo-ash1 polkadot[2066807]: 2024-03-10 12:28:00 ✨ Imported #22232386 (0x4088…0257)
Mar 10 12:28:00 vo-ash1 polkadot[2066807]: 2024-03-10 12:28:00 ✨ Imported #22232386 (0x7b76…309b)
Mar 10 12:28:01 vo-ash1 polkadot[2066807]: 2024-03-10 12:28:01 💤 Idle (25 peers), best: #22232386 (0x4088…0257), finalized #22232383 (0x4cdf…e167), ⬇ 3.7MiB/s ⬆ 6.3MiB/s
Mar 10 12:28:06 vo-ash1 polkadot[2066807]: 2024-03-10 12:28:06 💤 Idle (25 peers), best: #22232386 (0x4088…0257), finalized #22232384 (0x4197…23f3), ⬇ 3.0MiB/s ⬆ 3.8MiB/s
Mar 10 12:28:06 vo-ash1 polkadot[2066807]: 2024-03-10 12:28:06 ✨ Imported #22232387 (0x6073…3506)
Mar 10 12:28:06 vo-ash1 polkadot[2066807]: 2024-03-10 12:28:06 ✨ Imported #22232387 (0xdd7c…cfec)
Mar 10 12:28:10 vo-ash1 polkadot[2066807]: 2024-03-10 12:28:10 👴 Applying authority set change scheduled at block #22232385
Mar 10 12:28:10 vo-ash1 polkadot[2066807]: 2024-03-10 12:28:10 👴 Applying GRANDPA set change to new set with 1000 authorities
Mar 10 12:28:10 vo-ash1 polkadot[2066807]: 2024-03-10 12:28:10 🥩 New Rounds for validator set id: 2953 with session_start 22232385
Mar 10 12:28:10 vo-ash1 polkadot[2066807]: 2024-03-10 12:28:10 🥩 Concluded mandatory round #22232385
Mar 10 12:28:11 vo-ash1 polkadot[2066807]: 2024-03-10 12:28:11 💤 Idle (25 peers), best: #22232387 (0x6073…3506), finalized #22232385 (0xc397…dfd5), ⬇ 4.2MiB/s ⬆ 4.3MiB/s
Mar 10 12:28:12 vo-ash1 polkadot[2066807]: 2024-03-10 12:28:12 ✨ Imported #22232388 (0x85de…293c)