paritytech / polkadot-sdk

The Parity Polkadot Blockchain SDK
https://polkadot.network/

Some missed statements as para-validator in the first session after rotated keys in a new machine #1819

Closed Luca-Poggi closed 1 month ago

Luca-Poggi commented 8 months ago

Description of bug

Hello, I'm Luca from 🧊 Iceberg Nodes 🧊. I'm opening this issue because our validators missed some statements while acting as para-validators in the first session after we rotated keys on a new machine.

On 2023-10-05 22:06:54 (+UTC), at block 19983668 in session 33772, we set keys on chain after rotating keys on a new machine for our 🧊 Iceberg Nodes 🧊/V2 (referred to below as "V2") - tx: https://kusama.subscan.io/extrinsic/19983668-2

At session 33776, the first session after setting keys in which V2 was a para-validator, it missed some statements: specifically 4/4 statements from one parachain (Turing Network in this case). You can see this at https://apps.turboflakes.io/?chain=kusama#/validator/H6rdnNwvHFKw5tfF7kXSssta5AYmysrJrZkRmAbzw6Vm3p8 by clicking "History", selecting the last 32 eras, and clicking the blue dot for para-validator session 33776. Note that V2's performance is optimal (Grade A+) in all para-validator sessions before and after the issue. To rule out performance problems, I waited several eras before opening this issue.

On its own this would be nothing to worry about; it could have been a network issue between the validator and the Turing collators during that session. However, the same issue appeared when we rotated keys on our 🧊 Iceberg Nodes 🧊/V1 (referred to below as "V1").

On 2023-10-06 15:14:18 (+UTC), at block 19993927 in session 33789, we set keys on chain after rotating keys on a new machine for V1 - tx: https://kusama.subscan.io/extrinsic/19993927-2

At session 33791, the first session after setting keys in which V1 was a para-validator, it missed some statements: specifically 5/5 from Khala and 4/4 from Bifrost. You can see this at https://apps.turboflakes.io/?chain=kusama#/validator/Eices1KaGTYqiazfjJpwyjnz5UzqTxULeYqnmeJNz49gs19 by clicking "History", selecting the last 32 eras, and clicking the blue dot for para-validator session 33791. In this case too, V1's performance is optimal (Grade A+) in all para-validator sessions before and after the issue.

My assumption is therefore the following: the problem is related to setting keys after rotating them on a new machine (which has a different PeerID), and some parachains apparently don't update the list of para-validators (perhaps the PeerID) to which they send the blocks they produce. After just one session the issue disappears and everything works as expected.
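To make the hypothesis concrete, here is a toy model of it. It is purely illustrative and is not taken from the polkadot-sdk code: the class names, validator label, and PeerID strings are all hypothetical. It contrasts a directory that keeps one PeerID per validator and never refreshes it with one that keeps every PeerID it has seen.

```python
from dataclasses import dataclass, field


@dataclass
class OneToOneDirectory:
    """Hypothetical: one PeerID per validator, first record wins (the suspected behaviour)."""
    peers: dict = field(default_factory=dict)

    def learn(self, validator: str, peer_id: str) -> None:
        # setdefault keeps the existing entry, so a PeerID learned later is ignored.
        self.peers.setdefault(validator, peer_id)

    def targets(self, validator: str) -> set:
        return {self.peers[validator]} if validator in self.peers else set()


@dataclass
class MultiPeerDirectory:
    """Hypothetical: remembers every PeerID seen for a validator."""
    peers: dict = field(default_factory=dict)

    def learn(self, validator: str, peer_id: str) -> None:
        self.peers.setdefault(validator, set()).add(peer_id)

    def targets(self, validator: str) -> set:
        # Return a copy so callers cannot mutate the stored set.
        return set(self.peers.get(validator, set()))


stale = OneToOneDirectory()
fresh = MultiPeerDirectory()
for directory in (stale, fresh):
    directory.learn("V2", "peer-old-machine")  # before key rotation
    directory.learn("V2", "peer-new-machine")  # after key rotation on the new machine
```

In this model a collator using `stale` would keep sending to the old machine only, which would match the missed statements, while one using `fresh` would also reach the new machine.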

Let me know if you need any additional information to investigate the issue. Thank you.

Steps to reproduce

1) Run a validator in the active set for some sessions.
2) Start a new machine with a new node; once it is synced, rotate keys and set the keys on chain.
3) Wait for the first session in which the validator on the new machine is a para-validator.
4) You should see some missed statements from some parachains.
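For the key rotation in step 2, a minimal sketch of calling the node's `author_rotateKeys` JSON-RPC method from Python is shown below. The endpoint URL and port are assumptions about a default local setup, and the helper names are hypothetical; the returned keys still have to be submitted on chain with a set keys transaction, as in the transactions linked above.

```python
import json
import urllib.request

# Assumption: the new node exposes its JSON-RPC endpoint locally on port 9944.
RPC_URL = "http://127.0.0.1:9944"


def build_rotate_keys_request(request_id: int = 1) -> bytes:
    """Encode a JSON-RPC 2.0 call to author_rotateKeys, which takes no parameters."""
    payload = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "author_rotateKeys",
        "params": [],
    }
    return json.dumps(payload).encode("utf-8")


def rotate_keys(url: str = RPC_URL) -> str:
    """Send the request to the node and return the hex-encoded public session keys."""
    request = urllib.request.Request(
        url,
        data=build_rotate_keys_request(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read())["result"]
```

Calling `rotate_keys()` on the new machine should be done against the local node only, since this RPC method generates new keys in the node's keystore.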

Luca-Poggi commented 8 months ago

Adding some screenshots from https://apps.turboflakes.io here, because the dashboard updates its stats very quickly and in a few days the information could be lost.

[Six screenshots from the dashboard]

bkchr commented 8 months ago

@ordian might we still have a one-to-one mapping from validator ID to authority discovery session key/PeerID somewhere? We return both the current and the next authorities, so authority discovery should discover all the new nodes, and each node should also announce itself correctly.

ordian commented 8 months ago

I remember @rphmeier had a branch, #1436, which included, among other things, some fixes to the authority discovery mapping, e.g. https://github.com/paritytech/polkadot-sdk/pull/1436/commits/6319b92e1eec044600c77224e30c84b8ae6fee39. I'll take a look next week.

alexggh commented 1 month ago

This was fixed with: https://github.com/paritytech/polkadot-sdk/pull/3733.