paritytech / polkadot-sdk

The Parity Polkadot Blockchain SDK
https://polkadot.network/

Polkadot validator stops gaining points #4321

Closed: dcolley closed this issue 4 months ago

dcolley commented 4 months ago

With no changes to the installation, this validator stopped gaining points: https://apps.turboflakes.io/?chain=polkadot#/validator/5HgM1fbhs7uRCB9KxNquaFBGLPxmQfYEcHT8GbNSrZ9HRWEY?mode=history

After rotating keys and restarting the node, it started gaining points again. In the next session it achieved an A+, then stopped again. The next session it got an F.

This morning I created a new node and transferred the network/secret_ed25519. The new node is not accumulating any points (yet).

dcolley commented 4 months ago

Syslog from the new server: syslog.new.tgz

dcolley commented 4 months ago

Syslog from the original docker node (only has logs from the most recent docker compose up -d): dot-validator.log.gz

dcolley commented 4 months ago

@alexggh

alexggh commented 4 months ago

2024-04-28T17:39:23.253368+00:00 dot-val-1 polkadot[4093]: 2024-04-28 17:39:23 🏷  Local node identity is: 12D3KooWKCiq7fQsQpfV7vH8vX7YymVRwxkZLsjZWx57m9HkCK26

2024-04-28 17:36:58 🏷  Local node identity is: 12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD   

2024-04-29T07:11:47.357237+00:00 dot-val-1 polkadot[143066]: 2024-04-29 07:11:47 🏷  Local node identity is: 12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD

I see your node has been running with two different node identities; unfortunately, that is bound to cause problems, and I think that's why you are getting 0s now, see https://github.com/paritytech/polkadot-sdk/issues/3673.

I'm not sure when you last rotated your keys, but it seems you are now using an old identity, not your newest one. See the timestamps: at 17:36:58 you start as 12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD, then at 17:39:23 you start with a different ID 12D3KooWKCiq7fQsQpfV7vH8vX7YymVRwxkZLsjZWx57m9HkCK26, then you go back again to 12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD.

So, to recover: first make sure you keep your identity stable, then rotate your keys, and the issue should go away.
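
For reference, a minimal sketch of those two steps in shell; the key path and RPC port below are assumptions based on common defaults, not details from this thread:

# 1. Keep the identity stable: the PeerId is derived from the network key,
#    which can be pinned explicitly (path assumed for --base-path /data on
#    the polkadot chain):
#      polkadot ... --node-key-file /data/chains/polkadot/network/secret_ed25519

# 2. Rotate the session keys via the node's RPC, then register the returned
#    hex blob on-chain with session.setKeys:
curl -H "Content-Type: application/json" \
     -d '{"id":1,"jsonrpc":"2.0","method":"author_rotateKeys","params":[]}' \
     http://localhost:9944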

dcolley commented 4 months ago

The timestamp 17:36:58 was when I corrected the --public-addr (it had the wrong nodeId). The logs report "Discovered new external address for our node..." even though it had the wrong ID. The incorrect --public-addr might initially confuse libp2p, but it should eventually settle down. We can see the peers connected, so this can't be the cause of the issue.

The issue occurred overnight when the ID was stable and the node achieved an A+ and then an F.

This morning I moved the node to a new server. Just to check the process: on a new server I sync the node (it will create a new ID in network/secret_ed25519). When the node is sync'd, I move secret_ed25519 to secret_ed25519.orig and copy the secret_ed25519 from the old server. Does the node pick up both IDs?
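
For what it's worth, a sketch of that swap in shell; the paths assume a --base-path of /data on the polkadot chain, and old-server is a placeholder:

NETDIR=/data/chains/polkadot/network
mv "$NETDIR/secret_ed25519" "$NETDIR/secret_ed25519.orig"
scp old-server:"$NETDIR/secret_ed25519" "$NETDIR/"
# restart the node afterwards so it comes up with the copied identity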

alexggh commented 4 months ago

You need your key to be present on the new machine before you start the node. Otherwise, if it starts with a new identity and a correct AuthorityId, nodes will start publishing the new address on the DHT, and because of the distributed nature of the DHT, where nodes replicate records regularly, there is no determinism as to which address your peers will see; even if you connect to them, they might conclude that the other PeerId should fulfil the job of this AuthorityId.

As a matter of fact, it is never safe to publish a new network identity while you are in the active set, because it will take 36h for that record to expire.

The issue occurred overnight when the ID was stable and the node achieved an A+ and then an F.

Even if it is stable, the old one will still exist on some nodes, and there is no way to know which one your peers think you are using. That is why a key rotation fixes it: by changing your AuthorityId you change the key under which other nodes look up your node in the network.

dcolley commented 4 months ago

I think we should not focus on the actions I took to restart the node. The point of this issue is that a functioning node, with no changes, stopped participating in consensus.

alexggh commented 4 months ago

I think we should not focus on the actions I took to restart the node. The point of this issue is that a functioning node, with no changes, stopped participating in consensus.

Apologies if it came across the wrong way; the actions matter because I'm trying to understand what happened. Unfortunately, with this type of bug, all the details you can offer us help us understand what happened and how to fix it.

We know for sure that changing the PeerId and the public address of your node is bound to cause problems until the past record expires 36h later; even if your node seemed to work correctly for a session, the old identity is still cached on other nodes. Fixes are underway to make this mistake harder to make: https://github.com/paritytech/polkadot-sdk/issues/3673.

Looking at the turboflakes app, it seems your node started getting Fs at session 8600, which from what I understand is after running the node with two identities, hence why I think it is the same problem as in https://github.com/paritytech/polkadot-sdk/issues/3673.

Now, I guess the problem you want us to focus on is what happened at session 8598, where I see your node getting a D and 50% of the rewards compared with its peers. Could you tell me the timeline there? Thank you!

dcolley commented 4 months ago

Here is the timeline: during session 8598 I got alerted (18:02 UK time) that the validator was not accumulating points. I quickly rotated the keys, corrected the --public-addr and restarted the node. Something I did caused the validator to start accumulating points again, and I left it to run overnight.

Suspecting it could perhaps be a hardware/disk/CPU/network card issue, I also started a new node last night to get a new DB warp-synced; this node was running under a different ID.

- in 8598 it got D (after the actions above)
- in 8599 it got A+
- in 8600 it got F
- in 8601 it got F
- in 8602 (this morning) I moved the original ID to the new server

alexggh commented 4 months ago

Scanned the network with https://github.com/lexnv/subp2p-explorer:

Your node seems to be advertising a lot of public addresses; bear in mind that the other nodes will accept only 10 addresses: https://github.com/paritytech/polkadot-sdk/blob/master/substrate/client/authority-discovery/src/worker.rs#L73

  1. First authority record for 12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD.
    authority="1zwDoEydTXsGixYmmTU7NVSEdhhXSUbx2hB7Ff7uAbTNsL8" peer_id=PeerId("12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD") addresses="Parity Polkadot/v1.10.0-7049c3c9883 (METASPAN (also try POOL #18))" version={"/ip4/100.64.3.1/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.138.6.232/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.129.115.196/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.48.122.94/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.48.7.233/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/172.16.4.5/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.129.25.166/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.47.25.233/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.48.1.205/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.138.98.169/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.48.30.114/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.48.24.254/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.47.119.220/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.47.16.172/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.47.117.0/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/172.16.16.10/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/172.19.10.59/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.48.106.194/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.138.120.223/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.138.97.240/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.138.101.185/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/172.16.12.1/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/100.64.4.1/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.129.121.106/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.138.125.59/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.48.14.232/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.47.100.89/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.129.101.53/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.47.24.101/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.138.115.149/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.129.122.109/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.47.108.109/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.48.102.222/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.129.118.251/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.129.125.198/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.48.7.232/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.48.117.79/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", 
"/ip4/10.138.103.158/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.47.8.255/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/192.168.50.50/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.48.100.107/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.48.107.244/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.148.39.1/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.16.0.112/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.48.29.243/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.129.106.140/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.129.3.119/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.129.99.177/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.129.127.211/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.47.12.124/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.47.105.201/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.129.105.140/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.47.126.51/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.48.119.177/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.129.100.110/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.48.14.92/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.138.118.207/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.129.119.222/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.48.111.103/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.47.126.151/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.48.114.204/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.129.114.127/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.129.103.86/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.48.1.91/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/127.0.0.1/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/100.64.2.1/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/195.144.22.130/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/172.19.13.15/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.129.106.43/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.129.12.140/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.138.0.160/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.47.25.252/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/172.17.0.1/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.47.126.64/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.129.108.73/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.129.100.48/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.48.127.131/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", 
"/ip4/10.47.7.164/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.48.30.196/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.47.21.217/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.129.116.42/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.138.97.174/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.129.123.42/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.148.38.1/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.138.120.80/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.129.11.5/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.129.12.218/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.138.122.240/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.47.117.157/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.148.42.1/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.138.104.66/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.48.122.128/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.129.125.172/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.138.119.101/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.48.22.230/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.47.102.208/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.47.10.13/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/192.168.10.1/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/100.64.0.1/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.48.103.60/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/100.64.1.1/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.138.106.106/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.148.17.1/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.48.118.250/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.138.121.145/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD"} 
  2. Second authority record for 12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD.
    435:authority="15hnyENt1W8FxBBN4UU8EgPyoc6h5rAG43DnRweSRUTKbkaS" peer_id=PeerId("12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD") addresses="Parity Polkadot/v1.10.0-7049c3c9883 (METASPAN (also try POOL #18))" version={"/ip4/195.144.22.130/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.129.27.10/tcp/30333/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.48.12.228/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.148.39.1/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.47.25.230/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.148.38.1/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/172.16.8.1/tcp/30333/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.48.1.152/tcp/30333/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.138.119.101/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/127.0.0.1/tcp/30333/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.129.27.203/tcp/30333/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/172.26.0.1/tcp/30333/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/195.144.22.130/tcp/30333/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.148.41.1/tcp/30333/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.148.40.1/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.148.42.1/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.138.101.232/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/100.64.2.1/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.129.1.138/tcp/30333/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/172.19.13.15/tcp/30333/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.48.103.178/tcp/30333/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.47.122.90/tcp/30333/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.47.99.113/tcp/30333/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/172.19.10.59/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/127.0.0.1/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.47.8.32/tcp/30333/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.16.0.17/tcp/30333/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/192.168.10.1/tcp/30333/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.16.0.13/tcp/30333/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.0.0.2/tcp/30333/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/192.168.10.1/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/100.64.4.1/tcp/30333/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.148.41.1/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.148.17.1/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.129.106.140/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/100.64.1.1/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/172.17.0.1/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/100.64.1.1/tcp/30333/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", 
"/ip4/172.19.10.59/tcp/30333/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/172.19.13.15/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/172.16.16.10/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/10.48.102.222/tcp/30333/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD", "/ip4/172.16.16.10/tcp/30333/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD"} 

Not sure how you ended up with so many records of different IPs in the DHT, but that is definitely the culprit of your issues. I assume the first AuthorityId is the one from before you rotated the keys and the second one is the one from after the rotation, which probably worked until enough addresses accumulated.

Tagging a few more people who know more about networking and infrastructure than me: @dmitry-markin, @BulatSaif, @lexnv, @altonen. This looks like a repeat of https://github.com/paritytech/polkadot-sdk/issues/2523 and https://github.com/paritytech/polkadot-sdk/issues/3519#issuecomment-1994760856.

dcolley commented 4 months ago

Wow, that's a really useful tool! I think the DHT is collating these addresses from various gateways and not pruning them. Could this be some form of DDoS, spamming the DHT via libp2p...?

alexggh commented 4 months ago

Could this be some form of DDoS, spamming the DHT via libp2p...?

The records are signed with your authority ID key, so it is more like DoS-ing yourself. Not sure why your polkadot node decides to publish all those records; does your pod/machine change IPs that often? Could you also post the command line you use for starting your node?

dcolley commented 4 months ago

from previous server (docker):

 polkadot-validator:
    container_name: polkadot-validator-1
    image: parity/polkadot:v1.10.0
    restart: unless-stopped
    ports:
      - "15032:30333"  # p2p port
      - "15033:15033"  # prometheus port
      - "15034:9933"  # rpc port
      - "15035:9944"  # ws port
    volumes:
      - /media/nvme-2tb/polkadot-val-1:/data
    command: [
      "--chain", "polkadot",
      "--validator",
      "--name", "METASPAN (also try POOL #18)",
      "--telemetry-url", "wss://telemetry-backend.w3f.community/submit 1",
      "--base-path", "/data",
      "--database", "paritydb",
      "--pruning", "256",
      # "--sync", "warp",
      "--allow-private-ipv4",
      "--discover-local",
      "--listen-addr", "/ip4/0.0.0.0/tcp/30333",
      "--public-addr", "/ip4/195.144.22.130/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD",
      "--prometheus-port", "15033",
      "--prometheus-external",
      # RPC node
      #"--rpc-external",
      "--rpc-methods", "safe",
      #"--rpc-methods", "unsafe",
      "--rpc-cors", "all",

from the new service file:

#  --log sub-libp2p=debug,gossip=debug,afg=debug \

POLKADOT_CLI_ARGS=\
  --chain polkadot \
  --workers-path /usr/local/bin \
  --base-path /data \
  --sync warp \
  --name "METASPAN (also try POOL #18)" \
  --validator \
  --telemetry-url 'wss://telemetry-backend.w3f.community/submit 1' \
  --telemetry-url 'wss://telemetry.polkadot.io/submit 1' \
  --listen-addr /ip4/0.0.0.0/tcp/30333 \
  --public-addr /ip4/195.144.22.130/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD \
  --prometheus-port 9615 \
  --prometheus-external \
  --state-pruning 256 \
  --database paritydb \
  --allow-private-ipv4

dcolley commented 4 months ago

OK, so these are all private addresses and would never be reachable by external parties. They probably come from the internal docker network. Could this be a symptom of --allow-private-ipv4 inside docker? But should they be published to the DHT?

/ip4/10.0.0.2/tcp/30333/p2p/
/ip4/10.129.1.138/tcp/30333/p2p/
/ip4/10.129.106.140/tcp/15032/p2p/
/ip4/10.129.27.10/tcp/30333/p2p/
/ip4/10.129.27.203/tcp/30333/p2p/
/ip4/10.138.101.232/tcp/15032/p2p/
/ip4/10.138.119.101/tcp/15032/p2p/
/ip4/10.148.17.1/tcp/15032/p2p/
/ip4/10.148.38.1/tcp/15032/p2p/
/ip4/10.148.39.1/tcp/15032/p2p/
/ip4/10.148.40.1/tcp/15032/p2p/
/ip4/10.148.41.1/tcp/15032/p2p/
/ip4/10.148.41.1/tcp/30333/p2p/
/ip4/10.148.42.1/tcp/15032/p2p/
/ip4/10.16.0.13/tcp/30333/p2p/
/ip4/10.16.0.17/tcp/30333/p2p/
/ip4/10.47.122.90/tcp/30333/p2p/
/ip4/10.47.25.230/tcp/15032/p2p/
/ip4/10.47.8.32/tcp/30333/p2p/
/ip4/10.47.99.113/tcp/30333/p2p/
/ip4/10.48.1.152/tcp/30333/p2p/
/ip4/10.48.102.222/tcp/30333/p2p/
/ip4/10.48.103.178/tcp/30333/p2p/
/ip4/10.48.12.228/tcp/15032/p2p/
/ip4/100.64.1.1/tcp/15032/p2p/
/ip4/100.64.1.1/tcp/30333/p2p/
/ip4/100.64.2.1/tcp/15032/p2p/
/ip4/100.64.4.1/tcp/30333/p2p/
/ip4/127.0.0.1/tcp/15032/p2p/
/ip4/127.0.0.1/tcp/30333/p2p/
/ip4/172.16.16.10/tcp/15032/p2p/
/ip4/172.16.16.10/tcp/30333/p2p/
/ip4/172.16.8.1/tcp/30333/p2p/
/ip4/172.17.0.1/tcp/15032/p2p/
/ip4/172.19.10.59/tcp/15032/p2p/
/ip4/172.19.10.59/tcp/30333/p2p/
/ip4/172.19.13.15/tcp/15032/p2p/
/ip4/172.19.13.15/tcp/30333/p2p/
/ip4/172.26.0.1/tcp/30333/p2p/
/ip4/192.168.10.1/tcp/15032/p2p/
/ip4/192.168.10.1/tcp/30333/p2p/

These two could be routable, but only the first one (relating to the --public-addr) is correct. The second address comes from the --listen-addr; should this be published to the DHT?

/ip4/195.144.22.130/tcp/15032/p2p/
/ip4/195.144.22.130/tcp/30333/p2p/

alexggh commented 4 months ago

The --discover-local flag is the one which allows non-global IPs into the DHT.
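
For illustration, here is a sketch of the service-file invocation from above with those two knobs dropped, so only globally routable addresses should be advertised; this is just the flags this thread points at, not a verified fix:

polkadot \
  --chain polkadot \
  --validator \
  --base-path /data \
  --listen-addr /ip4/0.0.0.0/tcp/30333 \
  --public-addr /ip4/195.144.22.130/tcp/15032/p2p/12D3KooWDsTWdT8aFfRELxQ4YeEM1TJPVyWco8azKCRvFJJ5STjD \
  --state-pruning 256 \
  --database paritydb
# note: no --discover-local and no --allow-private-ipv4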

alexggh commented 4 months ago

@dcolley I see your validator is fixed now; can I go ahead and close this issue? Also, for future reference, could you share which CLI knobs you removed/added to get it fixed? Thank you!

dcolley commented 4 months ago

In the end I set up a new validator with a new ID, then transferred the old ID to the new machine and rotated keys. The true test will be when the validator is in the active set next time; I'll monitor that it becomes active and accumulates points from the start. Thanks.