paritytech / polkadot-sdk

The Parity Polkadot Blockchain SDK
https://polkadot.network/

Paravalidators authoring blocks but missing all votes #4991

Open SimonKraus opened 2 months ago

SimonKraus commented 2 months ago

Description of bug

This issue has been going on for some months now, on both networks, Polkadot and Kusama.

It has been discussed a few times on Matrix, and @bLd75 posted something related on Stack Exchange.

Problem: validators are proposing relay-chain blocks that are accepted, but the validator does not vote on any parachain blocks (neither implicit nor explicit) and misses all votes, resulting in 0 backing points.

While this is happening there are no unusual metrics at a quick glance, and there is nothing suspicious in the logs.

[Screenshot: KusamaValidatorsNotParavoting]

Currently this affects 20 validators in the active Kusama set, according to the Turboflakes dashboard.

I have experienced this issue multiple times myself and have spoken to many fellow operators; although some things work in some cases (like restarting the node, rotating the keys, or changing key permissions), the issue is still very much unpredictable to me.

Steps to reproduce

unverifiableRandomnessFunction()

alexggh commented 2 months ago

I've been tracking this type of bug down. Some failure modes that we have been observing, in order of impact:

  1. Validators running pre-async-backing polkadot versions. [Nothing can be done here; they will get 0 points. A version self-check sketch follows the list.]
  2. Validators are not reachable by the rest of the validators/collators because their public IP or PeerId changed. [Nothing can be done here; they recover by themselves after the new records propagate through the system.]
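
As a quick self-check for failure mode 1, here is a minimal sketch of querying the running node's version over RPC; it assumes the default local RPC endpoint on port 9944, so adjust the port to your own setup:

    # Ask the running node which binary version it reports (system_version RPC).
    curl -s -H "Content-Type: application/json" \
        -d '{"id":1,"jsonrpc":"2.0","method":"system_version","params":[]}' \
        http://127.0.0.1:9944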

I'm not excluding the possibility of some other elusive bug. Can you provide us with a validator that you control where this happened, and all the logs you can provide around that time (the more the merrier)? This should help us understand whether it is one of the above or maybe something else.

SimonKraus commented 2 months ago
  1. Validators are not reachable by the rest of the validators/collators because their public IP or PeerId changed. [Nothing can be done here; they recover by themselves after the new records propagate through the system.]

I've seen this "automatically resolving" on some nodes without manual intervention. Is there some automatic PeerId renewal that happens without renewing the network secret or restarting the node?

I'm not excluding the possibility of some other elusive bug. Can you provide us with a validator that you control where this happened, and all the logs you can provide around that time (the more the merrier)? This should help us understand whether it is one of the above or maybe something else.

I'll ask Saxemberg, who's currently having these kinds of issues 🙏

Thanks for looking into it

alexggh commented 2 months ago
  1. Validators are not reachable by the rest of the validators/collators because their public IP or PeerId changed. [Nothing can be done here; they recover by themselves after the new records propagate through the system.]

I've seen this "automatically resolving" on some nodes without manual intervention. Is there some automatic PeerId renewal that happens without renewing the network secret or restarting the node?

No, the PeerId can't change without a restart. I'm not sure about the public IP, since that really depends on your setup. Providing the full polkadot startup command line would be of great help as well.
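
If you want to double-check that the PeerId really is pinned to the on-disk network secret, here is a sketch using the key inspect-node-key subcommand; the file path assumes the base path from Simon's unit file and the standard chains/&lt;chain-id&gt;/network layout, so adjust it to your deployment:

    # Print the PeerId derived from the persisted network secret; it should match
    # the "Local node identity is: ..." line the node prints at startup.
    /var/lib/polkadot/polkadot key inspect-node-key \
        --file /var/lib/polkadot/chains/polkadot/network/secret_ed25519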

SimonKraus commented 2 months ago

I can say that in my case the node was not restarted, and an IP change can't happen (we have our own /26 subnet in my colocation rack).

The startup command was pretty basic:

ExecStart=/var/lib/polkadot/polkadot \
     --validator \
     --state-pruning 1000 \
     --blocks-pruning 1000 \
     --database paritydb \
     --base-path /var/lib/polkadot \
     --chain polkadot \
     --prometheus-external 

And it happened on the latest release.

m-saxemberg commented 2 months ago

We want to report the same issue on one of our Kusama validators. The command was similar to Simon's. We tried some of the proposed solutions, like setting new permissions as well as just waiting for the automatic resolution, but they didn't work. We have the chain file available upon request if that's required. We are currently in touch with the data center where our Kusama nodes are colocated, because we suspect a DDoS attack on their infrastructure, which might have something to do with our issues, but we don't have confirmation of that just yet.

alexggh commented 2 months ago

@m-saxemberg To help us understand what is going on can you share:

  1. Link to your validator in app.turboflakes.
  2. All logs you have for the said validator.
  3. What polkadot version you are running.
  4. Anything else you consider relevant regarding operating the validator (restarts, setup, etc.).

paradox-tt commented 2 months ago

Might as well note that I'm seeing the same on my Kusama validators, most recently yesterday. In my case a restart helps.

alexggh commented 2 months ago

@paradox-tt can you share the information requested above? The more logs you have, the better.

CertHum-Jim commented 1 month ago

Seeing this right now on our Kusama validator: Turboflakes

This validator was working fine. Yesterday I rebooted the server (Ubuntu 22.04 LTS), and the service failed to start. I needed to regenerate the network key and used the --unsafe-force-node-key-generation option. I restarted again after that with the flag removed and it started fine. I'll try to get some logs, and I'll leave it misbehaving for now so anyone can take a look.

Edit - adding service file

[Unit]
Description=Polkadot Node

[Service]
User=kusama_service
ExecStart=/var/lib/kusama-data/polkadot \
     --name CertHum-MaxStake-sv-validator-1 \
     --validator \
     --in-peers 50 \
     --out-peers 50 \
     --unsafe-force-node-key-generation \  (this flag is commented out now)
     --trie-cache-size 0 \
     --db-cache 8000 \
     --pruning 256 \
     --public-addr=/ip4/some IP/tcp/some port \
     --listen-addr=/ip4/0.0.0.0/tcp/some port \
     --wasm-execution Compiled \
     --sync warp \
     --rpc-methods=Unsafe \
     --base-path /mnt/data/polkadot \
     --rpc-port some port \
     --prometheus-port some port \
     -lsync=warn,afg=warn,babe=warn \
     --chain=kusama \
     --telemetry-url 'wss://telemetry-backend.w3f.community/submit/ 1'
Restart=always
RestartSec=90

[Install]
WantedBy=multi-user.target

alexggh commented 1 month ago

and the service failed to start. I needed to regenerate the network key and used --unsafe-force-node-key-generation

Getting 0 points after using --unsafe-force-node-key-generation is more or less expected; that's why it is marked as --unsafe. It will take a while (36h), but it should recover by itself once your old identity expires from the network.

Now the question is why you had to use --unsafe-force-node-key-generation; it feels like your network key was not persisted across the restart. Did you check the contents of /mnt/data/polkadot? You should have had the key in there if you restarted the node on the same machine with the same arguments.
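
A minimal sketch of that check, assuming the default layout under the base path from your unit file (the Kusama chain directory is normally ksmcc3):

    # The network secret should survive restarts at this location:
    ls -l /mnt/data/polkadot/chains/ksmcc3/network/
    # If secret_ed25519 is missing, the node cannot reuse its old PeerId, which
    # would explain why it refused to start without --unsafe-force-node-key-generation.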

CertHum-Jim commented 1 month ago

and the service failed to start. I needed to regenerate the network key and used --unsafe-force-node-key-generation

Getting 0 points after using --unsafe-force-node-key-generation is more or less expected; that's why it is marked as --unsafe. It will take a while (36h), but it should recover by itself once your old identity expires from the network.

Now the question is why you had to use --unsafe-force-node-key-generation; it feels like your network key was not persisted across the restart. Did you check the contents of /mnt/data/polkadot? You should have had the key in there if you restarted the node on the same machine with the same arguments.

Here are the logs. Thinking back to yesterday, I don't think I gracefully shut down the service before restarting, so maybe that caused the problems and this is a red herring, because the DB needed to be resynced again when it came up, too. https://pastebin.com/VZMf7Z6B

(Although it still seems weird that a DB corruption would blow away the key too; I did not check the contents of that path before regenerating the key.)

CertHum-Jim commented 1 month ago

Turboflakes

@alexggh this node is still not paravalidating after 72hrs+. Is there a recommended course of action?

mchaffee commented 1 month ago

This has dinged us (LuckyFriday) several times – most recently today. Prepping logs to post.

mchaffee commented 1 month ago

log.txt

SimonKraus commented 1 month ago

Additionally, I don't think that https://github.com/paritytech/polkadot-sdk/commit/6720279fb3aac12ad525785d2366cdf30da4d78c will ultimately solve the issue, as most of the people affected by it DID NOT delete/regenerate their node secret or change their IP.

It seems most likely to happen when a node re-enters from the waiting list (so it frequently affects 1KV validators).

alexggh commented 1 month ago

log.txt

@mchaffee your Polkadot validator https://apps.turboflakes.io/?hain=polkadot#/validator/14AakQ4jAmr2ytcrhfmaiHMpj5F9cR6wK1jRrdfC3N1oTbUz seems to be getting loads of points now. Did you do anything to fix it? I assume just the restarts I see in the logs?

authority="14DsoQ7hnQFPs2waL7MMRFhQJanWtkzkygKuYcrbWcvmy2Ld" peer_id=PeerId("12D3KooWSEQZzFjj7R57NMgLFjphPPT9KVqrd8kiyvf6XBckWUH2") addresses="Parity Polkadot/v1.15.0-743dc632fd6 (🍀LuckyFriday-EU-DOT-01🍀)" version={"/ip4/10.48.113.226/tcp/30330/p2p/12D3KooWSEQZzFjj7R57NMgLFjphPPT9KVqrd8kiyvf6XBckWUH2", "/ip4/10.129.12.148/tcp/30330/p2p/12D3KooWSEQZzFjj7R57NMgLFjphPPT9KVqrd8kiyvf6XBckWUH2", "/ip4/100.64.0.1/tcp/30330/p2p/12D3KooWSEQZzFjj7R57NMgLFjphPPT9KVqrd8kiyvf6XBckWUH2", "/ip4/10.148.14.1/tcp/30330/p2p/12D3KooWSEQZzFjj7R57NMgLFjphPPT9KVqrd8kiyvf6XBckWUH2", "/ip4/10.50.110.6/tcp/30330/p2p/12D3KooWSEQZzFjj7R57NMgLFjphPPT9KVqrd8kiyvf6XBckWUH2", "/ip4/10.138.10.14/tcp/30330/p2p/12D3KooWSEQZzFjj7R57NMgLFjphPPT9KVqrd8kiyvf6XBckWUH2", "/ip4/10.138.97.240/tcp/30330/p2p/12D3KooWSEQZzFjj7R57NMgLFjphPPT9KVqrd8kiyvf6XBckWUH2", "/ip4/10.47.118.255/tcp/30330/p2p/12D3KooWSEQZzFjj7R57NMgLFjphPPT9KVqrd8kiyvf6XBckWUH2", "/ip4/10.16.0.23/tcp/30330/p2p/12D3KooWSEQZzFjj7R57NMgLFjphPPT9KVqrd8kiyvf6XBckWUH2", "/ip4/127.0.0.1/tcp/30330/p2p/12D3KooWSEQZzFjj7R57NMgLFjphPPT9KVqrd8kiyvf6XBckWUH2", "/ip4/10.47.109.61/tcp/30330/p2p/12D3KooWSEQZzFjj7R57NMgLFjphPPT9KVqrd8kiyvf6XBckWUH2", "/ip4/172.19.0.1/tcp/30330/p2p/12D3KooWSEQZzFjj7R57NMgLFjphPPT9KVqrd8kiyvf6XBckWUH2", "/ip4/10.48.3.85/tcp/30330/p2p/12D3KooWSEQZzFjj7R57NMgLFjphPPT9KVqrd8kiyvf6XBckWUH2", "/ip4/172.19.10.59/tcp/30330/p2p/12D3KooWSEQZzFjj7R57NMgLFjphPPT9KVqrd8kiyvf6XBckWUH2", "/ip4/10.48.125.137/tcp/30330/p2p/12D3KooWSEQZzFjj7R57NMgLFjphPPT9KVqrd8kiyvf6XBckWUH2", "/ip4/10.129.101.247/tcp/30330/p2p/12D3KooWSEQZzFjj7R57NMgLFjphPPT9KVqrd8kiyvf6XBckWUH2", "/ip4/100.64.4.1/tcp/30330/p2p/12D3KooWSEQZzFjj7R57NMgLFjphPPT9KVqrd8kiyvf6XBckWUH2", "/ip4/10.138.126.8/tcp/30330/p2p/12D3KooWSEQZzFjj7R57NMgLFjphPPT9KVqrd8kiyvf6XBckWUH2", "/ip4/10.129.23.128/tcp/30330/p2p/12D3KooWSEQZzFjj7R57NMgLFjphPPT9KVqrd8kiyvf6XBckWUH2", "/ip4/10.48.102.139/tcp/30330/p2p/12D3KooWSEQZzFjj7R57NMgLFjphPPT9KVqrd8kiyvf6XBckWUH2", "/ip4/10.129.103.74/tcp/30330/p2p/12D3KooWSEQZzFjj7R57NMgLFjphPPT9KVqrd8kiyvf6XBckWUH2", "/ip4/199.247.24.97/tcp/30330/p2p/12D3KooWSEQZzFjj7R57NMgLFjphPPT9KVqrd8kiyvf6XBckWUH2", "/ip4/10.48.108.27/tcp/30330/p2p/12D3KooWSEQZzFjj7R57NMgLFjphPPT9KVqrd8kiyvf6XBckWUH2", "/ip4/172.16.51.1/tcp/30330/p2p/12D3KooWSEQZzFjj7R57NMgLFjphPPT9KVqrd8kiyvf6XBckWUH2", "/ip4/10.47.127.148/tcp/30330/p2p/12D3KooWSEQZzFjj7R57NMgLFjphPPT9KVqrd8kiyvf6XBckWUH2", "/ip4/10.48.124.121/tcp/30330/p2p/12D3KooWSEQZzFjj7R57NMgLFjphPPT9KVqrd8kiyvf6XBckWUH2", "/ip4/10.138.102.155/tcp/30330/p2p/12D3KooWSEQZzFjj7R57NMgLFjphPPT9KVqrd8kiyvf6XBckWUH2", "/ip4/172.19.13.15/tcp/30330/p2p/12D3KooWSEQZzFjj7R57NMgLFjphPPT9KVqrd8kiyvf6XBckWUH2", "/ip4/10.48.15.136/tcp/30330/p2p/12D3KooWSEQZzFjj7R57NMgLFjphPPT9KVqrd8kiyvf6XBckWUH2", "/ip4/172.23.0.1/tcp/30330/p2p/12D3KooWSEQZzFjj7R57NMgLFjphPPT9KVqrd8kiyvf6XBckWUH2", "/ip4/10.129.116.95/tcp/30330/p2p/12D3KooWSEQZzFjj7R57NMgLFjphPPT9KVqrd8kiyvf6XBckWUH2", "/ip4/10.138.98.169/tcp/30330/p2p/12D3KooWSEQZzFjj7R57NMgLFjphPPT9KVqrd8kiyvf6XBckWUH2", "/ip4/10.47.1.88/tcp/30330/p2p/12D3KooWSEQZzFjj7R57NMgLFjphPPT9KVqrd8kiyvf6XBckWUH2", "/ip4/10.138.96.220/tcp/30330/p2p/12D3KooWSEQZzFjj7R57NMgLFjphPPT9KVqrd8kiyvf6XBckWUH2", "/ip4/10.138.104.66/tcp/30330/p2p/12D3KooWSEQZzFjj7R57NMgLFjphPPT9KVqrd8kiyvf6XBckWUH2", "/ip4/10.48.12.202/tcp/30330/p2p/12D3KooWSEQZzFjj7R57NMgLFjphPPT9KVqrd8kiyvf6XBckWUH2", "/ip4/10.129.117.135/tcp/30330/p2p/12D3KooWSEQZzFjj7R57NMgLFjphPPT9KVqrd8kiyvf6XBckWUH2", 
"/ip4/10.138.127.147/tcp/30330/p2p/12D3KooWSEQZzFjj7R57NMgLFjphPPT9KVqrd8kiyvf6XBckWUH2", "/ip4/10.48.7.62/tcp/30330/p2p/12D3KooWSEQZzFjj7R57NMgLFjphPPT9KVqrd8kiyvf6XBckWUH2", "/ip4/172.29.0.1/tcp/30330/p2p/12D3KooWSEQZzFjj7R57NMgLFjphPPT9KVqrd8kiyvf6XBckWUH2", "/ip4/10.129.97.240/tcp/30330/p2p/12D3KooWSEQZzFjj7R57NMgLFjphPPT9KVqrd8kiyvf6XBckWUH2", "/ip4/10.129.116.128/tcp/30330/p2p/12D3KooWSEQZzFjj7R57NMgLFjphPPT9KVqrd8kiyvf6XBckWUH2", "/ip4/100.64.2.1/tcp/30330/p2p/12D3KooWSEQZzFjj7R57NMgLFjphPPT9KVqrd8kiyvf6XBckWUH2", "/ip4/10.129.125.111/tcp/30330/p2p/12D3KooWSEQZzFjj7R57NMgLFjphPPT9KVqrd8kiyvf6XBckWUH2", "/ip4/10.48.111.171/tcp/30330/p2p/12D3KooWSEQZzFjj7R57NMgLFjphPPT9KVqrd8kiyvf6XBckWUH2", "/ip4/172.22.0.1/tcp/30330/p2p/12D3KooWSEQZzFjj7R57NMgLFjphPPT9KVqrd8kiyvf6XBckWUH2", "/ip4/10.47.19.27/tcp/30330/p2p/12D3KooWSEQZzFjj7R57NMgLFjphPPT9KVqrd8kiyvf6XBckWUH2", "/ip4/100.64.31.1/tcp/30330/p2p/12D3KooWSEQZzFjj7R57NMgLFjphPPT9KVqrd8kiyvf6XBckWUH2", "/ip4/10.47.101.59/tcp/30330/p2p/12D3KooWSEQZzFjj7R57NMgLFjphPPT9KVqrd8kiyvf6XBckWUH2", "/ip4/172.16.16.10/tcp/30330/p2p/12D3KooWSEQZzFjj7R57NMgLFjphPPT9KVqrd8kiyvf6XBckWUH2", "/ip4/10.148.13.1/tcp/30330/p2p/12D3KooWSEQZzFjj7R57NMgLFjphPPT9KVqrd8kiyvf6XBckWUH2", "/ip4/10.138.6.232/tcp/30330/p2p/12D3KooWSEQZzFjj7R57NMgLFjphPPT9KVqrd8kiyvf6XBckWUH2", "/ip4/10.138.125.59/tcp/30330/p2p/12D3KooWSEQZzFjj7R57NMgLFjphPPT9KVqrd8kiyvf6XBckWUH2", "/ip4/10.16.0.43/tcp/30330/p2p/12D3KooWSEQZzFjj7R57NMgLFjphPPT9KVqrd8kiyvf6XBckWUH2", "/ip4/10.47.116.6/tcp/30330/p2p/12D3KooWSEQZzFjj7R57NMgLFjphPPT9KVqrd8kiyvf6XBckWUH2", "/ip4/100.64.1.1/tcp/30330/p2p/12D3KooWSEQZzFjj7R57NMgLFjphPPT9KVqrd8kiyvf6XBckWUH2", "/ip4/192.168.50.50/tcp/30330/p2p/12D3KooWSEQZzFjj7R57NMgLFjphPPT9KVqrd8kiyvf6XBckWUH2", "/ip4/10.48.118.93/tcp/30330/p2p/12D3KooWSEQZzFjj7R57NMgLFjphPPT9KVqrd8kiyvf6XBckWUH2", "/ip4/10.129.122.35/tcp/30330/p2p/12D3KooWSEQZzFjj7R57NMgLFjphPPT9KVqrd8kiyvf6XBckWUH2", "/ip4/10.47.19.238/tcp/30330/p2p/12D3KooWSEQZzFjj7R57NMgLFjphPPT9KVqrd8kiyvf6XBckWUH2", "/ip4/10.50.110.5/tcp/30330/p2p/12D3KooWSEQZzFjj7R57NMgLFjphPPT9KVqrd8kiyvf6XBckWUH2", "/ip4/10.129.119.181/tcp/30330/p2p/12D3KooWSEQZzFjj7R57NMgLFjphPPT9KVqrd8kiyvf6XBckWUH2", "/ip4/10.47.113.53/tcp/30330/p2p/12D3KooWSEQZzFjj7R57NMgLFjphPPT9KVqrd8kiyvf6XBckWUH2", "/ip4/10.129.117.239/tcp/30330/p2p/12D3KooWSEQZzFjj7R57NMgLFjphPPT9KVqrd8kiyvf6XBckWUH2", "/ip4/10.48.113.148/tcp/30330/p2p/12D3KooWSEQZzFjj7R57NMgLFjphPPT9KVqrd8kiyvf6XBckWUH2", "/ip4/10.129.119.90/tcp/30330/p2p/12D3KooWSEQZzFjj7R57NMgLFjphPPT9KVqrd8kiyvf6XBckWUH2", "/ip4/10.48.106.253/tcp/30330/p2p/12D3KooWSEQZzFjj7R57NMgLFjphPPT9KVqrd8kiyvf6XBckWUH2", "/ip4/10.47.10.62/tcp/30330/p2p/12D3KooWSEQZzFjj7R57NMgLFjphPPT9KVqrd8kiyvf6XBckWUH2", "/ip4/172.19.13.51/tcp/30330/p2p/12D3KooWSEQZzFjj7R57NMgLFjphPPT9KVqrd8kiyvf6XBckWUH2", "/ip4/10.48.116.39/tcp/30330/p2p/12D3KooWSEQZzFjj7R57NMgLFjphPPT9KVqrd8kiyvf6XBckWUH2", "/ip4/10.148.12.1/tcp/30330/p2p/12D3KooWSEQZzFjj7R57NMgLFjphPPT9KVqrd8kiyvf6XBckWUH2", "/ip4/10.129.7.110/tcp/30330/p2p/12D3KooWSEQZzFjj7R57NMgLFjphPPT9KVqrd8kiyvf6XBckWUH2"}  num_exceeded 72 raw 8e86a73630766eeb5df898e509269083fca8dafcd6c821bec17717fa0ff97a24

Anyway, I think your problem is caused by the fact that your validator seems to be reporting 73 different IP addresses; that is bound to create problems with connectivity between validators. I'm not sure how you got into this state (for context, no other validator publishes more than 5). The information is probably in the logs from before you joined the active set; I would expect you to see a lot of:

Discovered new external address for our node

FYI @paritytech/networking, maybe you have an idea how the node might have got into this state. I think we have seen this happening on Rococo here: https://github.com/paritytech/polkadot-sdk/issues/3519#issuecomment-1994760856.

Do you by any chance run more than one node? They might be connected to each other using those addresses.

I think that after the restart your node cleared all those accumulated IP addresses, and that's why it is working now.
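
If you want to gauge how often that happens, a rough sketch, assuming the node runs as a systemd unit named polkadot.service and logs to journald:

    # Count how many distinct external addresses the node has announced recently.
    journalctl -u polkadot.service --since "7 days ago" \
        | grep "Discovered new external address for our node" \
        | awk '{print $NF}' | sort -u | wc -l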

alexggh commented 1 month ago

Turboflakes

@alexggh this node is still not paravalidating after 72hrs+. Is there a recommended course of action?

@CertHum-Jim I looked at your Kusama validator, and after running https://github.com/lexnv/subp2p-explorer it seems like I can't reach it:

authority="GCuezmyWaSubhgjWssjvLqo7sXcGJYadW5DfNTsdje9qtrz" peer_id=PeerId("12D3KooWLmo8ohNty4QhFMAyF6a5CKf4jFL5A9kjC56iaWHfCrQa") addresses={"/ip4/15.235.216.1/tcp/11002/p2p/12D3KooWLmo8ohNty4QhFMAyF6a5CKf4jFL5A9kjC56iaWHfCrQa", "/ip4/103.240.197.70/tcp/11002/p2p/12D3KooWLmo8ohNty4QhFMAyF6a5CKf4jFL5A9kjC56iaWHfCrQa", "/ip6/2402:1f00:8001:2201::/tcp/11002/p2p/12D3KooWLmo8ohNty4QhFMAyF6a5CKf4jFL5A9kjC56iaWHfCrQa"} "num 3" raw a09c7441c8db86dafd48e04be28fc6691ecd8d6bf1b32e85db93937b11627125 - Cannot be reached

Given your validator gets 0 backing points and 0 points for authoring relay-chain blocks, I would check whether your validator is connected to and in sync with the relay chain. Could you also provide all the logs that you've got, and I can try to advise based on that?
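
For the connectivity/sync check, a minimal sketch against the node's own RPC, assuming the default local endpoint on port 9944 (adjust to your configured --rpc-port):

    # peers should be non-zero and isSyncing should be false on a healthy validator.
    curl -s -H "Content-Type: application/json" \
        -d '{"id":1,"jsonrpc":"2.0","method":"system_health","params":[]}' \
        http://127.0.0.1:9944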

alexggh commented 1 month ago

Additionally, I don't think that 6720279 will ultimately solve the issue, as most of the people affected by it DID NOT delete/regenerate their node secret or change their IP.

It seems most likely to happen when a node re-enters from the waiting list (so it frequently affects 1KV validators).

Yes, the mentioned commit won't help if the addresses did not change, if they are invalid, or if there is some other bug at play, because it only helps with propagating newer addresses faster.

It seems most likely to happen when a node re-enters from the waiting list (so it frequently affects 1KV validators).

Could you please direct them to this ticket with their logs?

Unfortunately this issue is generic enough and happens occasionally. Anything wrong with a validator will manifest as missing points, so unless we've got logs and specific details about the setup and network conditions at the time it happens, we won't be able to properly root-cause it.

CertHum-Jim commented 1 month ago

Turboflakes @alexggh this node is still not paravalidating after 72hrs+. Is there a recommended course of action?

@CertHum-Jim I looked at your Kusama validator, and after running https://github.com/lexnv/subp2p-explorer it seems like I can't reach it:

authority="GCuezmyWaSubhgjWssjvLqo7sXcGJYadW5DfNTsdje9qtrz" peer_id=PeerId("12D3KooWLmo8ohNty4QhFMAyF6a5CKf4jFL5A9kjC56iaWHfCrQa") addresses={"/ip4/15.235.216.1/tcp/11002/p2p/12D3KooWLmo8ohNty4QhFMAyF6a5CKf4jFL5A9kjC56iaWHfCrQa", "/ip4/103.240.197.70/tcp/11002/p2p/12D3KooWLmo8ohNty4QhFMAyF6a5CKf4jFL5A9kjC56iaWHfCrQa", "/ip6/2402:1f00:8001:2201::/tcp/11002/p2p/12D3KooWLmo8ohNty4QhFMAyF6a5CKf4jFL5A9kjC56iaWHfCrQa"} "num 3" raw a09c7441c8db86dafd48e04be28fc6691ecd8d6bf1b32e85db93937b11627125 - Cannot be reached

Given your validator gets 0 backing points and 0 points for authoring relay-chain blocks, I would check whether your validator is connected to and in sync with the relay chain. Could you also provide all the logs that you've got, and I can try to advise based on that?

The node is online and the chain is synced, but the peer ID seen on p2p is not what is on the node (you can see the correct one in telemetry and as part of 1KV, and you can see when it changed in the pastebin logs from when I originally reported the bug).

Also, in your output, the first IPv4 address and the third are correct. The second one I have never seen before, and it is not from any provider we've ever used; it's really strange that it is being seen on the network.

https://pastebin.com/t3FDFigr

CertHum-Jim commented 1 month ago

Turboflakes @alexggh this node is still not paravalidating after 72hrs+. Is there a recommended course of action?

@CertHum-Jim I looked at your Kusama validator, and after running https://github.com/lexnv/subp2p-explorer it seems like I can't reach it:

authority="GCuezmyWaSubhgjWssjvLqo7sXcGJYadW5DfNTsdje9qtrz" peer_id=PeerId("12D3KooWLmo8ohNty4QhFMAyF6a5CKf4jFL5A9kjC56iaWHfCrQa") addresses={"/ip4/15.235.216.1/tcp/11002/p2p/12D3KooWLmo8ohNty4QhFMAyF6a5CKf4jFL5A9kjC56iaWHfCrQa", "/ip4/103.240.197.70/tcp/11002/p2p/12D3KooWLmo8ohNty4QhFMAyF6a5CKf4jFL5A9kjC56iaWHfCrQa", "/ip6/2402:1f00:8001:2201::/tcp/11002/p2p/12D3KooWLmo8ohNty4QhFMAyF6a5CKf4jFL5A9kjC56iaWHfCrQa"} "num 3" raw a09c7441c8db86dafd48e04be28fc6691ecd8d6bf1b32e85db93937b11627125 - Cannot be reached

Given your validator gets 0 backing points and 0 points for authoring relay-chain blocks, I would check whether your validator is connected to and in sync with the relay chain. Could you also provide all the logs that you've got, and I can try to advise based on that?

Just a little more info: the 103.240.197.70 IP comes from an AS run by Leapswitch, and the block seems to be assigned to Sunday Networks, a Hong Kong company. Traceroutes indicate it is located in India. So I'm not really sure how the same network ID can live in both Singapore and India at the same time.

alexggh commented 1 month ago

The node is online and the chain is synced, but the peer ID seen on p2p is not what is on the node (you can see the correct one in telemetry and as part of 1KV, and you can see when it changed in the pastebin logs from when I originally reported the bug).

What do you mean it is not correct? In both the DHT and your logs I see 12D3KooWLmo8ohNty4QhFMAyF6a5CKf4jFL5A9kjC56iaWHfCrQa, which seems correct to me.

Edit: I see your point; you are saying your local node ID is actually:

Local node identity is: 12D3KooWFsxi9vj5HVMm5iFLS6b5C9JSagmWfFRnAGKSsSjJRTGd

And the DHT shows us the old PeerId that you lost?

Also, in your output, the first IPv4 address and the third are correct. The second one I have never seen before, and it is not from any provider we've ever used; it's really strange that it is being seen on the network.

You are saying this one: "/ip4/103.240.197.70/tcp/11002/p2p/12D3KooWLmo8ohNty4QhFMAyF6a5CKf4jFL5A9kjC56iaWHfCrQa" is not correct, although I see it in your logs; maybe it is some address your node automatically resolves to?

  Discovered new external address for our node: /ip4/103.240.197.70/tcp/11002/p2p/12D3KooWFsxi9vj5HVMm5iFLS6b5C9JSagmWfFRnAGKSsSjJRTGd

@CertHum-Jim

  1. Do you have more logs than that, like everything since the issue started until now (the more the better)?
  2. Can you start your node with these logs enabled? Just append parachain::gossip-support=debug,sub-authority-discovery=debug to this line in your command line: -lsync=warn,afg=warn,babe=warn (see the sketch below).
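
For illustration, the combined logging argument would then look like this (a sketch based on the flags quoted above):

    -lsync=warn,afg=warn,babe=warn,parachain::gossip-support=debug,sub-authority-discovery=debug
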
CertHum-Jim commented 1 month ago

@CertHum-Jim

  1. Do you have more logs than that, like everything since the issue started until now (the more the better)?
  2. Can you start your node with these logs enabled? Just append parachain::gossip-support=debug,sub-authority-discovery=debug to this line in your command line: -lsync=warn,afg=warn,babe=warn

File is here: https://drive.google.com/file/d/1NN3lZyaPdiFu7TIeExjW_VuFeE_eiAL-/view?usp=sharing

Maybe I'm just reading it wrong, but that 103. address sure does seem like it's associated with a lot of peers.

alexggh commented 1 month ago

Maybe I'm just reading it wrong, but that 103. address sure does seem like it's associated with a lot of peers.

Yes, indeed, but I think that's a red herring, because all other nodes work. I think the main problem is why all authorities still see your old PeerId 12D3KooWLmo8ohNty4QhFMAyF6a5CKf4jFL5A9kjC56iaWHfCrQa and not the new one 12D3KooWFsxi9vj5HVMm5iFLS6b5C9JSagmWfFRnAGKSsSjJRTGd; the old records should have expired by now.

I'll try to figure out what is happening tomorrow. Until then, I guess your best bet to recover this validator is to change your authority key by doing a key rotation, which has been found to be helpful in these kinds of cases.
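
For reference, a sketch of the key-rotation step, assuming the node exposes its RPC locally on the default port 9944 and that the author_rotateKeys call is allowed on that interface:

    # Generate fresh session keys inside the node's keystore and print them as hex.
    curl -s -H "Content-Type: application/json" \
        -d '{"id":1,"jsonrpc":"2.0","method":"author_rotateKeys","params":[]}' \
        http://127.0.0.1:9944
    # The returned hex blob then has to be registered on-chain via session.setKeys
    # (for example through the polkadot-js apps UI) before it takes effect.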

CertHum-Jim commented 1 month ago

Maybe I'm just reading it wrong, but that 103. address sure does seem like it's associated with a lot of peers.

Yes, indeed, but I think that's a red herring, because all other nodes work. I think the main problem is why all authorities still see your old PeerId 12D3KooWLmo8ohNty4QhFMAyF6a5CKf4jFL5A9kjC56iaWHfCrQa and not the new one 12D3KooWFsxi9vj5HVMm5iFLS6b5C9JSagmWfFRnAGKSsSjJRTGd; the old records should have expired by now.

I'll try to figure out what is happening tomorrow. Until then, I guess your best bet to recover this validator is to change your authority key by doing a key rotation, which has been found to be helpful in these kinds of cases.

Ok, thanks for looking into it. Also, I hijacked this issue a bit because I think the original report from @SimonKraus occurred without the network ID being changed on the node, so I'm just adding that to put it back on topic (although maybe the root cause is related).

We did have that happen once (where no changes were made, the node just went out of and back into the set) on our TVP Polkadot node, and we found the only remedy was to deploy a new node with a new network key and authority key. But for this instance I'll just rotate and hopefully it will start working.