paritytech / polkadot-sdk

The Parity Polkadot Blockchain SDK
https://polkadot.network/

Node massively misses paravalidation votes. #3613

Closed. anignatev closed this issue 6 months ago.

anignatev commented 7 months ago

Is there an existing issue?

Experiencing problems? Have you tried our Stack Exchange first?

Description of bug

Starting from version 1.6.0, the node massively misses paravalidation votes. The problem is present in versions 1.6.0, 1.7.0, 1.7.1, and 1.8.0. We returned to version 1.5.0 and everything works correctly.

Attached are the logs of the validator node: at 18:36 the node became active, running version 1.8.0; at 18:44 we returned to version 1.5.0.


rustup show

Default host: x86_64-unknown-freebsd
rustup home: /root/.rustup

installed toolchains

stable-x86_64-unknown-freebsd (default)
nightly-x86_64-unknown-freebsd

installed targets for active toolchain

wasm32-unknown-unknown
x86_64-unknown-freebsd

active toolchain

stable-x86_64-unknown-freebsd (default)
rustc 1.76.0 (07dca489a 2024-02-04)

P.S. Other Linux-based members have also complained about a similar problem.

polkadot.log.gz

IMG_7557

Steps to reproduce

  1. Compile the version containing the bug from the sources.
  2. Launch it and wait for your node to become active.
  3. Analyze the paravalidation session results with a statistics system (for example: apps.turboflakes.io).
tunkas commented 7 months ago

Yeah, those are the connections on a different protocol, not the ones used for gossiping backed statements.

OK, so we were definitely talking about different peers :)

alexggh commented 7 months ago

The other detail I noticed is that the validator logged two different hashes for the starting block for this session. Looking through the logs, that does happen occasionally. I don't know if that is significant or not. (Note the lines that say "New epoch 37483 launching at...") Any chance that the paravalidator work would be using the old/wrong hash, and therefore unable to participate with peers?

Yes, I noticed the same; the consensus was that it shouldn't matter, but now that it has happened twice I'm not so sure anymore.

Interestingly enough, if you look at session 37531 you will find 3 more nodes in the same situation, so I don't think that's a coincidence anymore.

sandreim commented 7 months ago

Regarding the high TIME_WAIT count: it indicates that the node is terminating a lot of connections. The purpose of this state is to ensure reliability; we make sure to receive the ACK of the connection termination.
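
If you want to put a number on it, here is a minimal, Linux-specific sketch (not from this thread) that counts sockets in TIME_WAIT by parsing /proc/net/tcp; on FreeBSD, `netstat -an | grep TIME_WAIT` gives the same information:

```rust
use std::fs;

/// Count sockets currently in TIME_WAIT by reading the Linux procfs TCP
/// tables. The fourth whitespace-separated column is the socket state,
/// and "06" is TIME_WAIT.
fn count_time_wait() -> usize {
    let mut count = 0;
    for path in ["/proc/net/tcp", "/proc/net/tcp6"] {
        let contents = match fs::read_to_string(path) {
            Ok(c) => c,
            Err(_) => continue, // e.g. file absent or IPv6 disabled
        };
        for line in contents.lines().skip(1) {
            if line.split_whitespace().nth(3) == Some("06") {
                count += 1;
            }
        }
    }
    count
}

fn main() {
    println!("sockets in TIME_WAIT: {}", count_time_wait());
}
```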

validorange commented 7 months ago

Understood about TIME_WAIT. I think the fact that a lot of connections are being terminated might be significant, since it started at the beginning of the bad session and ended as soon as the session was over. Just speculating, but perhaps my validator is repeatedly trying to connect and one or more other validators are repeatedly disconnecting it for some reason.

I wonder if the log message I pointed out in my previous comment is significant? Trying to remove unknown reserved node... I see this repeating in my logs about every 6 hours, which lines up with new authority sets. But it is consistently logged 15-20 seconds before the authority set is applied. Sometimes it is logged just once, other times it logs dozens of times (all for different node hashes).

alexggh commented 7 months ago

Trying to remove unknown reserved node

Yeah, ignore that one; it is a red herring and has been there since forever.

dcolley commented 7 months ago

I left this server running, so the node ID has not changed, and we just got another F when becoming active: https://apps.turboflakes.io/?chain=kusama#/validator/HyLisujX7Cr6D7xzb6qadFdedLt8hmArB6ZVGJ6xsCUHqmx?mode=history

dcolley commented 7 months ago

What stats can I share from prom?

(two Prometheus dashboard screenshots attached)
dcolley commented 7 months ago

Other validators in the group: ANAMIX/03 is on 1.7.1; I can't tell what the other node is running.

(screenshot attached)
alexggh commented 7 months ago

@dcolley ok, what timezone are you in? Session 37627 seems to have started at 12:27:55.811 UTC; is that equivalent to 13:30 -> 14:30 on your dashboard?

Could you please provide the logs as well.

I can't tell what the other node is running

They are running versions with async backing because they are getting points.

dcolley commented 7 months ago

@alexggh The Prometheus dashboard is running on a UK server, and the node is running UTC. I will load the logs tonight and share logs covering all times.

alexggh commented 7 months ago

@alexggh The Prometheus dashboard is running on a UK server, and the node is running UTC. I will load the logs tonight and share logs covering all times.

Ok, that's even worse: it seems like your node's connectivity is really bad for 2h, from 12:30 until 14:30, when it finally reaches the expected peer count.

Also, I don't want to be a nag, but can you double-check the timezone shown in the right-hand corner of your browser? I know that has bitten me several times; Grafana usually uses the timezone of your connecting device. I would have expected 13:30 or 14:30 to be the time when your node enters the active set.

alexggh commented 7 months ago

@dcolley: Can you also get me the values for the below metric around the time this problem happened? Thank you!

substrate_authority_discovery_known_authorities_count{}
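
If it helps, here is a minimal sketch of pulling that metric over a time range via the Prometheus HTTP API (query_range). The localhost:9090 address, the example time window, and the reqwest/serde_json dependencies are assumptions, not part of the original request:

```rust
// Cargo.toml (assumed): reqwest = { version = "0.11", features = ["blocking", "json"] }
//                       serde_json = "1"
use serde_json::Value;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Adjust start/end (unix timestamps) to bracket the problematic session.
    let params = [
        ("query", "substrate_authority_discovery_known_authorities_count"),
        ("start", "1710849600"),
        ("end", "1710856800"),
        ("step", "60"),
    ];
    let client = reqwest::blocking::Client::new();
    let resp: Value = client
        .get("http://localhost:9090/api/v1/query_range")
        .query(&params)
        .send()?
        .json()?;

    // Print (timestamp, value) pairs for each returned series.
    if let Some(series) = resp["data"]["result"].as_array() {
        for s in series {
            if let Some(values) = s["values"].as_array() {
                for v in values {
                    println!("{} {}", v[0], v[1]);
                }
            }
        }
    }
    Ok(())
}
```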
alexggh commented 7 months ago

Ok, I think I figured out why this happens for new nodes entering the active set. Long story short, the node starts advertising its AuthorityId on the DHT only after it becomes active, so the other nodes won't be able to discover it at the beginning of the session. More details about this in: https://github.com/paritytech/polkadot-sdk/pull/3722/files
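
To illustrate the idea only (a simplified sketch; the AuthorityId placeholder and the function name here are hypothetical, the real change is in the linked PR), the set of authorities a node publishes on the DHT would also include the next session's authorities, so a validator becomes discoverable before its first active session:

```rust
/// Hypothetical illustration of the fix: the set of AuthorityIds that a node
/// advertises on the DHT includes the *next* session's authorities as well,
/// not only the current ones, so that freshly activated validators are
/// discoverable from the start of their first active session.
fn authorities_to_publish(
    current: Vec<AuthorityId>,
    next: Vec<AuthorityId>,
) -> Vec<AuthorityId> {
    let mut all = current;
    for id in next {
        if !all.contains(&id) {
            all.push(id); // de-duplicate: an authority may sit in both sets
        }
    }
    all
}

// Placeholder type for this sketch; the real type lives in sp-authority-discovery.
#[derive(PartialEq, Clone, Debug)]
struct AuthorityId(Vec<u8>);

fn main() {
    let current = vec![AuthorityId(vec![1]), AuthorityId(vec![2])];
    let next = vec![AuthorityId(vec![2]), AuthorityId(vec![3])];
    assert_eq!(authorities_to_publish(current, next).len(), 3);
}
```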

dcolley commented 7 months ago

substrate_authority_discovery_known_authorities_count{}

Does this help?

(screenshot of the metric attached)

alexggh commented 7 months ago

Does this help?

Yes, they confirm the findings from here: https://github.com/paritytech/polkadot-sdk/issues/3613#issuecomment-2002450584.

anignatev commented 7 months ago

Hi guys,

There seems to be a problem with version 1.5.0 as well. Our Kusama node went into the active set today, but missed all votes in the first session. Logs are attached.

IMG_7640

kusama_1.5.0.log.gz

alexggh commented 7 months ago

There seems to be a problem with version 1.5.0 as well. Our Kusama node went into the active set today, but missed all votes in the first session. Logs are attached.

Yes, the problem has been there since forever; it will get fixed by https://github.com/paritytech/polkadot-sdk/pull/3722 in the next runtime release for Kusama.

On a side note, there is no reason for you to run 1.5.0 anymore, because it won't help.

anignatev commented 7 months ago

There seems to be a problem with version 1.5.0 as well. Our Kusama node went into the active set today, but missed all votes in the first session. Logs are attached.

Yes, the problem has been there since forever; it will get fixed by #3722 in the next runtime release for Kusama.

On a side note, there is no reason for you to run 1.5.0 anymore, because it won't help.

This is great news! Yes, we have already uncommented the update scripts... Thank you!

dcolley commented 7 months ago

We are running on 1.9.0 and I just got another F. Node ID has not changed and the upgrade was done days ago. We had an A+ since the upgrade, I guess we were not paravalidating for that session. https://apps.turboflakes.io/?chain=kusama#/validator/JKhBBSWkr8BJKh5eFBtRux4hsDq4sAxvvmMU426qUA9aqEQ?mode=history

eskimor commented 7 months ago

We are running on 1.9.0 and I just got another F. Node ID has not changed and the upgrade was done days ago. We had an A+ since the upgrade, I guess we were not paravalidating for that session. https://apps.turboflakes.io/?chain=kusama#/validator/JKhBBSWkr8BJKh5eFBtRux4hsDq4sAxvvmMU426qUA9aqEQ?mode=history

The issue with low grades in the first session after becoming active will be fixed with the 1.2 runtime upgrade.

alexggh commented 6 months ago

The 1.2 runtime got deployed with the fix for this issue; closing it.