Yeah, those are the connections on a different protocol, not the ones used for gossiping backed statements.
OK, so we were definitely talking about different peers :)
The other detail I noticed is that the validator logged two different hashes for the starting block for this session. Looking through the logs, that does happen occasionally. I don't know if that is significant or not. (Note the lines that say "New epoch 37483 launching at...") Any chance that the paravalidator work would be using the old/wrong hash, and therefore unable to participate with peers?
Yes, I noticed the same; the consensus was that it shouldn't matter, but now that it has happened twice I'm not so sure anymore.
Interestingly enough, if you look at session 37531 you will find 3 more nodes in the same situation, so I don't think that's a coincidence anymore.
Regarding the high TIME_WAIT:
It indicates that the node is terminating a lot of connections. The purpose of this state is to ensure reliability: we make sure we receive the ACK of the connection termination.
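For anyone who wants to watch this while it is happening, here is a minimal sketch (assuming a Linux host; the file paths and the `main` wrapper are just for illustration) that counts sockets in TIME_WAIT by parsing the kernel's TCP tables:

```rust
use std::fs;

/// Count sockets in TIME_WAIT by reading the kernel's TCP tables.
/// In /proc/net/tcp{,6} the fourth column is the socket state in hex; 06 = TIME_WAIT.
fn count_time_wait() -> usize {
    let mut total = 0;
    for path in ["/proc/net/tcp", "/proc/net/tcp6"] {
        // A missing table (e.g. IPv6 disabled) simply counts as zero.
        let table = fs::read_to_string(path).unwrap_or_default();
        total += table
            .lines()
            .skip(1) // skip the header row
            .filter(|line| line.split_whitespace().nth(3) == Some("06"))
            .count();
    }
    total
}

fn main() {
    println!("sockets in TIME_WAIT: {}", count_time_wait());
}
```

Running this periodically (or graphing the equivalent node-exporter metric) would show whether the spike really starts and stops with the bad session.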
Understood about TIME_WAIT. I think the fact that a lot of connections are being terminated might be significant, since it started at the beginning of the bad session and ended as soon as the session was over. Just speculating, but perhaps my validator is repeatedly trying to connect and one or more other validators are repeatedly disconnecting it for some reason.
I wonder if the log message I pointed out in my previous comment is significant? Trying to remove unknown reserved node...
I see this repeating in my logs about every 6 hours, which lines up with new authority sets. But it is consistently logged 15-20 seconds before the authority set is applied. Sometimes it is logged just once, other times it logs dozens of times (all for different node hashes).
Trying to remove unknown reserved node
Yeah, ignore that one, it is a red herring; it has been there since forever.
I left this server running, so the node ID has not changed, and we just got another F upon becoming active: https://apps.turboflakes.io/?chain=kusama#/validator/HyLisujX7Cr6D7xzb6qadFdedLt8hmArB6ZVGJ6xsCUHqmx?mode=history
What stats can I share from prom?
Other validators in the group: ANAMIX/03 is on 1.7.1; I can't tell what the other node is running.
@dcolley ok, what timezone are you using? 37627 seems to have started at 12:27:55.811 UTC; is that equivalent to 13:30 -> 14:30 on your dashboard?
Could you please provide the logs as well.
I can't tell what the other node is running
They are running versions with async backing because they are getting points.
@alexggh Prometheus dashboard running on a UK server, and running UTC on the node - will load logs tonight and share logs covering all times.
Ok, that's even worse: it seems like for 2h your node's connectivity is really bad, from 12:30 until 14:30, when it reaches the expected peer count.
Also, I don't want to be a nag, but can you double-check the timezone shown in the right-hand corner of your browser? I know that has bitten me several times; Grafana usually uses the timezone of the connecting device. I would have expected 13:30 or 14:30 to be the time when your node enters the active set.
@dcolley: Can you also get me the values for the below metric around the time this problem happened? Thank you!
substrate_authority_discovery_known_authorities_count{}
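In case it helps, this is one possible way to pull that series for the relevant window out of Prometheus via its HTTP query_range API; the address, timestamps, and step below are placeholders for your own setup, and it uses the reqwest crate (with the "blocking" feature) rather than anything node-specific:

```rust
// A minimal sketch, not part of the node: fetch the metric over a time window
// from Prometheus' HTTP API and print the raw JSON response.
// Requires: reqwest = { version = "0.11", features = ["blocking"] }

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let base = "http://localhost:9090"; // assumed Prometheus address
    let query = "substrate_authority_discovery_known_authorities_count";
    // Unix timestamps bracketing the problematic session (placeholders).
    let (start, end, step) = (1_710_849_600, 1_710_856_800, "60s");

    let url = format!(
        "{base}/api/v1/query_range?query={query}&start={start}&end={end}&step={step}"
    );
    let body = reqwest::blocking::get(&url)?.text()?;
    println!("{body}"); // JSON containing the metric's values over time
    Ok(())
}
```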
Ok, I think I figured out why this happens for new nodes entering the active set. Long story short, the node starts advertising its AuthorityId on the DHT only after it becomes active, so the other nodes are not able to detect it at the beginning of the session. More details about this in: https://github.com/paritytech/polkadot-sdk/pull/3722/files
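To make the failure mode concrete, here is a purely illustrative Rust sketch; the types and function names are invented for this example and are not the actual polkadot-sdk authority-discovery code. It contrasts the old "publish only once active" behaviour with a "publish one session ahead" behaviour of the kind the linked PR introduces:

```rust
// Illustrative only: invented types, not the polkadot-sdk API.
#[derive(Clone, PartialEq)]
struct AuthorityId(&'static str);

struct SessionAuthorities {
    current: Vec<AuthorityId>,
    next: Vec<AuthorityId>,
}

/// Old behaviour: the address record is published on the DHT only once the
/// node is in the current authority set, so peers cannot find it at the
/// start of its first active session.
fn should_publish_before_fix(me: &AuthorityId, s: &SessionAuthorities) -> bool {
    s.current.contains(me)
}

/// After the change: also publish while in the *next* set, so the record is
/// already discoverable when the first active session begins.
fn should_publish_after_fix(me: &AuthorityId, s: &SessionAuthorities) -> bool {
    s.current.contains(me) || s.next.contains(me)
}

fn main() {
    let me = AuthorityId("new-validator");
    // One session before the node becomes active.
    let s = SessionAuthorities { current: vec![], next: vec![me.clone()] };
    assert!(!should_publish_before_fix(&me, &s));
    assert!(should_publish_after_fix(&me, &s));
    println!("record is on the DHT one session early after the fix");
}
```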
substrate_authority_discovery_known_authorities_count{}
Does this help?
Yes, they confirm the findings from here: https://github.com/paritytech/polkadot-sdk/issues/3613#issuecomment-2002450584.
Hi guys,
There seems to be a problem with version 1.5.0 as well. Our Kusama node went into active set today, but missed all votes in first session. Logs are attached.
Yes, the problem has been there since forever; it will get fixed with https://github.com/paritytech/polkadot-sdk/pull/3722 in the next runtime release for Kusama.
On a side note, there is no reason for you to run 1.5.0 anymore, because it won't help.
This is great news! Yes, we have already uncommented the update scripts... Thank you!
We are running on 1.9.0 and I just got another F. The node ID has not changed and the upgrade was done days ago. We had an A+ since the upgrade; I guess we were not paravalidating for that session. https://apps.turboflakes.io/?chain=kusama#/validator/JKhBBSWkr8BJKh5eFBtRux4hsDq4sAxvvmMU426qUA9aqEQ?mode=history
The issue with low grades in the first session after becoming active will be fixed with the 1.2 runtime upgrade.
The 1.2 runtime got deployed with the fix for this issue; closing it.
Is there an existing issue?
Experiencing problems? Have you tried our Stack Exchange first?
Description of bug
Starting from version 1.6.0, the node massively misses paravalidation votes. The problem is present in versions 1.6.0, 1.7.0, 1.7.1, and 1.8.0. We returned to version 1.5.0 and everything works correctly.
Attached are the logs of the validator node: at 18:36 the node became active, with version 1.8.0 running; at 18:44 we returned to version 1.5.0.
rustup show
Default host: x86_64-unknown-freebsd
rustup home: /root/.rustup
installed toolchains
stable-x86_64-unknown-freebsd (default)
nightly-x86_64-unknown-freebsd
installed targets for active toolchain
wasm32-unknown-unknown
x86_64-unknown-freebsd
active toolchain
stable-x86_64-unknown-freebsd (default)
rustc 1.76.0 (07dca489a 2024-02-04)
P.S. Other Linux-based members have also complained about a similar problem.
polkadot.log.gz
Steps to reproduce