Closed alexggh closed 2 months ago
Looks like the PoV is invalid?
These are included candidates and the node raising the dispute because of this reason was the only one voting invalid, so I'm not sure how the PoV can be invalid.
Hello, here is some more of the data you requested:
Version at that time was 1.15.0; right now it is 1.15.1.
Hardware specs: dedicated server with 6c/12t CPU, 64 GB of RAM and 2 x 1 TB NVMe. Bare metal with only Polkadot running on it; I didn't have any issues before.
The command line I'm using to start the node is:
ExecStart=/home/admin/polkadot-sdk/target/release/polkadot --validator --name "Carpediem" --chain=polkadot --database=paritydb --sync=warp --pruning=1000 --telemetry-url 'wss://telemetry-backend.w3f.community/submit/ 1'
I have a script running on the node that restarts polkadot every 60 hours, but it seems that it didn't help and that the node came back to normal by itself. You can check the attached log: out.zip
Looking at the blocks that triggered the disputes in the logs, I don't see anything unusual with them.
First: https://polkadot.js.org/apps/?rpc=wss%3A%2F%2Frpc-interlay.luckyfriday.io%2F#/explorer/query/5913045
Last: https://polkadot.js.org/apps/?rpc=wss%3A%2F%2Frpc-interlay.luckyfriday.io%2F#/explorer/query/0xe474e99c1f2ce25b0bfe89eb6e6ee673ede773d336bd6b5186080d5b4bd346bf
Looking a bit closer at the provided logs, here is a timeline:
The node suddenly starts failing validation for all blocks on parachain 2032:
Aug 16 00:43:07 n36bc68 polkadot[2629030]: 2024-08-16 00:43:06 Failed to validate candidate para_id=Id(2032) error=Invalid(WorkerReportedInvalid(
Last validation failure:
Aug 16 11:45:48 n36bc68 polkadot[2629030]: 2024-08-16 11:45:48 Failed to validate candidate para_id=Id(2032)
Node gets restarted because of this:
Aug 16 11:49:20 n36bc68 polkadot[2629030]: 2024-08-16 11:49:20 🥩 Error: RequestsReceiverStreamClosed. Terminating.
Aug 16 11:49:21 n36bc68 polkadot[2629030]: 2024-08-16 11:49:21 Essential task `beefy-gadget` failed. Shutting down service.
Aug 16 11:49:25 n36bc68 polkadot[2794002]: 2024-08-16 11:49:25 Parity Polkadot
Aug 16 11:49:25 n36bc68 polkadot[2794002]: 2024-08-16 11:49:25 ✌️ version 1.15.0-743dc632fd6
The node recovered and no longer fails to validate parachain 2032.
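For anyone who wants to reproduce this timeline from their own logs, here is a minimal Python sketch. It assumes the journald-style format shown in the excerpts above and that the lines are in chronological order; the function name `summarize_failures` is hypothetical and not part of any Polkadot tooling.

```python
import re

# Matches the node's own timestamp followed by the failure message,
# as it appears in the log excerpts above.
FAILURE_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"Failed to validate candidate para_id=Id\((?P<para>\d+)\)"
)


def summarize_failures(lines):
    """Return {para_id: {"first": ts, "last": ts, "count": n}}.

    Assumes `lines` are in chronological order, so the first match
    seen for a parachain is its earliest failure.
    """
    summary = {}
    for line in lines:
        match = FAILURE_RE.search(line)
        if match is None:
            continue
        para = int(match.group("para"))
        ts = match.group("ts")
        entry = summary.setdefault(para, {"first": ts, "last": ts, "count": 0})
        entry["last"] = ts
        entry["count"] += 1
    return summary
```

Feeding it the lines of the attached log (for example via `open("carpediem.txt")`) should give the first and last failure time per parachain, which makes it easy to see whether failures are confined to a single para_id.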
The node actually recovered because of a restart, and given that all PVFs get recompiled after a restart, I tend to think the problem might have been caused by a corruption of the PVF artefact for parachain 2032, which then led it to fail validation on all candidates for that parachain.
With a zombienet I managed to get the validator into a state where it continuously disputes candidates for a given parachain by manually corrupting its PVF artefact, so from the data we got I'm inclined to think that's what happened here.
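To illustrate why a corrupted artefact can go unnoticed until it is used, and how a stored checksum would catch it, here is a small generic Python sketch. This is only an illustration of the idea under my own assumptions; it is not how polkadot-sdk handles artefacts, and it is not necessarily what issue 5441 proposes. The helper names are hypothetical.

```python
import hashlib


def checksum(data: bytes) -> str:
    """SHA-256 hex digest of the artefact bytes."""
    return hashlib.sha256(data).hexdigest()


def is_intact(data: bytes, expected: str) -> bool:
    """True if the bytes still match the checksum recorded at compile time."""
    return checksum(data) == expected


# Record a checksum when the artefact is produced (placeholder bytes here).
artefact = b"compiled artefact bytes"
recorded = checksum(artefact)

# Flip the first byte to simulate on-disk corruption.
corrupted = bytes([artefact[0] ^ 0xFF]) + artefact[1:]

print(is_intact(artefact, recorded))   # unmodified bytes pass the check
print(is_intact(corrupted, recorded))  # the flipped byte is detected
```

With a check like this before execution, a corrupted artefact could be discarded and recompiled instead of causing every candidate for that parachain to be reported invalid.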
Also, I'm not sure there is any new data that we can obtain at this moment, because once the node restarts the artefacts are gone, so the next steps here would be:
Mitigate and address this corruption possibility with https://github.com/paritytech/polkadot-sdk/issues/5441 or something similar.
@Jsdiem keep an eye on your node. If you see this error coming back (although if the theory is right it is unlikely it will), please archive the parachains and pvf-artifacts folders from the database path and share them with us.
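If it helps, the archiving step can be sketched in Python as below. The database path in the usage note is a placeholder (I don't know where it lives on your machine), and `archive_folders` is a hypothetical helper, not part of the polkadot CLI.

```python
import tarfile
from pathlib import Path


def archive_folders(base_dir, folder_names, out_file):
    """Pack the named sub-folders of base_dir into a gzip-compressed tarball.

    Folders that do not exist are skipped. Returns the names actually archived.
    """
    base = Path(base_dir)
    archived = []
    with tarfile.open(out_file, "w:gz") as tar:
        for name in folder_names:
            folder = base / name
            if folder.is_dir():
                # arcname keeps paths inside the archive relative to base_dir.
                tar.add(str(folder), arcname=name)
                archived.append(name)
    return archived
```

Usage would look like `archive_folders("/path/to/database", ["parachains", "pvf-artifacts"], "dispute-data.tar.gz")`, with the first argument replaced by the real database path. Since the artefacts are gone after a restart, this needs to run while the node is still in the failing state.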
Closing this; the mitigation should be implemented with: https://github.com/paritytech/polkadot-sdk/issues/5441.
The following node was raising a lot of disputes on parachain 2032 because of the error below; full logs attached:
carpediem.txt
Point of contact: https://matrix.to/#/!NZrbtteFeqYKCUGQtr:matrix.parity.io/$17240809840JLZsS:parity.io?via=parity.io&via=corepaper.org&via=matrix.org.
Related to: https://github.com/paritytech/project-mythical/issues/213