paritytech / polkadot-sdk

The Parity Polkadot Blockchain SDK
https://polkadot.network/
1.86k stars 680 forks source link

Polkadot node raising disputes because of `Execution aborted due to trap: wasm trap: wasm `unreachable` #5413

Closed alexggh closed 2 months ago

alexggh commented 2 months ago

The following node was raising a lot of disputes on parachain 2032 because of the bellow error, full logs attached:

Aug 16 01:21:42 n36bc68 polkadot[2629030]: WASM backtrace:
Aug 16 01:21:42 n36bc68 polkadot[2629030]: error while executing at wasm backtrace:
Aug 16 01:21:42 n36bc68 polkadot[2629030]:     0: 0x75b2ad - <unknown>!<wasm function 9243>
Aug 16 01:21:42 n36bc68 polkadot[2629030]:     1: 0x3a2152 - <unknown>!trie_db::trie_codec::decode_compact_from_iter::hb98b612c8c36d847
Aug 16 01:21:42 n36bc68 polkadot[2629030]:     2: 0x410291 - <unknown>!sp_trie::trie_codec::decode_compact::h45c1a6490d0f8abc
Aug 16 01:21:42 n36bc68 polkadot[2629030]:     3: 0x3a1b27 - <unknown>!sp_trie::storage_proof::CompactProof::to_memory_db::h92908fb2b5a194e9
Aug 16 01:21:42 n36bc68 polkadot[2629030]:     4: 0x620d73 - <unknown>!cumulus_pallet_parachain_system::validate_block::implementation::validate_block::hca407a89dd5736f7
Aug 16 01:21:42 n36bc68 polkadot[2629030]:     5: 0x4f605a - <unknown>!validate_block
Aug 16 01:21:42 n36bc68 polkadot[2629030]:     6: 0x7582d3 - <unknown>!<wasm function 8889> artifact_id=ArtifactId { code_hash: 0xff0fd1ba1d31b1f05c66ffff10b83d6f6af6c28c006b2ce79dfad246c92ea0f5, executor_params_prep_hash: 0x50b16a713c0f8774ba3a0757c722b48c6eb4c1972f6e58c54fd0851f21a08633 } worker=Worker(1v1) worker_rip=false
Aug 16 01:21:42 n36bc68 polkadot[2629030]: 2024-08-16 01:21:42 Failed to validate candidate para_id=Id(2032) error=Invalid(WorkerReportedInvalid("execute: Execution aborted due to trap: wasm trap: wasm `unreachable` instruction executed\nWASM backtrace:\nerror while executing at wasm backtrace:\n    0: 0x75b2ad - <unknown>!<wasm function 9243>\n    1: 0x3a2152 - <unknown>!trie_db::trie_codec::decode_compact_from_iter::hb98b612c8c36d847\n    2: 0x410291 - <unknown>!sp_trie::trie_codec::decode_compact::h45c1a6490d0f8abc\n    3: 0x3a1b27 - <unknown>!sp_trie::storage_proof::CompactProof::to_memory_db::h92908fb2b5a194e9\n    4: 0x620d73 - <unknown>!cumulus_pallet_parachain_system::validate_block::implementation::validate_block::hca407a89dd5736f7\n    5: 0x4f605a - <unknown>!validate_block\n    6: 0x7582d3 - <unknown>!<wasm function 8889>"))
Aug 16 01:21:42 n36bc68 polkadot[2629030]: 2024-08-16 01:21:42 Detected invalid candidate as an approval checker. reason=ExecutionError("execute: Execution aborted due to trap: wasm trap: wasm `unreachable` instruction executed\nWASM backtrace:\nerror while executing at wasm backtrace:\n    0: 0x75b2ad - <unknown>!<wasm function 9243>\n    1: 0x3a2152 - <unknown>!trie_db::trie_codec::decode_compact_from_iter::hb98b612c8c36d847\n    2: 0x410291 - <unknown>!sp_trie::trie_codec::decode_compact::h45c1a6490d0f8abc\n    3: 0x3a1b27 - <unknown>!sp_trie::storage_proof::CompactProof::to_memory_db::h92908fb2b5a194e9\n    4: 0x620d73 - <unknown>!cumulus_pallet_parachain_system::validate_block::implementation::validate_block::hca407a89dd5736f7\n    5: 0x4f605a - <unknown>!validate_block\n    6: 0x7582d3 - <unknown>!<wasm function 8889>") candidate_hash=0xb745185b6df2efc515ad8ee0b5551afe2f449592a3a110584eb862e7ebc92b98 para_id=Id(2032) traceID=243607485733499342685943594883350862590
Aug 16 01:21:44 n36bc68 polkadot[2629030]: 2024-08-16 01:21:44 💤 Idle (40 peers), best: #22109111 (0xa153…0270), finalized #22109108 (0x415f…ae66), ⬇ 1.1MiB/s ⬆ 866.4kiB/s

carpediem.txt

Point of contact: https://matrix.to/#/!NZrbtteFeqYKCUGQtr:matrix.parity.io/$17240809840JLZsS:parity.io?via=parity.io&via=corepaper.org&via=matrix.org.

Related to: https://github.com/paritytech/project-mythical/issues/213

bkchr commented 2 months ago

Looks like the PoV is invalid?

alexggh commented 2 months ago

These are included candidates and the node raising the dispute because of this reason was the only one voting invalid, so I'm not sure how the PoV can be invalid.

Jsdiem commented 2 months ago

Hello, here are some more data that you requested: Version at that time was 1.15.0, right now it is 1.15.1. Hardware specs are dedicated server with 6c/12t CPU, 64 GB of RAM and 2 x 1 TB NVME. Bare metal with only Polkadot running on it, didn't have any issues before. Command line I'm using to start a node is: ExecStart=/home/admin/polkadot-sdk/target/release/polkadot --validator --name "Carpediem" --chain=polkadot --database=paritydb --sync=warp --pruning=1000 --telemetry-url 'wss://telemetry-backend.w3f.community/submit/ 1' I have a script running on the node that restarts polkadot every 60 hours, but it seems that it didn't help and that node came back to normal itself. You can check the attached log. out.zip

alexggh commented 2 months ago

From the logs the blocks that triggered the disputes, I don't see anything unsual with them.

First: https://polkadot.js.org/apps/?rpc=wss%3A%2F%2Frpc-interlay.luckyfriday.io%2F#/explorer/query/5913045 Last: https://polkadot.js.org/apps/?rpc=wss%3A%2F%2Frpc-interlay.luckyfriday.io%2F#/explorer/query/0xe474e99c1f2ce25b0bfe89eb6e6ee673ede773d336bd6b5186080d5b4bd346bf

alexggh commented 2 months ago

Look a bit more closer on the provided logs, some timeline:

  1. Node starts out of the sudden to fail validation for all block on parachain 2032

    Aug 16 00:43:07 n36bc68 polkadot[2629030]: 2024-08-16 00:43:06 Failed to validate candidate para_id=Id(2032) error=Invalid(WorkerReportedInvalid(
  2. Last validation failure:

    Aug 16 11:45:48 n36bc68 polkadot[2629030]: 2024-08-16 11:45:48 Failed to validate candidate para_id=Id(2032) 
  3. Node gets restarted because of this:

    Aug 16 11:49:20 n36bc68 polkadot[2629030]: 2024-08-16 11:49:20 🥩 Error: RequestsReceiverStreamClosed. Terminating.
    Aug 16 11:49:21 n36bc68 polkadot[2629030]: 2024-08-16 11:49:21 Essential task `beefy-gadget` failed. Shutting down service.
    Aug 16 11:49:25 n36bc68 polkadot[2794002]: 2024-08-16 11:49:25 Parity Polkadot
    Aug 16 11:49:25 n36bc68 polkadot[2794002]: 2024-08-16 11:49:25 ✌️  version 1.15.0-743dc632fd6
  4. Node recovered and does not fail to validate parachain 2032

Possible root-cause

The node actually recovered because of a restart and given that after restart all PVFs get recompiled I tend to think the problem might have been caused by a corruption of the PVF artefact for parachain 2032 which then lead it to fail validation on all candidates for that parachain.

alexggh commented 2 months ago

With a zombienet I managed to get the validator into a state where it continuously disputes candidates for a given parachain by manually corrupting its PVF artefact, so from the data we got I'm inclined to think that's what happened here.

Also, I'm not sure there is any new data that we can obtained at this moment, because once the node restarts the artefacts are gone, so the next steps here would be:

  1. Mitigate and address this corruption possibility with https://github.com/paritytech/polkadot-sdk/issues/5441 or something similar.

  2. @Jsdiem keep an eye on your node, if you seen this error coming back, although if the theory is right it is unlikely it will, please archive parachains and pvf-artifacts folders from the database path and share them with us.

alexggh commented 2 months ago

Closing this mitigation should be implemented with: https://github.com/paritytech/polkadot-sdk/issues/5441.