Mihajlo-Pavlovic closed this issue 3 months ago.
Hi,
I investigated this issue a bit today. From the collator side everything seems to be okay. However, for some reason the submitted blocks fail to validate on the Rococo validators.
I extracted this from one validator's log:
```
2024-08-06 18:09:12.752 INFO tokio-runtime-worker parachain::candidate-validation: Failed to validate candidate para_id=Id(2043) error=Invalid(WorkerReportedInvalid("execute: Execution aborted due to trap: wasm trap: wasm `unreachable` instruction executed
WASM backtrace:
error while executing at wasm backtrace:
    0: 0x620fb6 - neuroweb_runtime.wasm!rust_begin_unwind
    1: 0xc76a - neuroweb_runtime.wasm!core::panicking::panic_fmt::h8b4135fb81936008
    2: 0x3466d3 - neuroweb_runtime.wasm!core::panicking::panic_display::h287c324f965c61f8
    3: 0x346147 - neuroweb_runtime.wasm!frame_executive::Executive<System,Block,Context,UnsignedValidator,AllPalletsWithSystem,COnRuntimeUpgrade>::apply_extrinsics::{{closure}}::panic_cold_display::h87e29365917e1814
    4: 0x34410c - neuroweb_runtime.wasm!frame_executive::Executive<System,Block,Context,UnsignedValidator,AllPalletsWithSystem,COnRuntimeUpgrade>::execute_block::h307560923d65cd10
    5: 0x594ddd - neuroweb_runtime.wasm!<cumulus_pallet_aura_ext::BlockExecutor<T,I> as frame_support::traits::misc::ExecuteBlock<Block>>::execute_block::h17a9647f9efaccce
    6: 0x1a0440 - neuroweb_runtime.wasm!environmental::local_key::LocalKey<T>::with::hce9d98b3bbf43917
    7: 0x29dbcf - neuroweb_runtime.wasm!cumulus_pallet_parachain_system::validate_block::implementation::validate_block::h69a5a40d817377d2
    8: 0x4a6723 - neuroweb_runtime.wasm!validate_block
    9: 0x6d94dd - neuroweb_runtime.wasm!<wasm function 8475>"))
```
I am not sure exactly what the reason for this validation failure is. Maybe something went wrong with the runtime upgrade? When was it deployed?
The runtime was deployed on 2024-08-05 at 10:25:12 (UTC); this is the block in which the runtime upgrade was applied: https://neuroweb-testnet.subscan.io/block/4238101
You can see that 30 blocks were created after the runtime upgrade was applied: https://neuroweb-testnet.subscan.io/block/4238136, so the upgrade itself had no issue, as block production continued after the upgrade.
We are considering changing the runtime to the previous version using sudo on Rococo to unblock the network, do you think this would help us?
This seems to be the reason the parachain block fails to validate:
```
2024-08-05 11:07:48.598 DEBUG tokio-runtime-worker parachain::pvf: starting execute for /chain-data/chains/rococo_v2_2/paritydb/full/pvf-artifacts/0x60d3f38c17a14de886f39484e258e3e277d4e2b9392a7c20f050dfe73678925eab1395170e9320bdc5bf7f51164c17244852168e1da4b722918b52505e366264.pvf worker_pid=95 worker_dir=WorkerDir { tempdir: TempDir { path: "/chain-data/chains/rococo_v2_2/paritydb/full/pvf-artifacts/worker-dir-execute-wY8kO1" } } validation_code_hash=0x102932e694d206e3f68f2b9e84e1f3a619f482965d22f4ae61ba98f9b065e2ae
2024-08-05T11:07:48.660380Z ERROR runtime: panicked at /Users/nikolatodorovic/.cargo/git/checkouts/polkadot-sdk-cff69157b985ed76/5641e18/substrate/frame/executive/src/lib.rs:693:17:
InvalidTransaction custom error
```
Maybe it helps you figure out why the blocks are invalid.
> The runtime was deployed on 2024-08-05 at 10:25:12 (UTC); this is the block in which the runtime upgrade was applied: https://neuroweb-testnet.subscan.io/block/4238101
> You can see that 30 blocks were created after the runtime upgrade was applied: https://neuroweb-testnet.subscan.io/block/4238136, so the upgrade itself had no issue, as block production continued after the upgrade.
As far as I can tell, everything was fine until https://neuroweb-testnet.subscan.io/block/4238137, which appears unfinalized in Subscan; the latest para head on the Rococo relay chain is that of block https://neuroweb-testnet.subscan.io/block/4238136: 0xf1f71e7d498dbecd671c4a985199398f9d87fb380944184aa594251094b516cb
It seems to have worked fine as long as all the para blocks after the upgrade were empty, until 4238137, which contains a bunch of ethereum:transaction:TransactionV2 extrinsics: https://neuroweb-testnet.subscan.io/extrinsic/4238137-2
> We are considering changing the runtime to the previous version using sudo on Rococo to unblock the network, do you think this would help us?
Yeah setting the old code and head should work.
One more thing to try is to make sure that you are not using the native executor to build your blocks. Recently we had reports of nodes stalling because there was some encoding difference between wasm and native: https://github.com/paritytech/polkadot-sdk/issues/4808
The solution in that case was to switch to full wasm execution, which you should also do.
If that does not help, the next step would be to reproduce this error locally. Recently, mechanisms were introduced to write PoVs to disk and then execute them locally. It has two parts: dumping the PoVs on the collator side, and the `pov-validator` tool in polkadot-sdk, installed with `cargo install --path ./cumulus/bin/pov-validator --locked`. The debugging flow would be to export the failing PoV and then run `pov-validator` with your runtime.

Thank you all for your swift replies!
@skunert to your point, I believe the testnet version is still using the native executor. I can also point to this PR on HydraDX that changed that as well.
Still, given that the fix will take a bit of time and the team needs to swiftly get their parachain back to production, I'd suggest we go with the suggestion of setting the old code, which @sandreim said should work. The questions that I have, given that the chain managed to produce ~30 blocks with the new code, are:
@SBalaguer You should be able to just pass `--execution=wasm`; that should achieve the same for now, without any changes.
Ah missed your questions the first time. I have only limited experience in resetting chains. But from my mental model:
> @SBalaguer You should be able to just pass `--execution=wasm`; that should achieve the same for now, without any changes.
When setting the parameter we get `CLI parameter --execution has no effect anymore and will be removed in the future`, and when trying to use an older binary we cannot start the node because of the following error:
```
2024-08-07 12:36:04 [Parachain] Cannot create a runtime error=Other("runtime requires function imports which are not present on the host: 'env:ext_storage_proof_size_storage_proof_size_version_1'")
2024-08-07 12:36:04 [Parachain] Essential task `transaction-pool-task-0` failed. Shutting down service.
2024-08-07 12:36:04 [Parachain] Essential task `transaction-pool-task-1` failed. Shutting down service.
2024-08-07 12:36:04 [Parachain] Essential task `txpool-background` failed. Shutting down service.
Error: Service(Client(RuntimeApiError(Application(VersionInvalid("Other error happened while constructing the runtime: runtime requires function imports which are not present on the host: 'env:ext_storage_proof_size_storage_proof_size_version_1'")))))
```
So I guess the only option is to change the code.
Should we set the head of the chain to the point before the upgrade happened? Upgrade happened at block #4238062.
When you do this, all the nodes will need to resync, because the blocks after the runtime upgrade are already finalized. We don't support reverting the finalized chain.
Without any more details, I also assume that there is a mismatch between the native runtime and the runtime registered on the relay chain. Just remove the native executor to test this theory; it should be the fastest option. If that doesn't work, you can still reset.
There should also be a `.disable_use_native()` method on `NativeExecutor`. But yeah, best is to not use it at all.
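For reference, a wasm-only setup in the node service looks roughly like the fragment below. This is a sketch, not drop-in code: the exact executor type, host-function generics, and helper names vary between polkadot-sdk versions, so check your template's `service.rs` (the HydraDX PR mentioned above is a concrete instance of this change):

```rust
// Sketch (service.rs fragment, not self-contained): replace the
// native-capable executor type, e.g.
//
//     NativeElseWasmExecutor<ExecutorDispatch>
//
// with a wasm-only executor:
type ParachainExecutor = sc_executor::WasmExecutor<sp_io::SubstrateHostFunctions>;

// ...and construct it from the node Configuration instead of the
// native-else-wasm builder (helper name may differ by SDK version):
// let executor = sc_service::new_wasm_executor(&config);
```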
Just to confirm, we should remove the native executor in the same way as it is done in the HydraDX PR, and just change the binaries to run the new ones and check if that resolves the issue.
> Just to confirm, we should remove the native executor in the same way as it is done in the HydraDX PR, and just change the binaries to run the new ones and check if that resolves the issue.
That is correct.
After you have switched to the new binaries, the collators should start building blocks again. This time, however, WASM was used to build, so there should be no difference from the validation code the validators are running. Blocks should be backed by the relay chain again and your chain will progress.
Thanks for the help guys, this was indeed the solution and collators are now producing blocks on NeuroWeb Testnet!
That is great to hear! Will close here, if they get stuck again (which I don't expect) feel free to reopen.
Description of bug
At NeuroWeb we are experiencing an issue with NeuroWeb Testnet connected to the Rococo relay chain.
As per @skunert's request, we are opening a new issue. Initially, we reported this in issue #1202.
After a runtime upgrade in which we changed dependencies from v0.9.40 to v1.9.0 (https://github.com/OriginTrail/neuroweb/pull/86), the upgrade itself was successful, but after 30+ blocks the chain could not produce a new block.
All collators have block 4238138 as their best block and 4238137 as finalized, but they are constantly trying to create block 4238138 again and are stuck in a loop, with the following logs (log targets `aura::cumulus=trace,parachain=debug,txpool=debug`):

We assume that it is caused by forks on the relay chain (Rococo) which are not handled properly by collators on NeuroWeb Testnet. But we would like to get confirmation and a better understanding of this issue, so we can be sure how to avoid it on NeuroWeb Mainnet, alongside finding a way to unblock NeuroWeb Testnet.
Steps to reproduce
No response