paritytech / polkadot-sdk

The Parity Polkadot Blockchain SDK
https://polkadot.network/

Digest item must match that calculated #4808

Closed · mustermeiszer closed this issue 3 months ago

mustermeiszer commented 3 months ago


Description of bug

Testnet chains are halting and not producing any blocks anymore.

We recently upgraded the development environments to release-polkadot-v1.7.2. Asynchronous backing is NOT activated. At first the chains were producing blocks just fine. But after submitting a certain transaction the block production stopped.

Relevant Logs



## Versions
Relay-chain: 
* `docker` : `parity/polkadot:v1.7.2`

Parachain:
* `polkadot-sdk` : `releases-polkadot-v1.7.2`

## Possibly Related Issues
Possibly related to: 
* https://github.com/paritytech/substrate/issues/9697
* https://github.com/paritytech/polkadot-sdk/issues/4588

### Steps to reproduce
Needs: Docker, rust-toolchain

* `git clone git@github.com:centrifuge/centrifuge-chain.git`
* `git checkout fix/release-0.11.0/v1`
* `./scripts/init.sh start-relay-chain`
* `./scripts/init.sh start-parachain purge`
* `./scripts/init.sh onboard-parachain`

The above will start 2 relay-chain nodes and one parachain node. The parachain node is afterwards automatically onboarded. 

In order to test the digest issue you must:
* Submit the following extrinsic on the parachain `0x6f00020000000000000000000000000000000000000000000000000000000000000000`
   * through polkadot-js: connect to `https://polkadot.js.org/apps/?rpc=ws%3A%2F%2F127.0.0.1%3A11936#/explorer`
* Look at both relay-chain logs; they will show the `Digest Mismatch` error from above

To shut down the setup:
* quit the terminal window that is running the parachain
* run `./scripts/init.sh stop-relay-chain`
bkchr commented 3 months ago

But after submitting a certain transaction the block production stopped.

What kind of transaction? Can you please link the code?

mustermeiszer commented 3 months ago

We think it is some recursion issue, but I would have thought that this would break the node differently... ^^

bkchr commented 3 months ago

We think it is some recursion issue

Can you be more specific?

Looking at your code, you don't deposit any digest item anywhere? Ahh, but you are using pallet-evm, which puts a digest into the header, AFAIR?
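
For context, a pallet contributes a digest item to the block header via `frame_system`'s `deposit_log`. A minimal hypothetical sketch (the function name and payload are made up; this is not Frontier's actual code):

    use sp_runtime::generic::DigestItem;
    use sp_std::vec::Vec;

    // Hypothetical helper: deposit an opaque digest item into the current
    // block's header. Frontier does something similar with the computed
    // ETH block hash at the end of each block.
    fn note_eth_block_hash<T: frame_system::Config>(payload: Vec<u8>) {
        frame_system::Pallet::<T>::deposit_log(DigestItem::Other(payload));
    }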

mustermeiszer commented 3 months ago

Looking at your code, you don't deposit any digest item anywhere? Ahh, but you are using pallet-evm, which puts a digest into the header, AFAIR?

That is correct, we are using that. The weird thing is that it fails even when no EVM transactions are present...

mustermeiszer commented 3 months ago

Can you be more specific?

Not really. My current approach is to comment out certain code-paths, restart the chain, submit and see if it hangs.

bkchr commented 3 months ago

Hmm, really weird. Looking over the code, I don't really get what is going on. It looks good. Especially weird that this is triggered by a transaction.

If you comment out [this line](https://github.com/centrifuge/centrifuge-chain/blob/4363f8bc3afd0ed04ff562587ef069de3fb7eb4e/pallets/oracle-feed/src/lib.rs#L128) it works?

mustermeiszer commented 3 months ago

If you comment out [this line](https://github.com/centrifuge/centrifuge-chain/blob/4363f8bc3afd0ed04ff562587ef069de3fb7eb4e/pallets/oracle-feed/src/lib.rs#L128) it works?

Yes. It is also just happening on the relay-chain. The parachain happily produces the block without problems.

bkchr commented 3 months ago

The parachain happily produces the block without problems.

The block author is not re-running its own block. Are other parachain nodes able to import the block?

mustermeiszer commented 3 months ago

Will they try without the relay-chain approving it? Haven't checked yet. Can try in the dev environments. Any logs to look out for?

I boiled it down to this code-path: it works without it and fails with it. It seems to be the whole portion of code running there rather than a single line.

bkchr commented 3 months ago

So if you comment out the withdraw it works?

Will they try without the relay-chain approving it? Haven't checked yet. Can try in the dev environments. Any logs to look out for?

With Aura yes, the other nodes should import it. Just check if the other nodes manage to import the block.

How can I reproduce this? Can you tell me exactly what to launch etc? Then I could also look at it.

mustermeiszer commented 3 months ago

So if you comment out the withdraw it works?

No, just the `fee.value::<Self>()`, replacing it with `BalanceOf::<T>::default()`.

mustermeiszer commented 3 months ago

How can I reproduce this? Can you tell me exactly what to launch etc? Then I could also look at it.

Give me 5 minutes, I will adapt the issue for reproduction. Need a new branch as our launch script is out of date.

mustermeiszer commented 3 months ago

@bkchr updated. The above will build our node locally. I then simply comment out certain lines and re-do the procedure. We have integration tests for that code-path, so it is really weird.

mustermeiszer commented 3 months ago

No, just the `fee.value::<Self>()`, replacing it with `BalanceOf::<T>::default()`.

I have to correct myself. The withdraw seems to be the real issue. Looking at the implementation, it short-circuits for zero values, which is what `BalanceOf::<T>::default()` provides. Providing non-zero values results in the error.
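
For reference, a paraphrased sketch of the zero-amount short-circuit in a `Currency::withdraw`-style implementation (reconstructed from memory, not the exact pallet-balances code):

    fn withdraw(who: &AccountId, amount: Balance) -> Result<NegativeImbalance, DispatchError> {
        if amount.is_zero() {
            // Nothing to withdraw: return early without touching storage.
            // Substituting the zero default therefore masked the real problem.
            return Ok(NegativeImbalance::zero());
        }
        // Non-zero amounts actually mutate balances, and thus the state root.
        // ...
    }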

mustermeiszer commented 3 months ago

In the non-local test networks the behaviour is a bit different.

At least this is my current understanding. Here are the log files, both as CSV and as JSON. The CSVs can be seen side by side here:

* dev-centrifuge-collator.csv / dev-centrifuge-collator.json
* dev-centrifuge-fullnode.csv / dev-centrifuge-fullnode.json
* dev-polkadot-validator-0.csv / dev-polkadot-validator-0.json
* dev-polkadot-validator-1.csv / dev-polkadot-validator-1.json

bkchr commented 3 months ago

@mustermeiszer just to keep you updated: I could reproduce this locally with your instructions. I already found out that it is related to the Frontier digest. It looks like the state root is different. I'm continuing to investigate what is going on.

mustermeiszer commented 3 months ago

Thanks for the update!! Let me know if you need anything else.

bkchr commented 3 months ago

https://github.com/centrifuge/centrifuge-chain/pull/1881 the fix for your issue.

Underlying issue

    pub const fn size_of_feed<T: Config>() -> u32 {
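        // size_of is resolved per compilation target, so this constant differs
        // between a 64-bit native build and the wasm32 runtime build.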
        sp_std::mem::size_of::<(T::OracleKey, T::OracleValue, MomentOf<T>)>() as u32
    }

The function is using `mem::size_of` to calculate the size of some types. The problem is that the size is different in native versus wasm: for example, pointer-width types like `usize` are 4 bytes on `wasm32` but 8 bytes on a 64-bit native target, and alignment rules can differ as well, so the same tuple can have a different size (and padding) per target.
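
A minimal illustration of this class of divergence (the tuple here is made up, not Centrifuge's actual types):

    // size_of is evaluated for the compilation target. Pointer-width fields
    // are 4 bytes on wasm32 but 8 bytes on x86_64, so the same tuple has a
    // different size depending on where the code runs.
    fn main() {
        // Prints 8 on wasm32 (4 + 4) and 16 on x86_64 (8 + 4 + 4 padding).
        println!("{}", core::mem::size_of::<(u32, usize)>());
    }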

The problem with Centrifuge was that block production was running in native while the validation runs in wasm. The validation ultimately failed because the storage root calculated by Frontier was different from the one calculated at block production. As Frontier puts the ETH block hash into a digest, the validation failed at comparing the digests.

CC @mustermeiszer

mustermeiszer commented 3 months ago

The validation ultimately failed because the storage root calculated by Frontier was different from the one calculated at block production. As Frontier puts the ETH block hash into a digest, the validation failed at comparing the digests.

Does that mean that, without Frontier, block production would have worked just fine? How would that be possible? Shouldn't the storage root then still be different between the PoV and what the relay chain computes?

bkchr commented 3 months ago

Does that mean that, without Frontier, block production would have worked just fine?

No. Then it would have failed at comparing the `storage_root`. However, as the digests are compared first, it failed there first.
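
For context, a paraphrased sketch of the order of checks at the end of block execution in `frame_executive` (reconstructed from memory, not the verbatim upstream code):

    fn final_checks(header: &Header, new_header: &Header) {
        // Digest items are compared first; a Frontier digest embedding a
        // state-dependent hash fails here when native and wasm disagree.
        for (got, expected) in header.digest().logs().iter().zip(new_header.digest().logs()) {
            assert!(got == expected, "Digest item must match that calculated.");
        }
        // Only afterwards is the storage root compared, which would have
        // caught the same divergence even without Frontier's digest.
        assert!(
            header.state_root() == new_header.state_root(),
            "Storage root must match that calculated."
        );
    }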