totem-tech / totem-parachains

Totem Parachain based on Substrate Parachain Template
https://totemaccounting.com
The Unlicense
3 stars 0 forks source link

Kapex Upgrade fails due to bug in Totem fork of Transaction Payment Pallet #5

Closed chrisdcosta closed 1 year ago

chrisdcosta commented 2 years ago

Background

The SR Labs audit on the Totem Accounting Parachain Pallets highlighted an issue with the illegal use of sp_io::offchain::random_seed() in the Accounting Pallet.

The audit was conducted after the code head had been uploaded to Polkadot, and after the Kapex Parachain had won a parachain slot.

Our initial reaction was that we could fix this bug with an upgrade.

The bug does not affect block production or collation for the Kapex Parachain, however it does prevent the ability to upgrade the network as described here. The upgrade itself fails because it calls the offending function.

authorizeUpgrade-extrinisic-error

The bug

In a previous iteration the code had been commented, however that comment was lifted in the days before the final head and wasm blob were created for reasons that are lost in the mists of time in April 2022, possibly after transactions testing had already completed.

The severity of the issue was not realised until the audit. It was clear the code was illegal and so corrections were made to the code fixing the bug. This allowed the audit to continue and has now been applied to this branch which was expected to be the first upgrade.

The upgrade process

During the Call to parachainSystem -> authorizeUpgrade the Transaction Payment Pallet is called to reserve a fee.

As the Transaction Payment Pallet is non-standard (adapted to record the fee information into the accounting), this call unfortunately reaches the illegal code and would otherwise cause a panic.

Conclusion

The authorizeUpgradeextrinsic cannot be used to alert the Relaychain to the upcoming upgrade and enactAuthorizedUpgrade would also fail for the same reason.

The Kapex chain whilst producing blocks, is still in its genesis state, due to the existence of a transaction lock which was put in place to ensure that upgrades could be performed before "users were allowed in".

The issue implies a hard fork (new head and wasm blob) to solve.

bkchr commented 2 years ago

To solve your problem, you can use the code_substitutes field in your chain spec. You will need to use the same commit that you used to create the current wasm file that is on chain. Then you only modify the code to remove the call to random_seed. The resulting wasm file you have to put into your chain spec as code substitute. The block number can be some recent one. Then you will need to go to the council and ask for force overwriting your validation code to the exact same file that you also have put into your chain spec. After this happened on the relay chain, you can send your upgrade call.

chrisdcosta commented 2 years ago

@bkchr Thanks.

We have staging on Rococo, so I presume we can do a pre-production test there as this is identical code to production?

Since the fix lasts only as long as the next spec_version (which would change by normal upgrade) I suppose that we should not increment but instead use the current on-chain spec_version when we create the new wasm code? Is that correct?

bkchr commented 2 years ago

increment but instead use the current on-chain spec_version when we create the new wasm code

If you speak about the one that you will put into the chain spec, then yes! The spec version in the code_substitute should be the same as you currently use on chain.

chrisdcosta commented 2 years ago

Yes that is what I meant.

chrisdcosta commented 2 years ago

@bkchr one last question. We are running some nodes with docker, can we update them before using the new raw chainspec even though the force overwriting your validation code is not enacted yet? I suppose that makes no difference as the collators will only obey Polkadot? Or do we have to wait until the overwriting has been done?

That second option doesn't make any sense to me, but I thought I'd ask.

...and is the format like this?

"codeSubstitutes": {
    "blocknumber": "0x52bc5376....."
}
chrisdcosta commented 2 years ago

Changes applied on this branch

for Lego (local rococo), and Kapex (Polkadot).

bkchr commented 2 years ago

@bkchr one last question. We are running some nodes with docker, can we update them before using the new raw chainspec even though the force overwriting your validation code is not enacted yet? I suppose that makes no difference as the collators will only obey Polkadot? Or do we have to wait until the overwriting has been done?

That second option doesn't make any sense to me, but I thought I'd ask.

...and is the format like this?

"codeSubstitutes": {
    "blocknumber": "0x52bc5376....."
}

blocknumber should be a number and not the word blocknumber. You can already use the new chain spec. As long as you will not send any transaction that runs the problematic code path it will work. Otherwise you will get an error on the relay chain while validating your Parachain block, because the relay chain still uses the old runtime.

chrisdcosta commented 2 years ago

@bkchr yes I knew that its was the actual block number that was needed and not the word! The update in the branch where changes were applied used recent block numbers from the 3 networks mentioned.

Whilst i think of it, I put recent parachain blocknumbers - perhaps it should have been a recent relaychain blocknumber?

Again on this point obviously this blocknumber (whichever is correct) will become old and not "recent" by the time the council performs the force overwrite. Does this matter? Should we consider putting a block number in the near future that is likely to be surpassed before the force overwrite so that it would be a "recent" block number?

I have so many questions - apologies because it's not entirely clear for me.

bkchr commented 2 years ago

Blocknumber of the Parachain. Just use some recent block number from of your Parachain.

chrisdcosta commented 2 years ago

@bkchr could you just review what I did here please.

1) Created a branch from the commit used to create the genesis state applied to Polkadot for the current chain.

2) Commented out the offending line.

3) Recompiled code, and extracted the new chainspec-for-code-sub-wasm.json, and compared it to the original chainspec-for-wasm.json extract to check that it was indeed different.

4) made copies of the Lego (Rococo Local Dev Parachain) and Kapex (Production) readable chainspecs, and added the code from chainspec-for-code-sub-wasm.json into codeSubstitutes:

Note: The chainspec update for Stagex (Public Rococo) will be applied once this branch is merged with Main.

5) Converted the readable specs to raw, renaming original chain spec to "original".

al3mart commented 2 years ago

The steps seem right to me, everything follows the guidelines that have been shred. In case you want to double check, be sure that point 1 is using the right commit, I am positive that it is the case. But be sure as if that is not the case, then this won't work as expected :)

If anything, I would like to double click on the BlockNumber needed for Stagex chainspec, at the moment of writing your height is around #81128 ( this will be higher at the moment of reading ). Just make sure you set this right.

chrisdcosta commented 2 years ago

The upgrade was successful on Rococo local. We used the codeSubstitutes file here and the upgraded the network with the file here.

The netxt test will be performed on Rococo/Stagex network with the same files.

chrisdcosta commented 2 years ago

The update was performed on Rococo Stagex this morning and was successful. The KAPEX collators are now running the Docker image used for Staging which has migrated to Production. The UI nodes will be migrated shortly.

bkchr commented 2 years ago

Good to hear @chrisdcosta!

chrisdcosta commented 2 years ago

@bkchr we are still not clear on the next steps and haven't got a clear answer on element.

Is the account that uploaded the parachain code to Polkadot able to execute the paras.forceSetCurrentCode() on Polkadot or not?

bkchr commented 2 years ago

Is the account that uploaded the parachain code to Polkadot able to execute the paras.forceSetCurrentCode() on Polkadot or not?

No, it is not. You need to go to the council to get a referendum for doing this as you need root origin for this.

chrisdcosta commented 2 years ago

@bkchr although it appears that the upgrade was successful (i.e. we did not get the runtime error when doing authorizeUpgrade() and enactUpgrade() in our Staging environment) we discovered that on attempting a further upgrade the storage parachainSystem.pendingValidationCode() appears not to have been cleared.

Stagex on PolkadotJS Apps

The impact is that this prevents execution of enactUpgrade() for any subsequent upgrade with the error overlappingUpgrades:

image

We are of course able to change authorizeUpgrade() at will.

We are not sure how this could have happened. The error emanates from the Cumulus side of the processing. Any ideas?

chrisdcosta commented 2 years ago

We reproduced this in our development relaychain environment as well.

Data

This is the hash for runtime_fix_v110.compact.compressed.wasm

[
    0x1c0c15c2070eac5039736140b657254a09b5d6570860453d07a918c01f02c103
]

Relay chain state:

Rococo local: Check the current Validation code by hash: paras.codeByHash() on Rococo Local Dev Network (only one parachain)

[
    [
        [
        0x1c0c15c2070eac5039736140b657254a09b5d6570860453d07a918c01f02c103
        ]
        0x52bc537646db8e0528b52ffd0058848704ee91c56211501050cc281dc01bbdb01e7ec2ad3947233a21f5ec02c58da344e3b2be55279765528ae5e0059f5ca2cd...207970a7801
    ]
]

This result agrees with the result on Rococo public.

[
    0x52bc537646db8e0528b52ffd0058848704ee91c56211501050cc281dc01bbdb01e7ec2ad3947233a21f5ec02c58da344e3b2be55279765528ae5e0059f5ca2cd...207970a7801
]

Both the local dev Relaychain network and Rococo public are in the same state.

The parachain Stagex has an issue because the storage parachainSystem.pendingValidationCode() contains the validation code here instead of being empty. This will trigger and 'Overlapping Upgrade' error when any subsequent upgrade occurs.

Steps to reproduce:

We have set up a fresh relaychain/parachain to reproduce the error.

It uses Rococo Local and our parachain collator docker image totemlive/totem-parachain-collator:test-codeSubstitute. This node uses our own sudo key.

Rococo Local Relaychain has been set with paras.forceSetCurrentCode() for this parachain, as confirmed above and the code substitute occurs at block 50 on the Lego Parachain.

Inspect the state of the parachainSystem.pendingValidationCode() at each step. The current state is:

[
    0x
]

Using the sudo user we execute this on the parachain: sudo.sudoCall(paras.authorizeUpgrade(<parachain-code-substitution-wasm.wasm>)) (pulling the wasm file directly from a local directory)

Check parachainSystem.pendingValidationCode() after transaction:

[
    0x
]

Using the sudo user we execute this on the parachain: sudo.sudoCall(paras.enactUpgrade(<parachain-code-substitution-wasm.wasm>))

Check parachainSystem.pendingValidationCode() after transaction:

[
    0x52bc537646db8e0528b52ffd0058848704ee91c56211501050cc281dc01bbdb01e7ec2ad3947233a21f5ec02c58da344e3b2be55279765528ae5e0059f5ca2cd...207970a7801
]

Now attempting a further upgrade on the parachain: sudo.sudoCall(paras.authorizeUpgrade(<v120_compact_compressed.wasm>)) (a new upgrade pulling the wasm file directly from a local directory)

[
    0x786233b426f99e8101c4881976d11f4328d0840f6f716cee41502b56762c9d86
]

This is successful - setting the new expected hash of the coming upgrade.

Check parachainSystem.authorizedUpgrade() after transaction:

[
    0x786233b426f99e8101c4881976d11f4328d0840f6f716cee41502b56762c9d86
]

However, when enacting the upgrade on the parachain: sudo.sudoCall(paras.enactUpgrade(<v120_compact_compressed.wasm>)) fails due to OverlappingUpgrades error in Cumulus.

chrisdcosta commented 2 years ago

Detail on the files

All the following files are in the current Main branch of totem-parachains

Build the runtime from scratch cargo build --release -p totem-parachain-node. This will include chainspecs so that you can start an individual chain using the arguments chain=stagex or chain=kapex .

Docker

Current production docker image totemlive/totem-parachain-collator:v1.1.0-codeSubstitute is aligned with the main branch build, including the chainspecs containing the codeSubtitutes mentioned here.

Rococo / Stagex

To run the public Rococo version with Stagex you will need to specify the rococo v2 spec added manually into the chain argument because the polkadot version used here (v0.9.18) was before it was included. This issue is blocking from upgrading to the later version.

Specific file links (by commit)

Totem Rococo local chainspec

Common

Genesis WASM Lego, Stagex and Kapex

27 April LEGO

Kapex and Lego wasm and state build

Lego original chainspec Genesis State Lego

27 April KAPEX

Kapex original chainspec Genesis State Kapex

10 October STAGEX

Stagex wasm and state build Stagex original raw chainspec Genesis State Stagex

11 October New Validation Code

New Validation code wasm for Code Substitution Stagex updated code subtitutes chainspec Kapex updated code subtitutes chainspec - placeholder

al3mart commented 2 years ago

There is something I think I could be misunderstanding. You are executing both sudo.sudoCall(paras.authorizeUpgrade(<parachain-code-substitution-wasm.wasm>)) sudo.sudoCall(paras.enactUpgrade(<parachain-code-substitution-wasm.wasm>)) btw this call doesn't require root origin

But at that point in time, the code substitution should have already happened, right ? You have already set the wasm on your chainspec, then waited until the block number you chose, and then executed paras.forceSetCurrentCode() on the relay with the same wasm that was used on codeSubstitute.

So why trying to do a runtime upgrade with the same wasm ? Is it not the same one and I am just looking at this the wrong way ?

chrisdcosta commented 2 years ago

@al3mart no that was not the procedure that was followed.

I followed the procedure mentioned above in this comment, specifically:

What you appear to be saying is that it is a much simpler case:

bkchr commented 2 years ago
  1. Put the new validation code with a recent Parachain block number into codeSubstitution.
  2. Issue the paras.forceSetCurrentCode() on the relay chain with the same code as in 1.
  3. Do a normal runtime upgrade.
al3mart commented 2 years ago

Sorry if my comment was confusing. Please, try again as per the last comment instructions and let us know how that went, and if you are observing the same behavior than before.