Closed chrisdcosta closed 1 year ago
To solve your problem, you can use the code_substitutes
field in your chain spec. You will need to use the same commit that you used to create the current wasm file that is on chain. Then you only modify the code to remove the call to random_seed
. The resulting wasm file you have to put into your chain spec as code substitute. The block number can be some recent one. Then you will need to go to the council and ask for force overwriting your validation code to the exact same file that you also have put into your chain spec. After this happened on the relay chain, you can send your upgrade call.
@bkchr Thanks.
We have staging on Rococo, so I presume we can do a pre-production test there as this is identical code to production?
Since the fix lasts only as long as the next spec_version
(which would change by normal upgrade) I suppose that we should not increment but instead use the current on-chain spec_version
when we create the new wasm code? Is that correct?
increment but instead use the current on-chain
spec_version
when we create the new wasm code
If you speak about the one that you will put into the chain spec, then yes! The spec version in the code_substitute
should be the same as you currently use on chain.
Yes that is what I meant.
@bkchr one last question. We are running some nodes with docker, can we update them before using the new raw chainspec even though the force overwriting your validation code is not enacted yet? I suppose that makes no difference as the collators will only obey Polkadot? Or do we have to wait until the overwriting has been done?
That second option doesn't make any sense to me, but I thought I'd ask.
...and is the format like this?
"codeSubstitutes": {
"blocknumber": "0x52bc5376....."
}
Changes applied on this branch
for Lego (local rococo), and Kapex (Polkadot).
@bkchr one last question. We are running some nodes with docker, can we update them before using the new raw chainspec even though the force overwriting your validation code is not enacted yet? I suppose that makes no difference as the collators will only obey Polkadot? Or do we have to wait until the overwriting has been done?
That second option doesn't make any sense to me, but I thought I'd ask.
...and is the format like this?
"codeSubstitutes": { "blocknumber": "0x52bc5376....." }
blocknumber
should be a number and not the word blocknumber
. You can already use the new chain spec. As long as you will not send any transaction that runs the problematic code path it will work. Otherwise you will get an error on the relay chain while validating your Parachain block, because the relay chain still uses the old runtime.
@bkchr yes I knew that its was the actual block number that was needed and not the word! The update in the branch where changes were applied used recent block numbers from the 3 networks mentioned.
Whilst i think of it, I put recent parachain blocknumbers - perhaps it should have been a recent relaychain blocknumber?
Again on this point obviously this blocknumber (whichever is correct) will become old and not "recent" by the time the council performs the force overwrite. Does this matter? Should we consider putting a block number in the near future that is likely to be surpassed before the force overwrite so that it would be a "recent" block number?
I have so many questions - apologies because it's not entirely clear for me.
Blocknumber of the Parachain. Just use some recent block number from of your Parachain.
@bkchr could you just review what I did here please.
1) Created a branch from the commit used to create the genesis state applied to Polkadot for the current chain.
2) Commented out the offending line.
3) Recompiled code, and extracted the new chainspec-for-code-sub-wasm.json
, and compared it to the original chainspec-for-wasm.json
extract to check that it was indeed different.
4) made copies of the Lego (Rococo Local Dev Parachain) and Kapex (Production) readable chainspecs, and added the code from chainspec-for-code-sub-wasm.json
into codeSubstitutes:
Note: The chainspec update for Stagex (Public Rococo) will be applied once this branch is merged with Main.
5) Converted the readable specs to raw, renaming original chain spec to "original".
The steps seem right to me, everything follows the guidelines that have been shred. In case you want to double check, be sure that point 1 is using the right commit, I am positive that it is the case. But be sure as if that is not the case, then this won't work as expected :)
If anything, I would like to double click on the BlockNumber
needed for Stagex chainspec, at the moment of writing your height is around #81128
( this will be higher at the moment of reading ). Just make sure you set this right.
The update was performed on Rococo Stagex this morning and was successful. The KAPEX collators are now running the Docker image used for Staging which has migrated to Production. The UI nodes will be migrated shortly.
Good to hear @chrisdcosta!
@bkchr we are still not clear on the next steps and haven't got a clear answer on element.
Is the account that uploaded the parachain code to Polkadot able to execute the paras.forceSetCurrentCode()
on Polkadot or not?
Is the account that uploaded the parachain code to Polkadot able to execute the
paras.forceSetCurrentCode()
on Polkadot or not?
No, it is not. You need to go to the council to get a referendum for doing this as you need root origin for this.
@bkchr although it appears that the upgrade was successful (i.e. we did not get the runtime error when doing authorizeUpgrade()
and enactUpgrade()
in our Staging environment) we discovered that on attempting a further upgrade the storage parachainSystem.pendingValidationCode()
appears not to have been cleared.
The impact is that this prevents execution of enactUpgrade()
for any subsequent upgrade with the error overlappingUpgrades
:
We are of course able to change authorizeUpgrade()
at will.
We are not sure how this could have happened. The error emanates from the Cumulus side of the processing. Any ideas?
We reproduced this in our development relaychain environment as well.
ParaID 2007(Stagex) 2000(Lego)
Both parachain-code-substitution-wasm.wasm
and the hash here runtime_fix_v110.compact.compressed.wasm
are identical by hash.
This is the hash for runtime_fix_v110.compact.compressed.wasm
[
0x1c0c15c2070eac5039736140b657254a09b5d6570860453d07a918c01f02c103
]
Rococo local:
Check the current Validation code by hash: paras.codeByHash()
on Rococo Local Dev Network (only one parachain)
[
[
[
0x1c0c15c2070eac5039736140b657254a09b5d6570860453d07a918c01f02c103
]
0x52bc537646db8e0528b52ffd0058848704ee91c56211501050cc281dc01bbdb01e7ec2ad3947233a21f5ec02c58da344e3b2be55279765528ae5e0059f5ca2cd...207970a7801
]
]
This result agrees with the result on Rococo public.
[
0x52bc537646db8e0528b52ffd0058848704ee91c56211501050cc281dc01bbdb01e7ec2ad3947233a21f5ec02c58da344e3b2be55279765528ae5e0059f5ca2cd...207970a7801
]
Both the local dev Relaychain network and Rococo public are in the same state.
The parachain Stagex has an issue because the storage parachainSystem.pendingValidationCode()
contains the validation code here instead of being empty. This will trigger and 'Overlapping Upgrade' error when any subsequent upgrade occurs.
We have set up a fresh relaychain/parachain to reproduce the error.
It uses Rococo Local and our parachain collator docker image totemlive/totem-parachain-collator:test-codeSubstitute
. This node uses our own sudo key.
Rococo Local Relaychain has been set with paras.forceSetCurrentCode()
for this parachain, as confirmed above and the code substitute occurs at block 50 on the Lego Parachain.
Inspect the state of the parachainSystem.pendingValidationCode()
at each step. The current state is:
[
0x
]
Using the sudo user we execute this on the parachain: sudo.sudoCall(paras.authorizeUpgrade(<parachain-code-substitution-wasm.wasm>))
(pulling the wasm file directly from a local directory)
Check parachainSystem.pendingValidationCode()
after transaction:
[
0x
]
Using the sudo user we execute this on the parachain: sudo.sudoCall(paras.enactUpgrade(<parachain-code-substitution-wasm.wasm>))
Check parachainSystem.pendingValidationCode()
after transaction:
[
0x52bc537646db8e0528b52ffd0058848704ee91c56211501050cc281dc01bbdb01e7ec2ad3947233a21f5ec02c58da344e3b2be55279765528ae5e0059f5ca2cd...207970a7801
]
Now attempting a further upgrade on the parachain: sudo.sudoCall(paras.authorizeUpgrade(<v120_compact_compressed.wasm>))
(a new upgrade pulling the wasm file directly from a local directory)
[
0x786233b426f99e8101c4881976d11f4328d0840f6f716cee41502b56762c9d86
]
This is successful - setting the new expected hash of the coming upgrade.
Check parachainSystem.authorizedUpgrade()
after transaction:
[
0x786233b426f99e8101c4881976d11f4328d0840f6f716cee41502b56762c9d86
]
However, when enacting the upgrade on the parachain: sudo.sudoCall(paras.enactUpgrade(<v120_compact_compressed.wasm>))
fails due to OverlappingUpgrades
error in Cumulus.
Parachain id 2007
(Stagex on Rococo Public, Kapex on Polkadot)
Parachain id 2000
(Lego on Totem Rococo local)
All the following files are in the current Main branch of totem-parachains
Build the runtime from scratch cargo build --release -p totem-parachain-node
. This will include chainspecs so that you can start an individual chain using the arguments chain=stagex
or chain=kapex
.
Current production docker image totemlive/totem-parachain-collator:v1.1.0-codeSubstitute is aligned with the main branch build, including the chainspecs containing the codeSubtitutes mentioned here.
To run the public Rococo version with Stagex you will need to specify the rococo v2 spec added manually into the chain
argument because the polkadot version used here (v0.9.18) was before it was included. This issue is blocking from upgrading to the later version.
Genesis WASM Lego, Stagex and Kapex
Kapex and Lego wasm and state build
Lego original chainspec Genesis State Lego
Kapex original chainspec Genesis State Kapex
Stagex wasm and state build Stagex original raw chainspec Genesis State Stagex
New Validation code wasm for Code Substitution Stagex updated code subtitutes chainspec Kapex updated code subtitutes chainspec - placeholder
There is something I think I could be misunderstanding.
You are executing both
sudo.sudoCall(paras.authorizeUpgrade(<parachain-code-substitution-wasm.wasm>))
sudo.sudoCall(paras.enactUpgrade(<parachain-code-substitution-wasm.wasm>))
btw this call doesn't require root origin
But at that point in time, the code substitution should have already happened, right ?
You have already set the wasm on your chainspec, then waited until the block number you chose, and then executed paras.forceSetCurrentCode()
on the relay with the same wasm that was used on codeSubstitute
.
So why trying to do a runtime upgrade with the same wasm ? Is it not the same one and I am just looking at this the wrong way ?
@al3mart no that was not the procedure that was followed.
I followed the procedure mentioned above in this comment, specifically:
The blocknumber was not set to some future block number - it was set to "some recent blocknumber" meaning a blocknumber that had already passed.
"after that you can send your upgrade call", which I understood to be the upgrade to this new validation code that was used in the paras.forceSetCurrentCode()
What you appear to be saying is that it is a much simpler case:
paras.forceSetCurrentCode()
on the relay chain is all that is required on condition that
the parachain collators are already aware of the new validation code via the chainspec codeSubstitution
the parachain block number on which that changeover occurs is in the past relative to the relaychain blocknumber at which paras.forceSetCurrentCode()
is executed.
the parachain blocknumber selected should actually be a little bit in the future - enough time to update the running nodes.
Then any normal upgrade can occur whereby the spec version is incremented as usual.
codeSubstitution
.paras.forceSetCurrentCode()
on the relay chain with the same code as in 1.Sorry if my comment was confusing. Please, try again as per the last comment instructions and let us know how that went, and if you are observing the same behavior than before.
Background
The SR Labs audit on the Totem Accounting Parachain Pallets highlighted an issue with the illegal use of
sp_io::offchain::random_seed()
in the Accounting Pallet.The audit was conducted after the code head had been uploaded to Polkadot, and after the Kapex Parachain had won a parachain slot.
Our initial reaction was that we could fix this bug with an upgrade.
The bug does not affect block production or collation for the Kapex Parachain, however it does prevent the ability to upgrade the network as described here. The upgrade itself fails because it calls the offending function.
The bug
In a previous iteration the code had been commented, however that comment was lifted in the days before the final head and wasm blob were created for reasons that are lost in the mists of time in April 2022, possibly after transactions testing had already completed.
The severity of the issue was not realised until the audit. It was clear the code was illegal and so corrections were made to the code fixing the bug. This allowed the audit to continue and has now been applied to this branch which was expected to be the first upgrade.
The upgrade process
During the
Call
toparachainSystem -> authorizeUpgrade
theTransaction Payment Pallet
is called to reserve a fee.As the
Transaction Payment Pallet
is non-standard (adapted to record the fee information into the accounting), this call unfortunately reaches the illegal code and would otherwise cause a panic.Conclusion
The
authorizeUpgrade
extrinsic cannot be used to alert the Relaychain to the upcoming upgrade andenactAuthorizedUpgrade
would also fail for the same reason.The Kapex chain whilst producing blocks, is still in its genesis state, due to the existence of a transaction lock which was put in place to ensure that upgrades could be performed before "users were allowed in".
The issue implies a hard fork (new head and wasm blob) to solve.