Update to Polkadot v1.1.0

gianfra-t commented 3 months ago

Issue #534.

Changes in Pallets

Removal of Header and BlockNumber in frame_system.
Replacement of type Index with Nonce.
Change the macro use for defining the Default trait of the GenesisConfig struct with a manual definition, to fix an error on the macro construct_runtime when using the previous #[derive(default)] macro.
Default trait has to be defined for all GenesisConfig structs, not just when std feature is enabled.
construct_runtime can be used without defining the components of the module. This is less error prone since hard to debug errors may appear when missing a component. (If we are okay with this change, I can later modify the rest).
Replace the use of the to-be-deprecated GenesisBuild.
New field in pallet_aura config. AllowMultipleBlocksPerSlot should be false if used in normal scenario (not augmented with a custom pallet).
New field in pallet_grandpa MaxNominators. The value is set at 1000, previously this was set by default as 200. This value is used for weight calculations.

Node changes

Changes made using as base the substrate-node-template from the developer hub, at a commit using version 1.0.0.

Most relevant change is how the off-chain worker is spawned. Here is the change for the template between the relevant version.

Other code changes

Modification of parameters of SubxtClientConfig .

rpc_max_response_size is defined as the default value, in megabytes.
rpc_max_request_size is deprecated and has no effect.

Subxt update

subxt had to be updated from 0.25.0 to 0.28.0 due to a conflict with schnorkell package. This is the lowest version that solves that issue. Nevertheless, this new version introduced some breaking changes that have to be handled:

flag substrate-compat has to be used now for many of the functions and exports regarding substrate compatibility.
Pallet and error information of ModuleError are now accessed differently, which impacts this match.
Introduction of dynamic types.. This mostly affects the way in which we replace types define on the metadata with type defined in the code. Types that do not implement EncodeAsType and DecodeAsType now should be wrapped in a Static struct, which is the case with CurrencyId and all the corresponding uses of it on the code. To avoid wrapping as static the type AccountId, we use the AccountId type from subxt.
The last point introduces the need to transform from Keyring to subxt's AccountId, which is done now as in the package tests. Here is an example on our code.
Since the implementation of AccountId from subxt does not have some of the traits required, we need to transform to the sp version of it before using that functionality, a good example is here where we use need sp_runtime version for hashing.

gianfra-t commented 3 months ago

@pendulum-chain/devs keep in mind that for some reason the tests fail on the CI environment but locally they seem to be okay, so that needs to be solved first. If anyone has any idea about that Timeout please let me know!

TorstenStueber commented 3 months ago

Thanks for the detailed description.

construct_runtime can be used without defining the components of the module

I think this is fine as also the Polkadot relaychain runtime defines it that way.

gianfra-t commented 3 months ago

@TorstenStueber Thanks for the comments. Many of the changes you pointed out are temporary (like logs, update to the toolchain, etc) because there is an issue with one of the unit tests that is only failing on ubunutu (or linux in general as far as I could tell), but not in mac-os as I commented here. Sorry the comment was not detailed enough.

One of the tests in question is test_register_vault which I have been using for testing a solution. For now, it seems the solution is to manually finalize blocks to "unstuck" the test. It still requires a bit of exploration to define a more robust workaround.

gianfra-t commented 2 months ago

@pendulum-chain/devs Regarding the failing tests test_register_vaultand test_subxt_processing_events_after_dispatch_error, this is what I know so far:

They started to fail after the update, and failed only on linux based systems.
The failure mode is as follows: After the transaction is sent to the local testnet, we await for it to be included in the finalized block here. This sometime will wait forever, even when the testnet continues to work correctly (apparently, based on logs) and produced new blocks. The test naturally fails due to a Timeout error.
Updating subxt to different versions, up to current 0.33.0 didn't solve the issue.
Updating the dependencies of the node from polkadot-v1.1.0 to 1.5.0 or higher seems to make the error more uncommon for test_register_vault but more common for test_subxt_processing_events_after_dispatch_error. This was all tested locally on an ubuntu (pop-os) desktop. It continues to be random when they fail, so it is hard to make a statistic about the issue.

We know that this must be a local-only issue since integration test are working on both operating systems.

ebma commented 2 months ago

We know that this must be a local-only issue since integration test are working on both operating systems.

If the integration tests are working reliably we must be able to fix this issue somehow as one of the vault integration tests is about registering the vault. I added some changes in the hope that a closer replica of the code used by the integration test would help but this doesn't seem to be the case unfortunately.

TorstenStueber commented 2 months ago

@gianfra-t Understood now why there are all the extra changes. Let's maybe mark this PR as DRAFT as long as this is problem is not resolved.

Regarding the bug: it boils down to not being able to detect that a transaction is finalized? Is it possible to isolate this? E.g., have minimal code that submits the transaction and waits for it to be finalized and not being able to detect on Linux? That way, we might conclude that this is subxt bug and file a bug report.

gianfra-t commented 2 months ago

Yes it boils to that. I think it is more likely the finalized event is not received (maybe even not sent?) rather than not detected on the subxt side. It should be possible to isolate, we may have to simplify the logic such that the code for the transaction and the .await on the transaction finality is all in the test itself for readability. The problem with reproducing this bug is that you must also have the parachain running, not to mention the fact that it fails only "half" the time.

@TorstenStueber regarding the extra temporary changes, I think I removed them in the lasts commits. Maybe I missed some of them. I will later make another pass at the full changes to look for that.

TorstenStueber commented 2 months ago

How do you want to proceed? Testing this in isolation is quite some effort but I think we need to solve this before we can continue (or is it less crucial than it looks to me?) Is there any other less effort way to understand where the bug originates?

gianfra-t commented 2 months ago

I'm not sure how to proceed. Luckily this is not something we see on the integration tests, which is a strong hint that this must be something only related to the test environment. It also makes this less crucial.

I think the less effort way may be to post an issue on the substrate or subxt repo to see if someone has encountered this before or is aware. Alternatively we can try to track where this finalized event is originated and log to see if it really is originated, I tried this but was unable to find the polkadot crate that is responsible for this on the node.

ebma commented 2 months ago

Testing this in isolation is quite some effort but I think we need to solve this before we can continue (or is it less crucial than it looks to me?)

Since the integration tests work, I'd say it's not that crucial and we can assume that the vaults still work properly. So to unblock the further changes for upgrading we could also merge this without having those particular tests fixed and follow up with a new PR. We should do that soon after though.

I think the less effort way may be to post an issue on the substrate or subxt repo to see if someone has encountered this before or is aware.

Sure, we can try. @gianfra-t would you be able to raise that ticket? Maybe someone can point us in the right direction.

b-yap commented 2 months ago

Sure, we can try. @gianfra-t would you be able to raise that ticket?

@gianfra-t just create a ticket, I can pick it up and try to find a fix too.

gianfra-t commented 2 months ago

Sure, I will create that issue today, sorry I didn't yesterday. Do you think it's better to raise it on subxt or polkadot repo?

b-yap commented 2 months ago

@gianfra-t I believe in subxt.

sorry I didn't yesterday.

Aaah don't apologize it's alright!! I also meant creating a ticket here in our repo, so I can check it out in a different light. Continuing from this ticket is too lengthy.

gianfra-t commented 2 months ago

That sounds good. Discussing about the issue and the update is too much here!

gianfra-t commented 2 months ago

@pendulum-chain/devs I submitted the issue here.

ebma commented 2 months ago

Thanks @gianfra-t 👍 might be worth mentioning in the ticket that this only happens on Linux but works on macOS.

gianfra-t commented 2 months ago

I totally forgot about that "little" fact... thanks for reminding me @ebma ! I will edit it at the end.

gianfra-t commented 2 months ago

I have commented out the problematic tests such that we can finish everything only leaving those as the last step. Let me know what you think.

gianfra-t commented 1 month ago

@pendulum-chain/devs I don't think the test will pass with ubuntu. I think we will have to migrate to mac-os in order to solve the issue explored on this pr.

@b-yap I can see that you were able to fix and make the CI work with mac-os, and that the test passed successfully. Yet at least the last attempt with ubuntu failed due to the timeout error.

gianfra-t commented 1 month ago

EDIT: please ignore this comment as we will no longer need this feature, we can use it as in this commit.

@ebma the all-features is used for filtering out the application of the metadata types multiple times when all the features are enabled. Apparently this was not a problem with the previous versions of subxt and it probably took the last one, but after the upgrade we got a collision since it tried to use the types from all the chains on the substitution.

Since when running the test like we generally do with the --all-features command it will as well enable this one. The name does not necessarily needs to match and could be anything, it will just be enabled due to the command as well.

gianfra-t commented 1 month ago

@pendulum-chain/devs I wanted to comment on the test issue we were facing a while ago, and summarize how we are dealing with it at the moment.

As a summary of the attempts and experiments using extra logs in this PR we identify that the issue happens when sealing a block on linux operating systems (especially in low spec CI runners), the creation of a new block and the application of the inherents consumes all the time specified by the whole process of applying extrinsics, therefore ending up hitting the deadline condition before applying a single extrinsic. Please refer to the following example CI run, logs were set before and after each of these operations. We can see that we spend already 4 seconds on each by the timestamps.

Therefore, in order to workaround this issue, we modify this constant to a larger number (in this case, 60 seconds) such that the extrinsics have enough time to be applied. We can then use this forked polkadot-sdk repo only on the CI runs, by just renaming the Cargo.toml that contains the patch during the run.

pendulum-chain / spacewalk