Open kianenigma opened 2 years ago
Would this make migrations look more or less akin to database migrations for a server backend? I think that should ideally be what we're aiming for here.
> Would this make migrations look more or less akin to database migrations for a server backend? I think that should ideally be what we're aiming for here.
totally.
I really like the idea of migrations not depending on the pallet code, however I'd like to chip in with my 2 cents on the subject, before we go down this particular rabbit hole.
First of all, this will increase the bloat, because we'll need to redefine the structs used for the storage. With one or two structs that's kind of okay; when this becomes more complex, with several nesting levels, it's a different game.
Secondly, it's going to become harder for newcomers to write migrations efficiently, especially when we start talking about more complex scenarios, which I'm going to provide as examples below.
Last, but not least, it's kind of nice to have compile-time guarantees when writing migrations with more or less complicated structs involved, because when we reuse the storage structs from pallets, we know for sure the struct matches the storage. It's not that big of a deal, as each migration needs to be tested anyway, but it still adds to the overall migration testing/implementation burden.
Some examples of migrations that no longer work due to the fact that the code has changed:
1. Staking
```rust
pub mod v8 {
    use super::*;
    use crate::{Config, Nominators, Pallet, Weight};
    use frame_election_provider_support::SortedListProvider;
    use frame_support::traits::Get;

    #[cfg(feature = "try-runtime")]
    pub fn pre_migrate<T: Config>() -> Result<(), &'static str> {
        frame_support::ensure!(
            StorageVersion::<T>::get() == ObsoleteReleases::V7_0_0,
            "must upgrade linearly"
        );
        crate::log!(info, "👜 staking bags-list migration passes PRE migrate checks ✅",);
        Ok(())
    }

    /// Migration to sorted `VoterList`.
    pub fn migrate<T: Config>() -> Weight {
        if StorageVersion::<T>::get() == ObsoleteReleases::V7_0_0 {
            crate::log!(info, "migrating staking to ObsoleteReleases::V8_0_0");
            // We will no longer have the bags-list pallet instance here, so we'd have
            // to reimplement this functionality, which relies on several methods and
            // storage structs that are called under the hood.
            let migrated = T::VoterList::unsafe_regenerate(
                // replicate the `Nominators` struct
                Nominators::<T>::iter().map(|(id, _)| id),
                Pallet::<T>::weight_of_fn(),
            );
            // still need to call this here or in `post_migrate`
            debug_assert_eq!(T::VoterList::try_state(), Ok(()));
            StorageVersion::<T>::put(ObsoleteReleases::V8_0_0);
            crate::log!(
                info,
                "👜 completed staking migration to ObsoleteReleases::V8_0_0 with {} voters migrated",
                migrated,
            );
            T::BlockWeights::get().max_block
        } else {
            T::DbWeight::get().reads(1)
        }
    }

    #[cfg(feature = "try-runtime")]
    pub fn post_migrate<T: Config>() -> Result<(), &'static str> {
        T::VoterList::try_state().map_err(|_| "VoterList is not in a sane state.")?;
        crate::log!(info, "👜 staking bags-list migration passes POST migrate checks ✅",);
        Ok(())
    }
}
```
2. Staking
```rust
pub mod v9 {
    use super::*;
    #[cfg(feature = "try-runtime")]
    use frame_support::codec::{Decode, Encode};
    #[cfg(feature = "try-runtime")]
    use sp_std::vec::Vec;

    /// Migration implementation that injects all validators into sorted list.
    ///
    /// This is only useful for chains that started their `VoterList` just based on nominators.
    pub struct InjectValidatorsIntoVoterList<T>(sp_std::marker::PhantomData<T>);

    impl<T: Config> OnRuntimeUpgrade for InjectValidatorsIntoVoterList<T> {
        fn on_runtime_upgrade() -> Weight {
            if StorageVersion::<T>::get() == ObsoleteReleases::V8_0_0 {
                // need to replicate bags-list storage and some of the `SortedListProvider` methods
                let prev_count = T::VoterList::count();
                let weight_of_cached = Pallet::<T>::weight_of_fn();
                // replicating the `Validators` struct
                for (v, _) in Validators::<T>::iter() {
                    let weight = weight_of_cached(&v);
                    // a bunch more `VoterList` action that needs to be taken into account
                    let _ = T::VoterList::on_insert(v.clone(), weight).map_err(|err| {
                        log!(warn, "failed to insert {:?} into VoterList: {:?}", v, err)
                    });
                }
                log!(
                    info,
                    "injected a total of {} new voters, prev count: {} next count: {}, updating to version 9",
                    Validators::<T>::count(),
                    prev_count,
                    T::VoterList::count(),
                );
                StorageVersion::<T>::put(ObsoleteReleases::V9_0_0);
                T::BlockWeights::get().max_block
            } else {
                log!(
                    warn,
                    "InjectValidatorsIntoVoterList being executed on the wrong storage \
                    version, expected ObsoleteReleases::V8_0_0"
                );
                T::DbWeight::get().reads(1)
            }
        }

        #[cfg(feature = "try-runtime")]
        fn pre_upgrade() -> Result<Vec<u8>, &'static str> {
            frame_support::ensure!(
                StorageVersion::<T>::get() == ObsoleteReleases::V8_0_0,
                "must upgrade linearly"
            );
            let prev_count = T::VoterList::count();
            Ok(prev_count.encode())
        }

        #[cfg(feature = "try-runtime")]
        fn post_upgrade(prev_count: Vec<u8>) -> Result<(), &'static str> {
            // `try_state` call
            let prev_count: u32 = Decode::decode(&mut prev_count.as_slice()).expect(
                "the state parameter should be something that was generated by pre_upgrade",
            );
            let post_count = T::VoterList::count();
            let validators = Validators::<T>::count();
            assert!(post_count == prev_count + validators);
            frame_support::ensure!(
                StorageVersion::<T>::get() == ObsoleteReleases::V9_0_0,
                "must upgrade"
            );
            Ok(())
        }
    }
}
```
In case of replicating, say, BagsList pallet storage and methods we'd have to replicate several pretty intricate methods along with their corresponding structs:
And this is just the tip of the iceberg.
We could also take a look at a simple migration, like that one, which is already quite bloated, and it's going to be double the bloat.
Long story short, I'm not sure that the added complexity and time spent implementing the migrations is worth being pallet-agnostic.
Why? Breaking older migration code is a minor concern; it can be commented out and/or kept as a code example for posterity.
Running prior migrations is not a problem either: just use release branches / WebAssembly code from older versions. It's a bit of a pain, but I don't see this happening often either.
Happy to be proven wrong.
> Running prior migrations is not a problem either: just use release branches / WebAssembly code from older versions. It's a bit of a pain, but I don't see this happening often either.
You are forgetting that Polkadot isn't the only user of Substrate. For staking this may currently be true, but most of the other pallets are being used by a lot of parachains etc. Your complaints are nevertheless correct, but we should work on this. We need to identify common patterns and then provide abstractions/macros/tools to help with that.
https://github.com/paritytech/substrate/issues/13107 tries to abstract the redundant version tracking and logging.
> In case of replicating, say, BagsList pallet storage and methods we'd have to replicate several pretty intricate methods along with their corresponding structs:
Uhf… yeah, that is annoying. Having mutation functions on stored structs makes this much more complicated. I don't see how we could circumvent that besides not using that pattern in the first place. Which is sub-optimal, since it's in itself a good pattern.
> > Running prior migrations is not a problem either: just use release branches / WebAssembly code from older versions. It's a bit of a pain, but I don't see this happening often either.
>
> You are forgetting that Polkadot isn't the only user of Substrate. For staking this may currently be true, but most of the other pallets are being used by a lot of parachains etc. Your complaints are nevertheless correct, but we should work on this. We need to identify common patterns and then provide abstractions/macros/tools to help with that.
For now, if anybody wants to migrate from very old storage, they can use old release branches to gradually run whatever migrations they need; git blame will help them find the right release. I know it's not ideal, but it's definitely doable. At the moment I can't see how macros or other tools could help us deal with, for example, the BagsList case.
> paritytech/substrate#13107 tries to abstract the redundant version tracking and logging.

> > In case of replicating, say, BagsList pallet storage and methods we'd have to replicate several pretty intricate methods along with their corresponding structs:
>
> Uhf… yeah, that is annoying. Having mutation functions on stored structs makes this much more complicated. I don't see how we could circumvent that besides not using that pattern in the first place. Which is sub-optimal, since it's in itself a good pattern.
If I'm being honest, I don't see why the actual library code should suffer for migration convenience. If anything it should be the other way around. Migrations are simply one-off helpers that serve their purpose, once.
It's a bit annoying having to potentially use several different release branches to gradually upgrade, but it can be done and it's not that hard. Maybe we should provide some sort of convenience tooling for that instead?
Now that I re-read the original issue it's a bit more clear to me what the purpose of this is: migration automation.
Still, I can't see how this could be easily achieved: decoupling migrations from pallets is tricky enough, but we also still need to specify the correct storage version of each pallet should anyone start from the current codebase. And if this is wrong and somehow slips into master, that'll potentially be a huge pain, as things would fail while trying to roll those migrations out automatically.
It seems like manual control over migrations, similar to what we have now (having to add them to individual runtimes), might be a better option. It's a bit more work, but it gives more fine-grained control over what's happening, since we need to compile a new runtime anyway in order to introduce any new code.
I've had similar experiences with SQL migrations on various projects. If the migration is complex enough, it might be a multi-step one that needs to be run and monitored manually. Even if it's not that complex, it's almost always a dangerous pattern to let it run automatically or via CI. In our case the automation would come from simply compiling a new runtime with new migrations in place. One other concern: what if we introduce 10 migrations to 10 different pallets? If all of those run back-to-back, that could potentially be a problem.
All of the above concerns might be good to consider, to see how we could mitigate some if not all of them while still providing a reasonable level of automation.
I think we should not settle this yet and, as Basti said, try to find patterns that are common and help migration scripts live longer. One of them is this:
Which means: migrations should not depend on the `trait Config` of your pallet. Using this, your PR which sparked this discussion can be solved fairly easily (https://github.com/paritytech/substrate/pull/13079/commits/2c6275e91d188ccc236ec85760f05ba0d3aff01a), as the only breaking change is the introduction of `ReadOnlySortedListProvider`.

That being said, this is still not perfect, because if, for example, the definition of `SortedListProvider` changes and one of the functions used in the migrations is removed, then we would again be in trouble. That being said, again, that is also solvable.
TLDR; I don't have a plan here, but I am also not at a spot where I can say screw the old migrations and let's move on. If Substrate wants to be the ultimate, user-friendly framework that it aims to be, it must solve this issue, and I am down to put more effort into it.
@ruseinov as per your comments in https://github.com/paritytech/polkadot-sdk/issues/296, these are more about automation, which is the end goal, but not the immediate issue. We first have to make migration scripts reusable, then think about automation, so I will not comment further on that for the time being.
> It's a bit annoying having to potentially use several different release branches to gradually upgrade, but it can be done and it's not that hard. Maybe we should provide some sort of convenience tooling for that instead?
That's not a bad idea nonetheless. But again, if you look at the fixes needed in https://github.com/paritytech/substrate/pull/13079/commits/2c6275e91d188ccc236ec85760f05ba0d3aff01a, it is actually not that much overhead in most cases to keep the migrations sane.
The idea that I am now eager to implement and show to all of you as PoC is this: I will migrate all of the migrations of the nomination pools pallet to be 100% independent of the runtime in some PR, and we can then see what the outcome looks like. If someone else has the time, please go ahead.
Let us take a step back and think about what would be the most convenient and easiest way to deliver migrations for our users:
I think it is the old way of bundling them with the pallet. The pallet knows its current version and the version in storage. It runs all necessary migrations. The runtime just calls into all pallets for migrations on upgrade. Everything just works.
But there are reasons why we are no longer doing it this way. Let me see if I got all of them:
I think we should find ways to address both issues instead of dumping this hugely complicated topic onto our users.
> The idea that I am now eager to implement and show to all of you as PoC is this: I will migrate all of the migrations of the nomination pools pallet to be 100% independent of the runtime in some PR, and we can then see what the outcome looks like. If someone else has the time, please go ahead.
This is not documented in the issue, but I think @kianenigma is not that big of a fan of this idea anymore, as discussed during the FRAME steering meet. Or was it the Staking meet? I don't remember. Anyhow, since we don't have a unified way/language (SQL-like) to deal with this, and a lot of the migrations really need to mimic the business logic of the pallets whose storage they touch, it seems like the best approach is to do what we've done before.
> - Migrations consume space in the runtime's code. We don't want them in the pallets and hence in the productive runtime.
It seems like a good idea to clean up old migrations and let users deal with this by gradually upgrading, if they need to. But yeah, the UX of this is not amazing.
Follow up on the discussion here: https://github.com/paritytech/substrate/pull/10233#discussion_r749441704. Currently, we keep some migration code in the codebase, and even the migrations we do keep are pretty much useless. Given this, our support for historical migrations is basically non-existent.
I propose we enforce all migrations to be self-contained henceforth. This means that they should not depend on anything from the pallet.
If they need a struct, they must duplicate that version of it in their own local scope. If they need a particular storage item, they can mimic it using `generate_storage_alias`.

Next to this, we are moving toward banning `on_runtime_upgrade` inside the pallet (https://github.com/paritytech/substrate/issues/8713). This is aligned with this issue as well. To begin with, migrations should be controlled explicitly by the top-level runtime.

Once we have all of this, I believe we can make migrations fully automated. Each migration will have a `u32` id. Each pallet will have an on-chain `u32` identifier, akin to the current storage version. Upon each runtime upgrade, any migrations that have been added to the repository will be executed. These migrations need to be added to some static registry.
Regardless of whether we realize the last statement or not, for the migration code that we keep in Substrate to have any utility, we had better start by making it independent of the pallet. Some examples of this already exist (`pallet-elections-phragmen`).