golang-migrate / migrate

Database migrations. CLI and Golang library.

New Command Line Option: Check If Migration Is Needed #556

Open SJrX opened 3 years ago

SJrX commented 3 years ago

Is your feature request related to a problem? Please describe. Our deployment process is fairly hands-off for individual services, and the services themselves are responsible for migrations. We are using Mongo, so we don't have to write many migrations, mainly just for creating collections with specific options or modifying indexes, and our migrations are pretty much always backwards compatible. One issue we have, however, is that when a service starts up to perform its migrations and the database is already at a newer version, it fails to start.

This can happen for a few reasons:

  1. A newer version of the service is currently migrating data.
  2. We had to rollback to a previous version.

Right now, the previous version fails to start.

This is closely related to another feature request I made, #468. I did take a stab at that one, but I feel it would be too hard for me to implement without a rearchitecture (since golang-migrate doesn't seem to know how many migrations there are to perform).

Describe the solution you'd like What would be nice is a new CLI command, check or uptodate or similar, that returns 0 if no migration is needed and 1 if one is. We could also have additional options such as:

allow-newer: if set, return 0 even when the database is at a newer version than the source.
allow-dirty: if set, return 0 even when the database is currently locked.

This would allow us to simply run this command before running up, as a two-step process. Obviously there is no lock here and there is a race condition, but for our purposes that is fine since the database is always newer.
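To make the proposal concrete, here is a rough Go sketch of the decision such a check command would boil down to. Everything here is hypothetical (the function and flag names mirror the proposal above; nothing like this exists in migrate today), and the dirty flag stands in for "another migration is in flight":

package check // hypothetical package, for illustration only

// needsMigration captures the exit-code decision the proposed `check`
// command would make: true means "run `up`" (exit 1), false means the
// database is considered up to date (exit 0). allowNewer and allowDirty
// mirror the hypothetical flags described above.
func needsMigration(applied, latestSource uint, dirty, allowNewer, allowDirty bool) bool {
	if dirty {
		// A dirty/locked database only counts as "up to date" if the
		// caller explicitly opted in with allow-dirty.
		return !allowDirty
	}
	if applied > latestSource {
		// The database is ahead of the migrations this binary ships with.
		return !allowNewer
	}
	return applied < latestSource
}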

Describe alternatives you've considered

Doing things in the application could work without golang-migrate, but the issue is that some calls to Mongo, at least in older versions (I haven't tested with 4.4), do not have a happens-before relationship with completion. My experimentation found that, with lots of data, if multiple processes ask to create a unique index, the first one blocks until the index is built and/or fails, while the second one returns right away, so the application believes the index is built when it isn't.

Additional context This is something that I think could be built without a massive change to the structure of the code. I just wanted feedback before implementing the feature, and to hear whether better terms or options are preferred, etc.

dhui commented 3 years ago

Reading the latest applied version doesn't require a lock. Locking is only required when updating the applied version.
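For reference, a minimal sketch of reading the applied version through the Go library rather than the CLI. The Mongo DSN and migrations path are placeholders; ErrNilVersion is what Version() returns when nothing has been applied yet:

package main

import (
	"errors"
	"log"

	"github.com/golang-migrate/migrate/v4"
	_ "github.com/golang-migrate/migrate/v4/database/mongodb" // db driver used in this thread
	_ "github.com/golang-migrate/migrate/v4/source/file"
)

func main() {
	// Placeholder URLs; substitute the real migrations path and connection string.
	m, err := migrate.New("file://db/migrations", "mongodb://localhost:27017/app")
	if err != nil {
		log.Fatal(err)
	}
	defer m.Close()

	version, dirty, err := m.Version()
	switch {
	case errors.Is(err, migrate.ErrNilVersion):
		log.Println("no migration applied yet")
	case err != nil:
		log.Fatal(err)
	default:
		log.Printf("applied version=%d dirty=%v", version, dirty)
	}
}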

There's no function or CLI command to get the latest version from the source. Was your plan to keep calling Next() until you hit an error? Or were you planning on adding a Last() method to source.Driver?

I'm against exposing a way to get the latest source version (via a function/method and CLI command) since it increases the scope and complexity of this project and there are other ways and tools to get the latest source version for each source.

What's your deploy process look like? Can you run migrations on a canary instance and then load testing/warm up later? What's the issue you're encountering with rollbacks? Are you running down migrations manually via the CLI to the previous version before rolling back? Would a canary help?

Locking to prevent multiple app versions using multiple schema versions is generally a good thing.

SJrX commented 3 years ago

Was your plan to keep calling Next() until you hit an error? Or were you planning on adding a Last() method to source.Driver?

I think it was to call Next() until I hit an error. I haven't looked at the code in a few months, but from what I recall migrations are sent over a channel and processed concurrently, so I was going to try to just have the migrations read but not processed. Last() might also be doable and would be more performant in some cases, but I care more about the CLI interface than the implementation, so whatever you'd be good with is fine.

I'm against exposing a way to get the latest source version (via a function/method and CLI command) since it increases the scope and complexity of this project and there are other ways and tools to get the latest source version for each source.

I'm not sure if we are talking past each other a bit. I would like to add a command that tells me if we need to run migrations, not necessarily to allow determining the actual last version in a bash script. That said, I think exposing more information via the CLI is potentially more robust, since it makes scripting easier and prevents breakage. I could implement this request, for instance, by installing the mongo command line tools and scraping the file system, but that is very brittle and cumbersome. It'd be much easier and more composable if I could just get important metadata via the CLI. It would also help cleanly separate policy from mechanism, this project being a mechanism for managing migrations while allowing users to specify and implement lots of policies.

What's your deploy process look like?

Essentially we have a container, and it tries to follow the 12 Factor App approach of managing everything in a single container. We also focus more on choreography than orchestration for our deployments, so the containers are responsible for running migrations, as opposed to a central Ansible script that must be modified.

This also helps because locally we are using docker-compose and so we just start up the containers and then the migrations are done locally with a clean slate, instead of local developers having to manually remember to run and keep their dbs up to date.

Also very importantly we have zero downtime deployments, so the migrations we are performing have to run concurrently with application requests, which may mutate the state of the db, and this being Mongo, there is not very much in terms of atomicity guarantees or transactions. This very much limits the scope of the changes we can do (for instance, we used Versioned Documents in Mongo, as opposed to large scale data transforms).

Can you run migrations on a canary instance and then load testing/warm up later?

No, unfortunately that is beyond our automation capabilities at the moment. An issue I have run into repeatedly is that you mostly need to have your migrations do a dry run on production before relying on them, as you can have either data issues or performance issues that pop up only in production.

What's the issue you're encountering with rollbacks?

In 6 years of having automated migrations with Liquibase and golang-migrate, I've never found rollbacks to be particularly useful. At my previous company we would just restore a backup. One major issue I have with rollbacks is that they need to be tested and no one tests them, especially in an automated way, so there is a ton of overhead. So I've never been a fan. Additionally, we have largely followed a pattern of decoupling the DB version from the application version, so our "migrations" largely won't break anything or have any need to roll back.

If we wanted to have rollbacks there are one (or two intertwined) problems: managing the rollback logic. Right now we just use migrate up, which runs everything to the latest version. We could change containers to use migrate up [N], so version 10 of the container might have migrate up 13, version 11 might have migrate up 16, and version 12 might have migrate up 22. This would be more work for devs, and the rollback would be more complex. For an environment that is deployed on every merge, when we roll back we face a problem: version 12 is the only thing that can roll back versions [17,22], but version 11 is the only thing that knows what version it should run. Production or a nightly environment might be running version 10, so if it rolls back you need to roll back further. This would push a lot of complexity into the deployment process.

Are you running down migrations manually via the CLI to the previous version before rolling back?

We aren't running them at all.

Would a canary help?

I don't think it would give me the robustness I want. The above approach also solves another problem we haven't talked about. Say I am creating an index in version 2, and that migration takes 5 minutes to run. Version 1 can run happy as a clam, except when we start scaling version 1: when version 1 starts, it needs to know whether the db is at least as up to date as it wants it to be. So right now version 1 acquires a lock and waits for the version 2 migration to run to completion. If we need to scale, we can't. This command would give us more scripting capabilities in migrate, so that we could achieve our own ends. We could simply not run the migrations before the app starts, but that introduces a lot of safety risk. In particular, if an index is critical, such as a unique index, then the application may be running without its invariants. We would also have to remember to run things locally on our machines, so it introduces overhead.

dhui commented 3 years ago

Was your plan to keep calling Next() until you hit an error? Or were you planning on adding a Last() method to source.Driver?

I think it was to call Next() until I hit an error. I haven't looked at the code in a few months, but from what I recall migrations are sent over a channel and processed concurrently, so I was going to try to just have the migrations read but not processed. Last() might also be doable and would be more performant in some cases, but I care more about the CLI interface than the implementation, so whatever you'd be good with is fine.

👍. If we do this, let's go with successive calls to Next() to avoid any interface changes. The performance hit should be negligible in most cases.
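For illustration, a minimal sketch of what "successive calls to Next()" could look like against the existing source.Driver interface. The helper name and file path are made up; the os.ErrNotExist convention is the one documented for source drivers:

package main

import (
	"errors"
	"fmt"
	"log"
	"os"

	"github.com/golang-migrate/migrate/v4/source"
	_ "github.com/golang-migrate/migrate/v4/source/file" // filesystem source driver
)

// latestSourceVersion walks the source driver with First()/Next() until the
// driver reports os.ErrNotExist, which the source.Driver contract uses to
// signal that there is no further migration.
func latestSourceVersion(drv source.Driver) (uint, error) {
	v, err := drv.First()
	if err != nil {
		return 0, err // os.ErrNotExist here means the source has no migrations at all
	}
	for {
		next, err := drv.Next(v)
		if errors.Is(err, os.ErrNotExist) {
			return v, nil // v was the last migration in the source
		}
		if err != nil {
			return 0, err
		}
		v = next
	}
}

func main() {
	drv, err := source.Open("file://db/migrations")
	if err != nil {
		log.Fatal(err)
	}
	defer drv.Close()

	latest, err := latestSourceVersion(drv)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("latest source version:", latest)
}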

I'm against exposing a way to get the latest source version (via a function/method and CLI command) since it increases the scope and complexity of this project and there are other ways and tools to get the latest source version for each source.

I'm not sure if we are talking past each other a bit. I would like to add a command that tells me if we need to run migrations, not necessarily to allow determining the actual last version in a bash script. That said, I think exposing more information via the CLI is potentially more robust, since it makes scripting easier and prevents breakage. I could implement this request, for instance, by installing the mongo command line tools and scraping the file system, but that is very brittle and cumbersome. It'd be much easier and more composable if I could just get important metadata via the CLI. It would also help cleanly separate policy from mechanism, this project being a mechanism for managing migrations while allowing users to specify and implement lots of policies.

Sorry, I didn't provide enough context in my original comment! We shared a similar line of thinking. A single-use command for checking whether the latest migration has been applied has limited utility, so a command that exposes the latest source version would be more useful. But I don't want to expand the scope of the CLI tool to include this, for the same reasons you don't want to do it. The work to get the latest source migration version needs to be done to solve your problem, but I don't think it belongs in migrate.

Also, what source driver(s) are you using? You shouldn't need to install the mongo CLI since migrate has the version command, which gets the latest applied migration version.

What's your deploy process look like?

Essentially we have a container, and it tries to follow the 12 Factor App approach of managing everything in a single container. We also focus more on choreography than orchestration for our deployments, so the containers are responsible for running migrations, as opposed to a central Ansible script that must be modified.

This also helps because locally we are using docker-compose and so we just start up the containers and then the migrations are done locally with a clean slate, instead of local developers having to manually remember to run and keep their dbs up to date.

Also very importantly we have zero downtime deployments, so the migrations we are performing have to run concurrently with application requests, which may mutate the state of the db, and this being Mongo, there is not very much in terms of atomicity guarantees or transactions. This very much limits the scope of the changes we can do (for instance, we used Versioned Documents in Mongo, as opposed to large scale data transforms).

Can you run migrations on a canary instance and then load testing/warm up later?

No, unfortunately that is beyond our automation capabilities at the moment. An issue I have run into repeatedly is that you mostly need to have your migrations do a dry run on production before relying on them, as you can have either data issues or performance issues that pop up only in production.

What's the issue you're encountering with rollbacks?

In 6 years of having automated migrations with Liquibase and golang-migrate, I've never found rollbacks to be particularly useful. At my previous company we would just restore a backup. One major issue I have with rollbacks is that they need to be tested and no one tests them, especially in an automated way, so there is a ton of overhead. So I've never been a fan. Additionally, we have largely followed a pattern of decoupling the DB version from the application version, so our "migrations" largely won't break anything or have any need to roll back.

If we wanted to have rollbacks there are one (or two intertwined) problems: managing the rollback logic. Right now we just use migrate up, which runs everything to the latest version. We could change containers to use migrate up [N], so version 10 of the container might have migrate up 13, version 11 might have migrate up 16, and version 12 might have migrate up 22. This would be more work for devs, and the rollback would be more complex. For an environment that is deployed on every merge, when we roll back we face a problem: version 12 is the only thing that can roll back versions [17,22], but version 11 is the only thing that knows what version it should run. Production or a nightly environment might be running version 10, so if it rolls back you need to roll back further. This would push a lot of complexity into the deployment process.

Agreed that using migrate up [N] in this case adds complexity and isn't worth it.

Are you running down migrations manually via the CLI to the previous version before rolling back?

We aren't running them at all.

Would a canary help?

I don't think it would give me the robustness I want. The above approach also solves another problem we haven't talked about. Say I am creating an index in version 2, and that migration takes 5 minutes to run. Version 1 can run happy as a clam, except when we start scaling version 1: when version 1 starts, it needs to know whether the db is at least as up to date as it wants it to be. So right now version 1 acquires a lock and waits for the version 2 migration to run to completion. If we need to scale, we can't. This command would give us more scripting capabilities in migrate, so that we could achieve our own ends. We could simply not run the migrations before the app starts, but that introduces a lot of safety risk. In particular, if an index is critical, such as a unique index, then the application may be running without its invariants. We would also have to remember to run things locally on our machines, so it introduces overhead.

Thanks for sharing more about your deploy process! Can you do rolling deploys to a subset of nodes in prod instead of all at once?

It seems that the double-checked locking you mentioned is a better solution for this problem. This proposal seems to facilitate the initial lock check, e.g. allowing the deploy process to check the version before running migrations, which acquire the lock. I'd rather have a postgres-specific option to enable double-checked locking (e.g. x-double-checked-lock) to unblock you. If enough other db drivers also need double-checked locks, we can factor this logic out into migrate, which would probably require adding a new method to the database.Driver interface. A few notes before implementing:

I'm happy to accept a double lock implementation for postgres, but honestly, at scale, you're better off supporting canary and/or rolling deploys which allow time to catch issues in prod and minimize impact.

SJrX commented 3 years ago

+1. If we do this, let's go with successive calls to Next() to avoid any interface changes. The performance hit should be negligible in most cases.

Sounds good

Also, what source driver(s) are you using? You shouldn't need to install the mongo CLI since migrate has the version command, which gets the latest applied migration version.

Ah yes, I didn't notice the version command; that sort of changes things.

We are using the filesystem source driver.

A single-use command for checking whether the latest migration has been applied has limited utility, so a command that exposes the latest source version would be more useful. But I don't want to expand the scope of the CLI tool to include this, for the same reasons you don't want to do it. The work to get the latest source migration version needs to be done to solve your problem, but I don't think it belongs in migrate.

The main reason I don't want to write code to determine the latest available source migration version is that it circumvents the interface and encapsulation of the migrate tool. It also isn't portable or reusable for people who use other migration sources.

Also, our entrypoint code becomes more tedious. Instead of something like:

# hypothetical `check` command: exit 0 = up to date, non-zero = migration needed
/migrate -database "$dsn" -path /db/migrations check
if [ $? -ne 0 ]; then
   /migrate -verbose \
      -database "$dsn" \
      -path /db/migrations \
      up
fi

It would be something like:

# count the migration files on disk (assumes sequentially numbered migrations)
MIGRATION_COUNT=$(find /db/migrations/*.up.sql | wc -l)

if [ $? -ne 0 ]; then
  exit 1
fi

# `version` prints the currently applied migration version
# (2>&1 in case the CLI writes its output to stderr)
CURRENT_VERSION=$(/migrate -database "$dsn" -path /db/migrations version 2>&1)
if [ $? -ne 0 ]; then
  exit 2
fi

if [ "$MIGRATION_COUNT" -ne "$CURRENT_VERSION" ]; then
   /migrate -verbose \
      -database "$dsn" \
      -path /db/migrations \
      up
fi

It's not the end of the world, and I'm not a fan of it, but we might end up doing that if we don't make a change.

Thanks for sharing more about your deploy process! Can you do rolling deploys to a subset of nodes in prod instead of all at once?

We can and do, but that's where we run into problems currently. When containers start up, we want to make sure the db version is at least as new as the container wants. While migrations are running, the old versions of the containers can't start up because the lock is held. The thing being mutated is the db state, which is sort of a singleton, something independent of application deploys whose evolution we want to manage in a controlled way.

It seems that the double-checked locking you mentioned is a better solution for this problem

I think we could do that, if we also made an option to up to allow it to not error when the db version is newer.

I think the reason I kind of cooled on double-checked locking is that I'd essentially have to do one of the following:

  1. read all the migrations twice
  2. add an intermediary goroutine here:
ret := make(chan interface{}, m.PrefetchMigrations)

go m.readUp(curVersion, -1, ret)
return m.unlockErr(m.runMigrations(ret))

I actually don't think the drivers need to change at all. I just haven't had time to dig into Go very much and am still new, but I essentially have the next week to focus and drill into this problem.

That would essentially receive all the migrations and then only send them on to runMigrations if necessary (or something like that).
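Roughly, the intermediary-goroutine idea could look like the following. This is a sketch only, not a concrete proposal: shouldRun is a hypothetical decision helper, and it assumes readUp closes its channel when finished.

// Sketch: instead of handing readUp's channel straight to runMigrations,
// buffer what readUp produces, decide whether anything actually needs to
// run, and only then forward the migrations on.
ret := make(chan interface{}, m.PrefetchMigrations)
go m.readUp(curVersion, -1, ret)

filtered := make(chan interface{}, m.PrefetchMigrations)
go func() {
	defer close(filtered)
	var buffered []interface{}
	for r := range ret { // assumes readUp closes ret when it is done
		buffered = append(buffered, r)
	}
	if shouldRun(buffered) { // hypothetical decision helper
		for _, r := range buffered {
			filtered <- r
		}
	}
}()

return m.unlockErr(m.runMigrations(filtered))

Note that this buffers everything readUp emits up front, which defeats the streaming that PrefetchMigrations provides; that is one reason reading the source more than once (as in the next comment) ends up simpler.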

SJrX commented 3 years ago

I played around with this, and I think I may have something that works; it is much simpler if I can read the migrations more than once. I will put up a PR tomorrow. Most of the changes are actually in migrate.

dhui commented 3 years ago

We can and do, but that's where we run into problems currently. When containers start up, we want to make sure the db version is at least as new as the container wants. While migrations are running, the old versions of the containers can't start up because the lock is held. The thing being mutated is the db state, which is sort of a singleton, something independent of application deploys whose evolution we want to manage in a controlled way.

How over-provisioned are your clusters? If the rolling deploy size is smaller than the over-provisioned amount, then there shouldn't be any issues unless there's an unexpected traffic spike 🙈

I actually don't think the drivers need to change at all. I just haven't had time to dig into Go very much and am still new, but I essentially have the next week to focus and drill into this problem.

I think you're right. My original suggestion to only make the change in the driver won't work; it essentially disables locking. We still need to check the migration version to determine whether the lock should be acquired, which isn't a straightforward action for the DB driver to do, due to the complexities around locking and other DB driver actions like running migrations and setting the applied migration version.

The simplest path forward is to fetch and compare the source and applied migration versions, e.g. the 2nd shell snippet you provided.

SJrX commented 3 years ago

How over-provisioned are your clusters? If the rolling deploy size is smaller than the over-provisioned amount, then there shouldn't be any issues unless there's an unexpected traffic spike 🙈

Not that over-provisioned, actually. It's not just traffic spikes, it's container failures as well: if a pod needs to be restarted, it can't be. We have auto-scaling, so we have no real need to over-provision, and it's expensive when you multiply it out across the number of services.

I think you're right. My original suggestion to only make the change in the driver won't work; it essentially disables locking. We still need to check the migration version to determine whether the lock should be acquired, which isn't a straightforward action for the DB driver to do, due to the complexities around locking and other DB driver actions like running migrations and setting the applied migration version.

I think I was looking at something like this in migrate.go.

// Up looks at the currently active migration version
// and will migrate all the way up (applying all up migrations).
func (m *Migrate) Up(doubleCheckLocking bool, ignoreMissingMigrationVersions bool) error {
    // NEW BLOCK
    if doubleCheckLocking {
        latestVersion, err := util.GetLast(m.sourceDrv)
        if err != nil {
            return err
        }

        curVersion, dirty, err := m.databaseDrv.Version()
        if err != nil {
            return err
        }

        if !dirty {
            if curVersion == int(latestVersion) {
                return nil
            } else if curVersion > int(latestVersion) && ignoreMissingMigrationVersions {
                return nil
            }
        }
    }
    // END NEW BLOCK

    if err := m.lock(); err != nil {
        return err
    }

    curVersion, dirty, err := m.databaseDrv.Version()
    if err != nil {
        return m.unlockErr(err)
    }

    if dirty {
        return m.unlockErr(ErrDirty{curVersion})
    }

    // NEW BLOCK
    if ignoreMissingMigrationVersions {
        latestVersion, err := util.GetLast(m.sourceDrv)
        if err != nil {
            return m.unlockErr(err)
        }

        if curVersion > int(latestVersion) {
            return m.unlockErr(nil)
        }
    }
    // END NEW BLOCK

    ret := make(chan interface{}, m.PrefetchMigrations)

    go m.readUp(curVersion, -1, ret)
    return m.unlockErr(m.runMigrations(ret))
}

Then the command line interface would be changed to add the new options.

The simplest path forward is to fetch and compare the source and applied migration versions, e.g. the 2nd shell snippet you provided.

Yeah, I thought about this a bit more. I really don't like the shell scripting above, but maybe that is indeed better.

dhui commented 3 years ago

Sorry for the delayed response!

Node restarts during a migration are a real issue, e.g. old versions should be able to start up without waiting for new migrations to finish running. Running migrations on a canary instance won't fix this issue. I wonder if this can be fixed by adding the target migration version to the lock key, but I don't think adding the target migration version (or any migration version) to the lock key is trivial. This approach might have lock contention issues with a lot of nodes, but it's an incremental improvement.

I believe double-checked locking will also fix this issue, but I'm not sure if there are race conditions we need to worry about.

Let's hold off on implementing double-checked locking in migrate until there are more use cases and we're confident about not having race conditions. I think implementing the double-checked lock in a shell script works for your use case for now.

dhui commented 3 years ago

I thought about how to handle current versions restarting while the next version's migrations are being applied.

This issue exists for any long running migrate operation (like Up(), Down(), Migrate(), Steps(), and Run()), so any fix also needs to be applied for these operations. Note: Force() and Version() aren't long running operations for any db drivers I'm aware of.

Double-checked locking (e.g. a version check before locking) doesn't completely solve the problem. If all of the nodes are restarted to apply the next version, they could all pass the first check and end up blocking on the lock anyway.

If we implement a solution within migrate, it should be a safe one that always works. I don't think the complexity of double-checked locking is worth the mitigations it provides in the general case, since everyone's deployment scenario is different. You could add a random jitter (properly seeded) before the version check, and multiple rounds of version checks with retries, to increase the chance of a successful mitigation, but these choices should be left to the user. Adding all of these options bloats migrate, and they would only be removed once we settle on another solution to this problem, so I think it's best for users to implement any mitigations before attempting long running migrate operations. Do you have other ideas for solving this problem?

In the meantime, we can add some double-checked locking (e.g. version check before locking) code snippets and recommendations to the docs.
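Something along these lines could serve as the documented snippet: a hedged sketch of checking the applied version before attempting Up(). The package and function names are made up, latestSource would come from walking the source driver as discussed above, and ErrNilVersion and ErrNoChange are the library's existing sentinel errors.

package deploy // hypothetical package, for illustration only

import (
	"errors"

	"github.com/golang-migrate/migrate/v4"
)

// upIfBehind is a sketch of application-level "double-checked locking":
// read the applied version (which does not take the migration lock), skip
// Up() entirely when the database is already at or past the newest source
// migration, and otherwise let Up() acquire the lock as usual.
func upIfBehind(m *migrate.Migrate, latestSource uint) error {
	// Optionally sleep a small, properly seeded random jitter here so a
	// fleet of restarting nodes doesn't hit the check at the same moment.

	applied, dirty, err := m.Version()
	if err != nil && !errors.Is(err, migrate.ErrNilVersion) {
		return err
	}
	if err == nil && !dirty && applied >= latestSource {
		return nil // already up to date; never touches the lock
	}
	if err := m.Up(); err != nil && !errors.Is(err, migrate.ErrNoChange) {
		return err
	}
	return nil
}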

twelsh-aw commented 1 year ago

+1 for a built-in 'allow-newer' command. We have a similar deploy and rollback situation to the OP, and this feature exists in other libraries (Flyway, for example).

Any maintainer still interested in having this feature? Can I put up a PR? Open a new issue?