Aborting `migrate apply` leaves the database in an inconsistent state

etrepum commented 3 years ago

We currently run our PR builds on Heroku using the cli-migrations-v3 image. In situations where applying migrations takes longer than 60 seconds, Heroku will kill the process with a boot timeout because it has not yet started listening on $PORT.

Our v1.3 workaround was to automatically restart the app if it crashed with a boot timeout and let it apply the rest of the migrations.

Since upgrading to v2.0, this often leaves the database in an inconsistent state because running the migration and updating the version are not done in a transactional manner (Run succeeds, but SetVersion does not get called). This requires manual intervention to mark the last migration as successful with --version=VERSION --skip-execution.

https://github.com/hasura/graphql-engine/blob/5b7d949ef41c37104d4650ccbd1445e678bfbf0e/cli/migrate/migrate.go#L1167-L1177

scriptonist commented 3 years ago

@etrepum Got it, from v2.0 there has been a change in how migrations are handled.

Consider you have the following migrations

Pre 2.0, when you run a command migrate apply what CLI does under the hood is to concatenate the SQL contents of migrations from 1 through 4 and add the appropriate SQL statements corresponding to updates to schema_migrations table where the "state" of migrations are stored ie mark migrations as applied/unapplied. This whole set of SQL (migrations + state updates) are sent in a single run_sql API call to Hasura which is executed in a single transaction.

Starting 2.0, this has changed a bit, which is primarily because of the change in how Hasura sees connected databases/sources. Now if you are applying migrations 1 through 4, each migration will be sent in separate run_sql API calls and the migration state store has changed from the schema_migrations table to Hasura metadata storage layer exposed over a set of API's. So, the state store which stores the migration state and the database on which migrations are applied are different and therefore they cannot be run in a single transaction.

What I've mentioned are the changes that have happened and their side effects. But I definitely understand that these new changes are affecting your workflow/DX. So, we will work on a solution.

etrepum commented 3 years ago

I understand why it’s much less safe now, but that’s not usually the direction you expect software to evolve. The first rule of distributed systems is that a failure can happen on any request.

It certainly could be run in a single transaction in cases where the metadata store and the database were the same, much easier than making distributed transactions work. I’m not sure what alternative there would be without at least creating something like schema_migrations on the target database that could be used to reconcile inconsistencies with the centralized metadata.

jflambert commented 1 year ago

Hi @rikinsk and @scriptonist I absolutely need a way to run all hasura migrations as a single transaction. I'm currently using 2.11. I believe this behaviour was introduced in 2.0 and I need an alternative.

Right now I have to change my deployment strategy from RollingUpdate to Recreate to avoid pods using the database while Hasura updates it in a newer pod. And if it fails, then I'm stuck with a partial upgrade. I want all or nothing.

osdiab commented 1 year ago

Just ran into this in production and that was not expected. We didn’t even abort the migration, instead apparently some endpoint when calling hasura deploy in CI timed out and then our database had the migration SQL applied, but never set the migration as applied. Retried the migration, but it just fails again because the SQL in the migration isn’t idempotent. This makes me not want to use Hasura to handle migrations, as that behavior is frightening; surprising this has been stale for 2 years now.

jflambert commented 1 year ago

yes @osdiab we've stopped using hasura for migrations

osdiab commented 1 year ago

Yeah, I find it bizarre that which Hasura migrations are applied has been moved to the separate metadata store instead of just staying alongside the Postgres database. Not having transactional migrations basically makes Hasura's migration system seem like a toy for not serious projects, and aside for deciding which migrations have been applied or not I fail to see why it would be used for anything else regarding Hasura metadata. I could understand why having the metadata store be separate from the main database makes sense for Hasura as a service that supports many remote data sources, but the migration system is by definition tied to the database you're running migrations on, so separating that out seems bizarre.

hasura / graphql-engine

Aborting `migrate apply` leaves the database in an inconsistent state #7166