Add a new release script to wait_for_migrations

mac-chaffee commented 3 months ago

Problem

When deploying to environments with multiple replicas of a Phoenix app (as in Kubernetes), a common problem is that deploying a new instance of the app requires waiting for migrations to complete. The problem is described in depth here: https://andrewlock.net/deploying-asp-net-core-applications-to-kubernetes-part-7-running-database-migrations/

To summarize: when you have one instance of a Phoenix app, it is simple to run migrate before the app starts up (for example, in an initContainer). But if you have two instances of a Phoenix app (for high availability), you would run the migrations separately from the app startup process (for example, in a Kubernetes Job) so that two processes don't race to apply the same migrations. But now you need to somehow tell the apps to wait for the migrations to complete, or else the new app may start up and try to access a column that doesn't exist yet.

Solution

This PR adds a wait_for_migrations release script that essentially checks the output of mix ecto.migrations every 5 seconds until it shows that all migrations have been applied.
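A minimal sketch of the polling loop described above, not the PR's actual code. It assumes migration status arrives as a list of `{:up | :down, version, name}` tuples, which is the shape `Ecto.Migrator.migrations/1` returns. The module name `WaitForMigrations`, the `fetch_status` callback, and the `:interval_ms` and `:max_attempts` options are hypothetical.

```elixir
defmodule WaitForMigrations do
  @moduledoc """
  Hypothetical helper: polls a status function until no migration is pending.
  """

  # `fetch_status` is a zero-arity function returning a list of
  # {:up | :down, version, name} tuples.
  def wait(fetch_status, opts \\ []) when is_function(fetch_status, 0) do
    interval_ms = Keyword.get(opts, :interval_ms, 5_000)
    max_attempts = Keyword.get(opts, :max_attempts, 60)
    poll(fetch_status, interval_ms, max_attempts)
  end

  # Give up once the attempt budget is spent.
  defp poll(_fetch_status, _interval_ms, 0), do: {:error, :timeout}

  defp poll(fetch_status, interval_ms, attempts_left) do
    # Keep only the versions that have not been applied yet.
    pending = for {:down, version, _name} <- fetch_status.(), do: version

    if pending == [] do
      :ok
    else
      IO.puts("Waiting for #{length(pending)} pending migration(s)...")
      Process.sleep(interval_ms)
      poll(fetch_status, interval_ms, attempts_left - 1)
    end
  end
end
```

In a release, `fetch_status` could be something like `fn -> Ecto.Migrator.migrations(MyApp.Repo) end`, called once the repo has been started (for example inside `Ecto.Migrator.with_repo/2`); `MyApp.Repo` is a placeholder. Passing the timing values as options is one possible way to make the timeouts configurable.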

You can see an example of this solution in practice here: https://gitlab.com/mac-chaffee/crowdsort/-/tree/master/chart/crowdsort/templates

Questions

The wait_for_migrations function is a bit complex and could use tests, but I wasn't sure where to put them.
Should this even be in Phoenix? Perhaps wait_for_migrations should be part of Ecto, with Phoenix just calling that function? Or maybe we define the function deeper inside Phoenix so that release.ex stays minimal?
I haven't touched Elixir in 4 years, so my code may be terrible!
I didn't make the timeouts configurable, but they probably should be. What's the best way to load that config in this file?
I'm also using IO.puts, since I wasn't sure whether logging is accessible to this script.

Open to all feedback!

SteffenDE commented 3 months ago

> But if you have two instances of a phoenix app (for high availability), you would run the migrations separate from the app startup process (like in a Kubernetes Job) so you don't encounter race conditions from two different processes trying to apply the same migrations. But now you need to somehow tell the apps to wait for the migrations to complete, or else the new app may start up and try to access a column that doesn't exist yet.

Is this really an issue? I have multiple Phoenix apps running on 2+ nodes, all of which try to apply migrations. Because of the migration lock, only one node actually applies the migrations, while the others wait for the lock and then realise that the migrations are already up. So this approach might be overcomplicating things?

mac-chaffee commented 3 months ago

Oh, sure enough you're right: https://github.com/elixir-ecto/ecto_sql/blob/48fc2ad6e8afb022f8454350e23122c3304451d1/lib/ecto/migrator.ex#L399

> In order to run migrations, at least two database connections are necessary. One is used to lock the "schema_migrations" table and the other one to effectively run the migrations. This allows multiple nodes to run migrations at the same time, but guarantee that only one of them will effectively migrate the database.

I was taking this line from the linked article at face value:

> if we have 3 replicas, and we try and perform a rolling update then we may end up with multiple applications trying to migrate the database at the same time. This is unsupported in every migration tool I know of, and carries the risk of data corruption.

So using an initContainer for migrations would indeed work in Kubernetes. Thanks!
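For reference, here is a sketch of a release migration task that an initContainer in each replica could run. It follows the common Phoenix release pattern. `MyApp.Release` and `:my_app` are placeholder names, and the app is assumed to list its repos under the `:ecto_repos` config key. Because Ecto takes the migration lock, concurrent calls from several replicas should be safe: one applies the migrations and the others find nothing left to do.

```elixir
defmodule MyApp.Release do
  # Placeholder OTP app name; replace with the real application.
  @app :my_app

  def migrate do
    # Load the app so its config (including :ecto_repos) is available.
    Application.load(@app)

    for repo <- Application.fetch_env!(@app, :ecto_repos) do
      # Starts the repo temporarily, runs every pending migration, then stops it.
      {:ok, _result, _apps} =
        Ecto.Migrator.with_repo(repo, &Ecto.Migrator.run(&1, :up, all: true))
    end
  end
end
```

From a release, this would typically be invoked as `bin/my_app eval "MyApp.Release.migrate()"`, where `my_app` is again a placeholder for the release name.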
