platformplatform / PlatformPlatform

🚀 Pre-alpha state. A platform designed for building enterprise-grade, multi-tenant products using Azure, .NET, React, TypeScript, Infrastructure as Code, etc.
https://platformplatform.github.io

Database schema and data migrations #25

Open tjementum opened 1 year ago

tjementum commented 1 year ago
### Tasks
- [ ] Configure Database Migrations
- [ ] Support running migrations in code

Problem

In .NET Entity Framework, there can be several database schema migrations. However, between schema migrations there might be a need for a data migration done in code. Example:

  1. Schema migration: Add new FirstName and LastName columns to the Users table that allow null
  2. Data migration that populates the new columns from the existing Users.DisplayName column using C# code (see the sketch below)
  3. Schema migration: Remove the null check and delete the old DisplayName column

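For reference, steps 1 and 2 might look roughly like this in EF Core (a minimal sketch; the migration class, the AppDbContext, and the name-splitting logic are illustrative, not taken from PlatformPlatform):

```csharp
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;
using Microsoft.EntityFrameworkCore.Migrations;

// Step 1: a plain EF Core schema migration adding two nullable columns.
public partial class AddFirstAndLastName : Migration
{
    protected override void Up(MigrationBuilder migrationBuilder)
    {
        migrationBuilder.AddColumn<string>(name: "FirstName", table: "Users", nullable: true);
        migrationBuilder.AddColumn<string>(name: "LastName", table: "Users", nullable: true);
    }

    protected override void Down(MigrationBuilder migrationBuilder)
    {
        migrationBuilder.DropColumn(name: "FirstName", table: "Users");
        migrationBuilder.DropColumn(name: "LastName", table: "Users");
    }
}

// Step 2: the data migration is plain C# run against the DbContext, after step 1
// and before step 3 (AppDbContext is a hypothetical context with a Users DbSet).
public static class PopulateNamesDataMigration
{
    public static async Task RunAsync(AppDbContext db)
    {
        await foreach (var user in db.Users.AsAsyncEnumerable())
        {
            var parts = user.DisplayName.Split(' ', 2);
            user.FirstName = parts[0];
            user.LastName = parts.Length > 1 ? parts[1] : "";
        }

        await db.SaveChangesAsync();
    }
}
```

Step 3 would then be another ordinary schema migration; the open question is how to guarantee these pieces always run in exactly this order.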
This is just a simple example. The solution should handle running many complex migrations in a specific order when a system has not been updated for some time (e.g. schema, data, schema, schema, data, schema, data).

How can I create a good migration strategy that runs as part of the Application startup?

Solution 1

Use SQL .dbproj and DacPac: Power your DevOps Development with SQL Server Containers
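For reference, a .dacpac produced from such a database project can also be published from C# via the Microsoft.SqlServer.DacFx package (a minimal sketch; the connection string, file path, and database name are placeholders):

```csharp
using Microsoft.SqlServer.Dac;

// Publish a compiled .dacpac against a target database. DacFx computes the diff
// between the package (desired state) and the live schema and applies it.
var services = new DacServices("Server=localhost;Database=app;Integrated Security=true;TrustServerCertificate=true");
using var package = DacPackage.Load("bin/Release/AppDatabase.dacpac");

var options = new DacDeployOptions
{
    BlockOnPossibleDataLoss = true // refuse changes that would drop data
};

// Arguments: package, target database name, upgradeExisting, deploy options.
services.Deploy(package, "app", true, options);
```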

Solution 2

Managing database migrations and code transformations in a sequential and reliable manner is indeed an important aspect of maintaining a robust application, especially in a complex domain-driven design project. Your problem is about orchestrating these operations in a way that ensures database integrity and data consistency.

For .NET and Entity Framework, you have a tool called EF Migrations, which can handle database schema changes. However, EF Migrations doesn't inherently support code migrations that transform data between schema migrations.

Here is an approach you might consider:

  1. Separate Schema Migrations and Data Migrations: Separate schema migrations (handled by EF Migrations) and data migrations (custom scripts for data transformations).
  2. Sequential Naming Convention: Adopt a naming convention for your migrations that includes a sequence number (e.g., 001_AddColumns, 002_PopulateColumns, 003_RemoveNullCheck). This will ensure that migrations are executed in the correct order.
  3. Custom Migration Runner: Implement a custom migration runner that executes at application startup. This runner should track which migrations have already been applied and run any pending schema and data migrations in order (see the sketch below).

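A minimal sketch of such a runner, assuming EF Core, a hypothetical AppDbContext, and a hypothetical AppliedSteps bookkeeping table (none of these names come from PlatformPlatform):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;
using Microsoft.EntityFrameworkCore.Infrastructure;
using Microsoft.EntityFrameworkCore.Migrations;

public interface IMigrationStep
{
    string Id { get; }                // e.g. "002_PopulateColumns"
    Task ApplyAsync(AppDbContext db);
}

// A schema step delegates to EF Core's migrator, migrating up to a specific EF migration.
public sealed record SchemaStep(string Id, string EfMigrationName) : IMigrationStep
{
    public Task ApplyAsync(AppDbContext db) => db.GetService<IMigrator>().MigrateAsync(EfMigrationName);
}

// A data step is plain C# run against the DbContext.
public sealed record DataStep(string Id, Func<AppDbContext, Task> Transform) : IMigrationStep
{
    public Task ApplyAsync(AppDbContext db) => Transform(db);
}

public static class MigrationRunner
{
    // Runs every step not yet recorded in the AppliedSteps table, in the order given.
    public static async Task RunAsync(AppDbContext db, IReadOnlyList<IMigrationStep> steps)
    {
        var applied = (await db.AppliedSteps.Select(s => s.Id).ToListAsync()).ToHashSet();

        foreach (var step in steps.Where(s => !applied.Contains(s.Id)))
        {
            await step.ApplyAsync(db);
            db.AppliedSteps.Add(new AppliedStep { Id = step.Id, AppliedAt = DateTime.UtcNow });
            await db.SaveChangesAsync();
        }
    }
}
```

The interleaved order from the example above would then be declared once, e.g. new SchemaStep("001_AddColumns", ...), new DataStep("002_PopulateColumns", ...), new SchemaStep("003_RemoveNullCheck", ...).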
This solution allows you to manage both schema migrations and data transformations in a sequential manner. It also makes sure the order of operations is preserved when you deploy to different environments.

Remember that handling migrations this way could potentially cause longer downtimes during deployment, especially for large data migrations. So, you need to plan your deployments carefully.

Solution 3

A more naïve approach would be to run migrations as part of application startup. If not done carefully, this might cause race conditions. One way to avoid race conditions is to use "Init containers":

Generally available: Init containers in Azure Container Apps Published date: August 16, 2023 The init containers feature in Azure Container Apps is now generally available. Init containers are specialized containers that run to completion before application containers are started in a replica, and they can contain utilities or setup scripts not present in your container app image. Init containers are useful for performing initialization logic such as setting up accounts, running setup scripts, and configuring databases.

To learn more, visit: https://aka.ms/aca/init-containers
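For reference, the "run at startup" variant is typically just a call to EF Core's migrator before the app starts serving traffic (a sketch assuming a minimal ASP.NET Core Program.cs and a hypothetical AppDbContext); an init container would run the same logic to completion before the application containers start:

```csharp
using Microsoft.EntityFrameworkCore;
using Microsoft.Extensions.DependencyInjection;

// In Program.cs (or in the init container's entry point), after building the app:
// apply any pending EF Core migrations before accepting traffic.
// `app` is the WebApplication built earlier in Program.cs.
using (var scope = app.Services.CreateScope())
{
    var db = scope.ServiceProvider.GetRequiredService<AppDbContext>();
    await db.Database.MigrateAsync();
}

app.Run();
```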

toutas commented 6 months ago

DbUp is a great alternative solution to the EF Migrations problem (schema, data, schema), since the migration scripts are executed in order and are only applied once (tracked in the SchemaVersions table). Integrating a migration run at startup or in the build pipeline is as simple as:

using System;
using System.Configuration;
using System.Reflection;
using DbUp;

// scriptName and executionTimeoutMinutes are assumed to be defined by the caller;
// the fourth '-'-separated segment of the script name selects the connection string.
var internalScriptName = scriptName;
var db = scriptName.Split('-')[3];
var connectionString = ConfigurationManager.ConnectionStrings[db];

var upgrader = DeployChanges.To
    .PostgresqlDatabase(connectionString.ConnectionString)
    .WithExecutionTimeout(new TimeSpan(0, executionTimeoutMinutes, 0))
    .WithScriptsEmbeddedInAssembly(Assembly.GetExecutingAssembly(), s => s == internalScriptName)
    .LogToConsole()
    .Build();

var result = upgrader.PerformUpgrade();
if (!result.Successful) throw result.Error;

Would this be a viable alternative?

tjementum commented 6 months ago

Thanks for the feedback/suggestion.

I've used DbUp extensively over the past 4 years. It is certainly an alternative, but it can be hard to understand what your DB schema looks like once you have made hundreds of migrations. It is also not trivial to work with in disaster recovery, as you need full control over which version your schema was at, and your backups might differ (unless you do a full database restore during disaster recovery, so the migration history table stays in sync).

Like EF migrations, it requires a deep understanding by everyone using these migrations, but most people think they are trivial because they only test the happy path. I've seen people go back and change old scripts because it was too slow to run on staging, then another slightly different version was deployed to production, essentially making the active schema ever so slightly different between staging and production (in terms of indexes). A bomb waiting to explode.

Personally, I like DacPac as it's declarative, and you have tooling to modify your database schema. Basically, you have a .dbproj that defines how you want your database to look, and the tooling generates the diff, which ensures no schema drift happens. The generated scripts can do complex transactional migrations. However, this approach has limited support for "schema, code, schema" migrations. But so do DbUp and EF Migrations.

I'm starting to lean towards doing database migrations as part of CI/CD. It will make it clear for everyone that all database migrations must be at least one version backward compatible (so you can roll back, and so rolling deployments do not cause problems). Also, database migrations are much more likely to cause production problems, in the same way as infrastructure changes (Bicep, Terraform). So, they need more governance than normal code deployments.
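One way to make that governance concrete with DacPac is to have the pipeline generate the diff script as a reviewable artifact before anything is applied (a sketch using the Microsoft.SqlServer.DacFx API; the environment variable, paths, and database name are placeholders):

```csharp
using System;
using System.IO;
using Microsoft.SqlServer.Dac;

// In a CI/CD job: produce the migration script as an artifact for manual review,
// instead of applying changes from application startup.
var services = new DacServices(Environment.GetEnvironmentVariable("TARGET_DB_CONNECTION_STRING"));
using var package = DacPackage.Load("artifacts/AppDatabase.dacpac");

// Arguments: package, target database name.
var script = services.GenerateDeployScript(package, "app");
File.WriteAllText("artifacts/deploy-diff.sql", script); // attach to the approval gate
```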

We are still trying to make up our mind around this. Until we find something noticeably better, we'll stick with EF migrations.

Trubador commented 5 months ago

I would like to contribute on this topic, as I have done a lot of research on SQL schema management. I would like to discuss and get feedback, in particular on modular migrations that are reliable for different types of B2B/B2C SaaS systems.

tjementum commented 5 months ago

Hi Casper,

Cool. Would you have time for a short introduction (online)?

As written above, I have some ideas on how to build this with DacPac in CI/CD workflows. But it's not completely set in stone.

We are currently using EF Core migrations, which are fine for now, but they are run as part of startup, which makes them slow, fragile, and susceptible to race conditions. If a migration fails, there is an incident.

I've used DacPac before, but not in CI/CD, so we will have to do research to ensure it is reliable in both continuous deployment scenarios, and in disaster recovery (where you might restore a database that is several versions old).

A core part of our principles is that "application code" is deployed with continuous deployment (push on green), compared to continuous delivery (manual approval). See https://www.atlassian.com/continuous-delivery/principles/continuous-integration-vs-delivery-vs-deployment:

[Screenshot: Atlassian diagram comparing continuous integration, continuous delivery, and continuous deployment]

Database code is infrastructure and is deployed with continuous delivery (manual approval). So, I fear the scenario where a developer makes a pull-request with both database changes and code changes. In our setup, the application code will always hit production before the database changes. Somehow, we need to protect ourselves against this.

If you have experience with or ideas on how to set this up in a robust, clean, and developer-friendly way, I would love to talk. To be honest, I have not seen anyone solve this problem in a good way :)

Trubador commented 5 months ago

Hey Thomas,

I think I can find some time for an online meeting about this topic, as I think it is quite important, and I would like to understand your points of view on the matter a bit better.

If you could connect with me via my GitHub profile, for example on LinkedIn, I think that would be the easiest way to schedule a meeting 😀.