Mythra opened this issue 2 years ago
I, too, have been watching the `migrations/` folders in our projects grow without end and wondering if we should add consolidation support and what that might look like.
To support modern deployment practices, I'd like the actual database-side changes to be represented in the `migrations/` directory so they can be embedded and executed by the application itself.
This is what I'm thinking:

- The user moves their existing migrations into `migrations/.old/` and creates new migrations in `migrations/` that consolidate the changes in the old migrations into however many SQL files they want. We can create a CLI command to help with this.
- The user adds a `migrations/.old/BASELINE` (bikeshedding?) file containing the highest version number in the new migrations that needs to be applied to match the state of the old migrations.
- The `sqlx::migrate!()` macro would embed both the old migration set and the new migration set.

Then the migration machinery would handle it like so:

- When running migrations via `sqlx-cli` or `Migrator::run()`, the migration machinery would first check whether the database is using the old migration set or the new one by comparing the contents of `_sqlx_migrations` to both migration sets; looking at just the first migration should be sufficient. If neither matches, then it returns an error.
- If the database is on the old migration set, it would clear out the `_sqlx_migrations` table and record the new migrations up through the version number set in `migrations/.old/BASELINE` as being successfully applied.
- Once the database has been successfully migrated, the user may delete the `.old` folder. The application will then exclusively use the new migration set.
To make this migration minimally error-prone, we can provide a couple of helper CLI commands:

- One to generate the consolidated migrations, outputting to the `migrations/` folder. If I had to state a preference, I think it should output a separate file per table/object, as that's normally how I tend to organize baseline migrations, but it could have a flag to consolidate into a single file instead. Or maybe it could start in interactive mode by default and prompt the user for that option.
- […] a `CREATE DATABASE` statement they didn't ask for.

`BASELINE` could also include a `checked` flag (I would then make the file TOML format) that starts `false`, and `sqlx migrate run` could refuse to run, and `sqlx::migrate!()` refuse to compile, unless `checked = true`. The user could then either run the command, which would change the flag if successful, or just change it manually to override.

A coworker brought up the potential for ambiguities where the whole or part of the new migration set matches the old set (maybe some files were copied verbatim).
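For concreteness, the proposed `BASELINE` file in TOML form might look something like this (the keys and layout are purely illustrative; nothing here is an existing sqlx format):

```toml
# migrations/.old/BASELINE (hypothetical layout)

# Highest version in the *new* set that reproduces the old schema.
baseline = 3

# Safety flag: tooling would refuse to run or compile until this is true.
checked = false
```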
I think a more robust logic is that the migration machinery should check if the current state of the migrations table matches the new migration set first, and if so, just treat the new migration set as the canonical set. If there's a conflict with the new migration set, it then falls back to upgrading from the old migration set. This will produce a nice fast path for databases that have already been upgraded.
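A minimal sketch of that check order, with invented names and plain integers standing in for real migration records (a real implementation would compare checksums, not just version numbers):

```rust
// Hypothetical sketch of the proposed check order; these are not sqlx APIs.

#[derive(Debug, PartialEq)]
enum Plan {
    // Database already follows the new set (or is fresh): fast path.
    AlreadyNew,
    // Database follows the old set: one-time upgrade via the baseline.
    UpgradeFromOld,
}

fn plan(applied: &[i64], old_set: &[i64], new_set: &[i64]) -> Result<Plan, String> {
    // Looking at just the first applied migration, as proposed above.
    let matches = |set: &[i64]| applied.first().map_or(true, |v| set.first() == Some(v));
    if matches(new_set) {
        // Check the new set first so already-upgraded databases take the fast path.
        Ok(Plan::AlreadyNew)
    } else if matches(old_set) {
        Ok(Plan::UpgradeFromOld)
    } else {
        Err("applied migrations match neither the old nor the new set".into())
    }
}

fn main() {
    let old_set = [1, 2, 3, 20220909];
    let new_set = [100, 101];
    // A database still on the old set needs the one-time upgrade.
    assert_eq!(plan(&[1, 2, 3], &old_set, &new_set), Ok(Plan::UpgradeFromOld));
    // An already-upgraded database takes the fast path.
    assert_eq!(plan(&[100], &old_set, &new_set), Ok(Plan::AlreadyNew));
    // Anything else is an error.
    assert!(plan(&[42], &old_set, &new_set).is_err());
}
```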
Thanks for the comments here! I think a lot of this makes sense to me after thinking about it throughout the night. The only thing I'm not quite sold on yet is the need for a `migrations/.old/BASELINE`. I at least naively would've imagined the workflow to be something like:

- I move my old migrations into `migrations/.old`, and run (excluding bikeshedding): `sqlx migrate new-baseline`.
- This gives me a set of migration files (I'm one of those people who wants it all in one file rather than one per table, though I could imagine that being useful too as an option). Its timestamp is the maximum of all the timestamps in `.old`. There should be nothing left in the migrations folder that has a timestamp lower than my baseline.
- […] `.old` with what I want to be my new baseline, then relying on SCM for the extra, extra old files.

The baseline file becomes obsolete in this model as far as I can tell (though I could very easily be missing something!), and in general not needed. I guess it makes it so that you don't have to traverse the `.old` folder in order to find the maximum timestamp, but I imagine most applications being able to take the fast path of not even needing to look at the old folder. And when they do, it'd be a one-time thing (they notice they have a migration timestamp smaller than the oldest present in the database).
> Its timestamp is the maximum of all the timestamps in `.old`.
While this is clever, I don't want to place a restriction on version numbers. In fact, I rather like using integers starting from 1 for the bootstrapping migrations on a project, and then timestamps for post-bootstrap or post-MVP migrations. It can be quite useful for retrospectives to compare the initial vision with where it ended up; a large number of post-MVP migrations might indicate a project with a lot of feature creep or an incomplete original specification. I would similarly use 1-based versions for the consolidated migrations and then go from there.
The `BASELINE` file is thus necessary if there's no correlation between the old and new version numbers. A command to generate it would be pretty simple, though. I don't want to just assume that the old and new migration sets are always equal, because you could have development concurrent with the consolidation that's adding new features, and the migration machinery would need to know to run those new migrations and not just pretend they've already been run.
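To illustrate why this matters when development continues during the consolidation, here's a hedged sketch (the helper is invented, not an sqlx API) of how a baseline would split the new set into "record as applied" versus "actually run":

```rust
// Hypothetical: given the new migration set and the BASELINE version,
// decide which new migrations merely get recorded in `_sqlx_migrations`
// (they reproduce the old schema) and which must genuinely run.

fn split_at_baseline(new_set: &[i64], baseline: i64) -> (Vec<i64>, Vec<i64>) {
    // Versions up through the baseline correspond to the old final schema.
    let record_only: Vec<i64> = new_set.iter().copied().filter(|v| *v <= baseline).collect();
    // Anything after the baseline was added concurrently and must run.
    let must_run: Vec<i64> = new_set.iter().copied().filter(|v| *v > baseline).collect();
    (record_only, must_run)
}

fn main() {
    // New set: 1..=3 reproduce the old schema; 4 landed during the consolidation.
    let (record_only, must_run) = split_at_baseline(&[1, 2, 3, 4], 3);
    assert_eq!(record_only, vec![1, 2, 3]);
    assert_eq!(must_run, vec![4]);
}
```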
Simply concatenating all the existing migrations into a single file is already trivial to do with a few shell commands, so I'm not sure a special CLI command is necessary:

```shell
mkdir migrations/.old
mv migrations/*.sql migrations/.old/
# This is exactly what the `cat` command was designed for.
cat migrations/.old/*.sql > migrations/12345__new_baseline.sql
```
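One caveat with the glob above: shells sort it lexicographically, so integer versions like `10` sort before `2`. Where GNU `sort -V` is available, a version-aware sort preserves apply order; a self-contained sketch (the `demo` directory and file names are invented for illustration):

```shell
# Demonstrate version-aware concatenation: with a bare glob, 10_tenth.sql
# would be concatenated before 2_second.sql.
mkdir -p demo/.old
printf -- '-- one\n' > demo/.old/1_first.sql
printf -- '-- two\n' > demo/.old/2_second.sql
printf -- '-- ten\n' > demo/.old/10_tenth.sql
# `sort -V` compares the embedded numbers numerically, keeping apply order.
ls demo/.old/*.sql | sort -V | xargs cat > demo/0_baseline.sql
cat demo/0_baseline.sql
```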
I wouldn't be content with that, though, as you'd still be left with a mess of redundant `ALTER ...` statements that should really be collapsed into their original `CREATE ...` statements for readability, as well as removing any intermediate DML that may have been required for data consistency in the old migrations but which would obviously be wasted on a fresh database.
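As a purely illustrative example of that kind of collapse (not output from any proposed tool):

```sql
-- Before consolidation: three migrations' worth of history.
CREATE TABLE users (id BIGINT PRIMARY KEY);
ALTER TABLE users ADD COLUMN email TEXT;
UPDATE users SET email = '' WHERE email IS NULL; -- one-time backfill (DML)
ALTER TABLE users ALTER COLUMN email SET NOT NULL;

-- After consolidation: one statement, and no wasted DML on a fresh database.
CREATE TABLE users (
    id    BIGINT PRIMARY KEY,
    email TEXT NOT NULL
);
```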
> While this is clever, I don't want to place a restriction on version numbers. In fact, I rather like using integers starting from 1 for the bootstrapping migrations on a project, and then timestamps for post-bootstrap or post-MVP migrations.
Thanks for the follow-up! I've been thinking about this for a few more days, and I'm still slightly confused as to the purpose of the baseline file. Sorry if I'm being slightly dense; I'm just going to list everything I think about (to make sure we're on the same page, and hopefully to make it clear what I'm missing). Here's my case: […]

In your case, it sounds like it's more: […]
In either situation, I don't see the benefits of a `BASELINE` file; I only see problems that could happen, mainly because SQLx already has an ordering. It has to know what order to apply migration files in. If I start a new database and have a folder with 10 unapplied migrations, SQLx has to use the number at the beginning (which I've always thought of as a timestamp, but knowing it can be any number, let's call it a number).

In my case, having a baseline file not only adds extra work for me as a maintainer (an extra file I have to deal with through the process), but adds potential bugs: "what happens when I move migrations with the numbers 1, 2, 4 into the old folder, but not 3?", "what happens if the baseline number doesn't exist in the folder at all?", etc. Hopefully these cases wouldn't happen, and they seem pretty far-fetched, but why have the opportunity for them at all? In your case, there's also no need for a baseline file, because all files are now becoming the new baseline, and you'll add on from there.
Even if we were to combine your requirements with mine (say, having a series of migrations using version numbers `1`, `2`, etc., wanting to change them to timestamp versioning, but not merge in all of them), there still is going to exist an explicit ordering before and after. For example, if I have the migration numbers 1..4, I might end up with `20220716045441_baseline_was123.sql` and `20220716045442_previously4.sql`.
There is always going to be an ordering. Because of that, regardless of what versioning scheme you use, or whether you change it, we always need to know what order to apply migrations in, and so we can always infer `BASELINE` (and will have problems creating a baseline that mixes that ordering up)! Even if we had a long-lived feature branch when the versioning scheme changed. Let's say I had migrations `1`, `2`, `3` in my main branch, and `4` is in another branch. I create a baseline at `3` and tell SQLx to start using timestamp ordering, so I now have `20220716045441_baseline.sql` in my main branch. My other branch still has `4`, which I would need to manually rewrite to a timestamp that comes after `baseline`. I gain no benefit here from using a baseline file.
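The inference I have in mind is trivial to state in code; a sketch (the helper name is invented):

```rust
// Sketch of the claim: if version ordering stays monotonic across the
// collapse, the baseline boundary can be inferred rather than stored.

fn inferred_baseline(old_set: &[i64]) -> Option<i64> {
    // The boundary is simply the highest version in the old set.
    old_set.iter().copied().max()
}

fn main() {
    let old_set = [1, 2, 3, 20220716045441];
    // Everything in the new set must sort after this version for the
    // inference to hold, which is exactly the restriction on version
    // numbers objected to earlier in the thread.
    assert_eq!(inferred_baseline(&old_set), Some(20220716045441));
}
```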
> I wouldn't be content with that though, as you'd still be left with a mess of redundant `ALTER ...` statements that should really be collapsed into their original `CREATE ...` statements for readability, as well as removing any intermediate DML that may have been required for data consistency in the old migrations but which would obviously be wasted on a fresh database.
Yeah, it's this point exactly; this nails it on the head. Especially when you consider that I'd have to manually futz with the `_sqlx_migrations` table on my existing databases to get it to show the proper new version numbers, and not show the old ones.
@Mythra You misunderstood them: they don't want to switch to only using timestamp-based versioning, they want to be able to use the same scheme of starting from version 1 for the collapsed migrations as well. So if they currently have a set of migrations `[1, 2, 3, ..., 20220909, ...]`, they want to be able to collapse all of those into a new set of `[1, 2, 3, ...]` to be able to do the same kinds of change comparisons after the collapse.
But even if we don't take that use-case into account, a `BASELINE` file would still be necessary if you want to support collapsing the history into multiple migrations. If you don't have something that says which of the new migrations constitute the same final schema as the old migrations, there is no way to safely know which migrations the migrator should just add to the migration history table without actually applying them.
Thank you @kyrias for explaining! I think I roughly understand now; sorry it took me so long to get it, and for polluting this issue 😅!
It may be sufficient to allow users to manually combine some of their older migration files; folks can use `--ignore-missing`, and then the only thing which needs to be done is to add something like `--update-combined` or `--update-checksums` to the `run` command, depending on what folks here would prefer. I'm thinking something like:

- `--update-combined`: any missing migration files will be removed from `_sqlx_migrations`, and existing files will have their checksums updated.
- `--update-checksums`: will update the checksums of any files which have changed (due to the user manually combining files).

I think there are lots of variations on the above, and it is really quite subjective. However, I've used the above pattern quite successfully in the past. It would be great to be able to do the same thing with sqlx.
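A rough sketch of the `--update-combined` semantics I have in mind (the flag and its behavior are part of this proposal, not an existing sqlx-cli option; the maps stand in for the real `_sqlx_migrations` rows and on-disk files):

```rust
use std::collections::BTreeMap;

// Reconcile the recorded history with the files on disk after a manual
// combine: drop rows for merged-away files, refresh surviving checksums.
// Maps are (version -> checksum).
fn update_combined(
    recorded: &BTreeMap<i64, u64>,
    on_disk: &BTreeMap<i64, u64>,
) -> BTreeMap<i64, u64> {
    recorded
        .iter()
        // Drop rows whose files were merged away...
        .filter(|(version, _)| on_disk.contains_key(version))
        // ...and refresh the checksums of files that still exist.
        .map(|(version, _)| (*version, on_disk[version]))
        .collect()
}

fn main() {
    let recorded = BTreeMap::from([(1, 0xAA), (2, 0xBB), (3, 0xCC)]);
    // Migrations 1 and 2 were combined into file 1; its checksum changed.
    let on_disk = BTreeMap::from([(1, 0xAB), (3, 0xCC)]);
    assert_eq!(
        update_combined(&recorded, &on_disk),
        BTreeMap::from([(1, 0xAB), (3, 0xCC)])
    );
}
```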
**Is your feature request related to a problem? Please describe.**
In a project I'm working on, our migrations are meant to be fairly small, usually added with very specific features. For example, when adding web-push notification support: a reversible migration that adds a column for a list of FCM device tokens and their nicknames. While this workflow is really great and makes rollbacks easy, it also leads to having tons of migrations.
This makes discovery and navigation difficult when not connected to a database (where we can use the comments on tables/columns to navigate). Not to mention slowing down sites like GitHub, editors, etc. whenever you need to open the gigantic "migrations" folder that has too many things in it.
**Describe the solution you'd like**
I'd like, every so often when our migrations folder becomes incredibly large, to be able to collapse its N migrations into a single migration (or a few, if some are still new and we may want to keep their small migrations separate for rollback).
This would allow us to keep the migrations folder small, "stamping" migrations once they become stable enough into a file with the current schema and its reverse for unapplying, while still maintaining the benefits of the workflow we have currently.
**Describe alternatives you've considered**