pressly / goose

A database migration tool. Supports SQL migrations and Go functions.
http://pressly.github.io/goose/
Other
7.03k stars 517 forks source link

-allow-missing does not seem to be handled properly when followed by a down-to or down command? #402

Open callaingit opened 2 years ago

callaingit commented 2 years ago

Using goose v3.6.1 with mysql

We applied a missing script that we inserted midway using -allow-missing option

| 3020 | 20220729193614 |          1 | 2022-07-29 19:47:52 |
| 3021 | 20220729193639 |          1 | 2022-07-29 19:47:52 |
| 3022 | 20220729193704 |          1 | 2022-07-29 19:47:52 |
| 3026 | 20220801203535 |          1 | 2022-08-02 17:46:23 |
| 3027 | 20220729140600 |          1 | 2022-08-31 14:38:16 | <-- Last row has version 20220729140600, but it's not the highest version that we executed...

Afterwise, the current version (as reported with version) is 20220729140600, which seems a bit odd to us (shall it be 20220801203535 instead)?

Now, when we use goose down-to commands, goose seems to start with that version going down and of course, that does not work as scripts after 20220729140600 will prevent down scripts from 20220729140600 to complete...

We had expected goose to consider the latest version being the last applied row being ordered by latest version_id descending, not id descending (or at least ordered by version_id, id descending).

Seems like the goose reset command is our only hope, but it removes all migrations...

Was this the expected behavior (implies that "down-to 0" is not the same as "reset" if one has run missing migrations)? If so, can this be clarified somewhere in the docs/blog? If not, can it be fixed?

Thanks.

callaingit commented 2 years ago

Note I have read https://pressly.github.io/goose/blog/2021/out-of-order-migrations/, but I don't see content that explains what should happen when mixing -allow-missing with down/down-to commands.

mfridman commented 2 years ago

but I don't see content that explains what should happen when mixing -allow-missing with down/down-to commands.

Thanks for the feedback, we should improve the documentation around that case.

You're observing correct behaviour because down migrations undo previous migrations based on insert id and not by the version_id you supplied. So in your case, I'd expect a down migration like so:

3027 (20220729140600) -> 3026 (20220801203535) -> 3022 (20220729193704)

Because that was the original insert order.

If reset is using the version_id's then it's a bug and we'll fix that because reset and down-to 0 should mean the same thing.

mfridman commented 2 years ago

Is using the serial id which is effectively the insert order in the database causing issues for you?

I'm trying to understand the use case why you'd want to apply down migrations by version_id, which is a different order of operations as the up migrations.

callaingit commented 2 years ago

Hi @mfridman and thanks for you quick response.

Imagine I had migrations 100 and 200. I did goose up, so current version (as reported by goose version) was 200 (and its serial id was 2). Now, we realized we shall better add a script in the middle of our existing migrations (otherwise, why do we have a -allow-missing I don't see, but maybe there is something I'm missing), so we numbered it 150 and then ran goose up with -allow-missing option. It did the job from a DB migration purpose, but now current version (as reported by goose version) was 150 (not 200), which is a bit weird to us and of course, it is because 150 is inserted with serial id 3.

To our understanding, -allow-missing was supposed to allow people to run intermediary scripts that had been added later on without altering the sequence of the existing migration script versions (i.e. with the end result as if the missing scripts were ran in sequence from), which does not seem to be the case. It breaks the down-to 0 = reset principle.

If you run a missing script in the middle of migration numbers with -allow-missing, and then do a down (by 1), would you expect your last script executed to be the first one to downgrade or the script with the highest version number as you can see it in your migration files? To us, it is this last option that we understand to be the correct interpretation to give to that situation.

I may be wrong, but I think users shall not be concerned about the serial ID, which is an implementation detail, while the only number they see and base all their work on is the one in their file names (be it hand crafter numbers, numbers based on timestamps, etc.).

My understanding is that if you have reset work like down-to 0 now, I'm not sure the typical use case for using the -allow-missing, which is to my understanding to be able to add new migrations in between existing migrations, won't work well when people need to downgrade versions for any valid reason after that as the order of the down scripts, based on the version number of the scripts (not the serial ID, who knows them normally)?

Now, if we so goose down-to 0 for example (to tear everything down let's say and recreate the DB from scratch), our whole script sequence fails because 150 sees its down scripts executed before 200, but if we do goose reset it works (does them in the correct order based on version, not serial ID from what we could observe).

Of course, if one's idea is to reset the whole DB, goose reset is ok even if goose down-to 0 is not, but the idea is that the current version shall probably not be the one with the highest serial id, but the one with the highest version field (and if there are two rows with the same version field, then use the highest serial id among those rows as it seems possible to have more than one row with the same version field value given the current DB design). In order words, sorting by (version_id desc, id desc) is probably the way to handle proper script order even when -allow-missing is used.

Hope that helps you understand the issue we had and our understanding of what was expected because of how we interpreted why -allow-missing exists in the first place and how we thought it should work based on the info we had (doc and our scripts + goose command line tool without really understanding all the way the goose_db_version migration table works internally) and what seemed common sense from a migration script writer adding a script "in the middle".

Thanks.

mfridman commented 2 years ago

There are some really good points here, I'll try to address them.

-allow-missing was supposed to allow people to run intermediary scripts that had been added later on without altering the sequence of the existing migration script versions (i.e. with the end result as if the missing scripts were ran in sequence from)

The idea behind -allow-missing was to allow out-of-order migrations. Dev A and B are working on different branches, both containing migrations. Dev B merges their branch first and it contains a higher (out-of-order) migration [4] which gets applied. Then Dev A merges their branch with a lower-numbered migration [3] and it gets applied.

The state of the database is now, 1, 2, 4, 3

If you were to spin up a brand new database and apply these migrations, the order would be 1, 2, 3, 4.

Now, since we're using -allow-missing, the order of up migrations already does not matter since migrations 3 and 4 are independent of each other.

So, if you're operating in an environment where the order of up migrations does not matter, why does the order of down migrations matter?

From an implementation perspective, as you mentioned, there are 2 ways:

  1. down migrations based on file version id: 4 -> 3 -> 2 -> 1
  2. down migrations based on the state of the database (the sequence id): 3 -> 4 -> 2 -> 1

I went with option 2, because it made the most sense in terms of a linear progression of the database state. In other words, you're "undoing" migrations in terms of how they got applied.

Furthermore, option 1 poses another challenge. There is no guarantee in the system that a migration file has been applied yet, so when we migrate down do we silently skip migration files that are out-of-order and haven't been applied? At least with option 2, we're migrating down in the opposite order of the up migrations, which seems desirable (even though it's based on internal database state).

Maybe I got this implementation wrong, and down migrations should be based on option 1, where we use file versions as the source of truth.

callaingit commented 2 years ago

Thanks @mfridman for those explanations. Still not clear to me what should be done and the reset goose command does it in 4 -> 3 -> 2 -> 1 order from what I saw (which is also another point in our discussion, namely the reset = down-to 0 thing that seems widely accepted in general).

To me, up or down, the file sequence of migration files should prevail because one does not know in which order things could have been done at the end of the day when looking at migration files only and cannot assume, when writing scripts, anything else than file sequence of the migration files and this is the context from which I tried to understand --allow-missing.

Maybe it all really depends on the semantics of the --allow-missing and the exact use cases it can/was intended to cover. I think if the implementation remains the same, at least some usage warnings on the --allow-missing feature would be more than welcomed so we avoid cases it was not designed to cover, which I have obviously got into.

callaingit commented 2 years ago

Hi @mfridman, it just came to me, what if a goose reset-to feature was added to use the alternate downgrade path you mentioned as the other implementation option you had originally?

This way goose down and goose down-to would work the way they do now and goose reset would work keep its behavior too (which currently differ from down-to 0 in some cases from our discussion) and goose reset-to would work in the way reset does, without breaking ways are working now. The difference between goose reset (and reset-to) vs goose down (and down-to) could be explained in documentation (especially vs the use of--allow-missing) and there could also be a goose highest-version command added to show the highest migration sequence applied from migration sequence numbers instead of the last one applied regardless of its migration sequence number?

I am sure I don't understand all the inner workings of goose, but you may be better positioned in that regard to answer if this suggestion could it be a way out to satisfy all use cases from --allow-missing without breaking everything else we already have from goose?

callaingit commented 2 years ago

Also, for the benefit of all, here is roughly what happened in our case (forgive me the pseudo-SQL):

100.sql

-- +goose NO TRANSACTION

-- +goose Up

create table job_names (
  id char(8) character set ascii collate ascii_bin not null,
  is_suspended boolean not null default false,
  created_at datetime(3) not null,
  constraint pk_jn primary key (id)
);

-- +goose Down
drop table if exists job_names;

200.sql

-- +goose NO TRANSACTION

-- +goose Up
create table jobs (
  id bigint not null auto_increment,
  job_name_id char(8) character set ascii collate ascii_bin not null,
  created_at datetime(3) not null,
  job_status_id char(1) character set ascii collate ascii_bin,
  ended_at datetime(3),
  constraint pk_j primary key (id),
  constraint fk_jn foreign key (job_name_id) references job_names (id)
) auto_increment = 100000;

-- +goose Down
drop table if exists jobs;

now 150.sql was added and applied it with --allow-missing

-- +goose Up
insert into job_names (id, is_suspended, created_at) values ('JOBABC', false, now(3));

-- +goose Down
delete from job_names where id = 'JOBABC';

Later on, our application did add jobs rows with jobname 'JOBABC', which is 100% normal (else why have those tables) as migrations are downgraded when there is possibly data added at any time by apps on top of the DB.

Then we wanted to revert back to version 0 with goose down-to 0 (but we could have wanted to do goose down-to 100 instead from my understanding), the down script 150.sql was considered first and would not work of course, as rows into jobs with jobname = 'JOBABC' would fail to be deleted because of the FK from jobs to jobnames. Things would have worked if downgrade path was 200.sql, then 150.sql, then 100.sql were applied. We were surprised the downgrade path would start from 150 instead of 200 and that goose version reported 150 (the latest version applied, not the highest applied) and not 200 (200 was the highest version applied, not the latest applied)

From a practical standpoint, our scripts were perfectly fine when downgraded by their version number, even if data was added to the database at some point, but because of the order in which down-to was considering downgrading our scripts, it failed.

We had to use goose reset (which hopefully uses the correct migration script order from what we can see from a goose end-user perspective), which started downgrading from 200.sql, then 150.sql, then 100.sql, and this is what started up this whole issue for us. So this is how one example of how we can get into serious trouble when using --allow-missing combined with a later goose down-to command.

mfridman commented 2 years ago

This way goose down and goose down-to would work the way they do now and goose reset would work keep its behavior too (which currently differ from down-to 0 in some cases from our discussion) and goose reset-to would work in the way reset does, without breaking ways are working now

That's not a bad idea. We typically like to build tools that are simple and have one way of doing things. But the introduction of out-of-order migrations naturally introduces different ways of doing things as you've described above. For down migrations, I don't think there is a one-size-fits-all, because some may want to down based on the database order and others (like yourself) will want to down based on the file order.

As you mentioned, goose down-to 0 and goose reset are not equal when -allow-missing is used to apply up migrations. (historically they were because out-of-order migrations were never supported until fairly recently).

I'll put some thought into how this issue, although I'm a bit short on cycles at the moment so can't promise an immediate solution.

@callaingit, when down migrating, should goose care about migration files that haven't been applied? If yes, should goose skip that file silently (maybe log a warning) or should goose fail hard?

If no, I assume you'd just expect that down migration to be applied regardless?

The whole point of goose down|reset is to apply down migrations for migration files that have been applied. To re-iterate, if using the migration files, there is no guarantee those migrations have been applied, so I'm trying to figure out the expected behaviour.

callaingit commented 2 years ago

Hi @mfridman , I'am glad we can already have a discussion around all that and I understand there could be no immediate solution as this requires more than a bit of thinking and of course, coding and testing.

Seems like we're into some somewhat difficult decisions and questions on this topic because of the implications which are not always easy to foresee, but here is my 2 cents:

I feel like adding a goose reset-to in line with reset semantics won't break anything existing, which is a big plus vs changing any existing behavior without notice (unless it would have been wrong all the time). I also feel adding a goose highest-version in addition to goose version (which has actually semantics like a goose latest-applied-version) also won't break things around. Of course, this induces new things to undertstand, maintain and document, but if it helps cover some --allow-missing cases that the current behavior fails to address, it may be worth it.

My understanding is that if a migration file has not been applied, goose reset (or the goose reset-to hypothetical feature we're discussing here above) shall not consider it (maybe unless --allow-missing is specified, which I think would be in line with its --allow-missing semantics). Of course warnings to explain such skipping behavior when it occurs is probably a plus so one can understand what actually happenned from the logs. I may be wrong, but skipping a migration that has not been applied is probably ok as it should not require any undo of DB operations that actually never happened

Maybe it would be wise to get some extra feedback from other brains here before concluding on that, unless you get some extra clear insights when you'll spend dedicated time on that, but that is what I can think of when trying to answer your question about what to do with migration files not applied during a downgrade operation (within goose reset downgrade path semantics, not goose down downgrade path semantics).

mfridman commented 1 year ago

Thanks for all the context and fruitful discussion, and apologies for the delayed response, partially due to work and partially because I didn't have a better answer and wanted to "sleep on it" while gathering feedback.

After more research, thought (this issue has been on my mind since last September 😅 ) and chatting with folks, I came to the realization there won't be a single implementation that will satisfy all users.

I think one approach without adding additional commands would be to expose a set of flags that allows the caller to specify the "strategy", e.g., in the examples above you'd want to use something like --by-version whereas I'd like to use --by-id (default).

To use a concrete example, say we have already applied migrations 1,2,4,3

With goose down --by-version they will be migrated down 4,3,2,1 whereas --by-id (default) will be migrated down in reverse order they were originally applied 3,4,2,1

This only applies to migrations applied out-of-order. If the above flags are used by a project that does not allow missing migrations, then they'll effectively be a noop (because both id and version_id are guaranteed to be sequential).

The reset command would also take this flag and feed it to the --down-to 0 command, so they'll have parity.

That leaves the version command. If using the example above goose version --by-id will display 3 whereas --by-version will display 4.

There is also goose status, and there is probably some thinking around that command. TBD.