pressly / goose

A database migration tool. Supports SQL migrations and Go functions.
http://pressly.github.io/goose/
Other
7.15k stars 523 forks source link

"NO TRANSACTION" migrations should mark database as dirty if not successful #728

Open tumelowill opened 8 months ago

tumelowill commented 8 months ago

Migrations that do not have a transaction will not be rolled back. We should mark the database as dirty if the migration failed. Migrations on dirty databases should not be allowed.

Example:

-- 1_create_posts_table.sql
-- +goose Up
CREATE TABLE posts (
  id SERIAL PRIMARY KEY,
  title VARCHAR(255),
  content TEXT,
  author_id INTEGER
);
-- 2_update_posts.sql
-- +goose NO TRANSACTION
-- +goose Up
INSERT INTO posts (title, content, author_id) VALUES ('Hello World', 'This is my first post!', 1);
CREATE INDEX CONCURRENTLY idx_created_at ON posts (created_at); -- This fails as the field created_at does not exist on the posts table

If we have a policy of running migrations when the database schema is outdated, we are now filling the posts table with inserts on each migration run as we have no idea whether the database is dirty.

What we would like to see after attempting to migrate multiple times:

postgres=# select * from public.posts;
 id |    title    |        content         | author_id 
----+-------------+------------------------+-----------
  1 | Hello World | This is my first post! |         1
(1 row)

postgres=# select * from public.goose_db_version;
 id | version_id | is_applied |           tstamp           
----+------------+------------+----------------------------
  1 |          0 | t          | 2024-03-21 14:33:00.782058
  2 |          1 | f          | 2024-03-21 15:08:48.144995
(2 rows)

What we actually see after attempting to migrate multiple times:

postgres=# select * from public.posts;
 id |    title    |        content         | author_id 
----+-------------+------------------------+-----------
  1 | Hello World | This is my first post! |         1
  2 | Hello World | This is my first post! |         1
  3 | Hello World | This is my first post! |         1
  4 | Hello World | This is my first post! |         1
  5 | Hello World | This is my first post! |         1
(5 rows)

postgres=# select * from public.goose_db_version;
 id | version_id | is_applied |           tstamp           
----+------------+------------+----------------------------
  1 |          0 | t          | 2024-03-21 14:33:00.782058
(1 row)
mfridman commented 7 months ago

That's correct, this is working as intended. However, I can see this may not be suitable for some applications.

I've used several tools that leave the database in a "dirty" state and when you're applying migrations on dozens, or even hundreds of systems recovering becomes very challenging.

When we do have to use +goose NO TRANSACTION we typically isolate those changes to a single file and only down to the statements that truly must run outside a transaction. In practice, for most databases, the number of statements that need to run outside a transaction is very low. So low, that I'd rather not add additional complexity and overhead to handle these edge cases.

tumelowill commented 7 months ago

Interesting thanks. I'm curious how you think we should handle a failed long running migration that involves batch inserting data from a large table into another. Using the Goose's API, I don't think we have a way of knowing whether this migration was previously attempted and cannot stop ourselves from duplicating the work that we had already done. Of course this is something we can write tooling for ourselves to guard against, but it could be nice to have this given to us from Goose directly.

mfridman commented 7 months ago

I don't have a good answer for this at the moment.

The times I've had to run migrations that lasted a few hours, we treated those as exceptions, i.e., apply all the migrations up to that point and then run the slow migration in isolation (monitoring it more closely).

When possible, we wrote those migrations to be idempotent, so that if they were interrupted, they could be run again without causing any issues. But obviously, that's not always possible.

In other places, I've resorted to doing "data migrations" in out-of-band jobs without goose, e.g., marking a feature as "no ready" until some data migration or backfill is complete. But the downside is that you must write your own logic to track this.

I'd like to keep this issue open for now and do a bit more thinking. Fwiw the is_applied column in goose is a legacy thing that tracked down migrations, but we decided to remove that (initial https://github.com/pressly/goose/issues/121#issuecomment-434717659)