pressly / goose

A database migration tool. Supports SQL migrations and Go functions.
http://pressly.github.io/goose/

Ability to add repeatable SQL migrations #472

Open mcosta74 opened 1 year ago

mcosta74 commented 1 year ago

This is a feature request; the idea comes from this feature of the tern package.

In general, it would be nice to be able to reuse some SQL statements instead of copying them across multiple migration files.

The most common use case is when we need to modify existing views/functions and, in the "down" part, restore the previous version.

For instance, in shared/my_view_v1.sql and shared/my_view_v2.sql I can have two versions of the same view, and then I can create two migrations:

migrations/00001_create_my_view.sql

-- +goose Up
-- +include shared/my_view_v1.sql

-- +goose Down
DROP VIEW my_view;

migrations/00002_update_my_view.sql

-- +goose Up
-- +include shared/my_view_v2.sql

-- +goose Down
-- +include shared/my_view_v1.sql

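A minimal sketch of how such a directive could be expanded before goose parses the migration. The -- +include annotation is only proposed here, and everything below is illustrative, not part of goose:

package main

import (
    "fmt"
    "log"
    "os"
    "path/filepath"
    "strings"
)

// expandIncludes replaces each "-- +include <path>" line with the contents
// of the referenced file, resolved relative to baseDir.
func expandIncludes(baseDir string, src []byte) ([]byte, error) {
    var out strings.Builder
    for _, line := range strings.Split(string(src), "\n") {
        rest, ok := strings.CutPrefix(strings.TrimSpace(line), "-- +include ")
        if !ok {
            out.WriteString(line)
            out.WriteString("\n")
            continue
        }
        data, err := os.ReadFile(filepath.Join(baseDir, strings.TrimSpace(rest)))
        if err != nil {
            return nil, fmt.Errorf("include %q: %w", rest, err)
        }
        out.Write(data)
        out.WriteString("\n")
    }
    return []byte(out.String()), nil
}

func main() {
    src, err := os.ReadFile("migrations/00001_create_my_view.sql")
    if err != nil {
        log.Fatal(err)
    }
    expanded, err := expandIncludes(".", src)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Print(string(expanded))
}
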
mfridman commented 1 year ago

This sounds very reasonable. Most migration tools have the concept of "repeatable" migrations, which are quite useful for matviews and stored procedures.

The downside with goose is we don't store the checksum of the file (there is no database model for it), so we can't do nice things like "only apply this repeatable thing when it has changed".

The best we can do is apply all migrations and then iterate over the repeatable migrations.
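In other words, something like this sketch: run the versioned migrations with the existing goose.Up, then execute every repeatable file unconditionally. The repeat directory and the one-statement-per-file convention are assumptions, not current goose behavior:

package migrate

import (
    "context"
    "database/sql"
    "os"
    "path/filepath"

    "github.com/pressly/goose/v3"
)

// applyAll runs the versioned migrations first, then applies every
// repeatable migration unconditionally, in lexical filename order.
func applyAll(ctx context.Context, db *sql.DB, dir string) error {
    if err := goose.Up(db, dir); err != nil {
        return err
    }
    entries, err := os.ReadDir(filepath.Join(dir, "repeat"))
    if err != nil {
        return err
    }
    for _, e := range entries { // os.ReadDir returns entries sorted by name
        data, err := os.ReadFile(filepath.Join(dir, "repeat", e.Name()))
        if err != nil {
            return err
        }
        // Assumes one statement per file (e.g. CREATE OR REPLACE VIEW ...);
        // multi-statement files depend on the driver.
        if _, err := db.ExecContext(ctx, string(data)); err != nil {
            return err
        }
    }
    return nil
}
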

I'm a bit hesitant to add another annotation that points to a file, but it's definitely one approach we can explore.

What do you think if there was a way to tell goose about a "repeatable" directory, similar to how we point at a -dir for migrations, but instead we point to another directory that contains all the repeatable migrations?

mcosta74 commented 1 year ago

TBH, in other tools this is left to the migration's author, i.e. "if you reuse some SQL code, it's your responsibility to guarantee that you include the right code".

For me it would be more than enough just to allow reusing some SQL files, without any control over their content.

mcgaw commented 1 year ago

@mfridman I recently got burned when updating a stored procedure by not basing my change on the latest version. I notice flyway, for example, lets you have a repeatable migration, like you've alluded to. I think something along those lines would be a great addition to the tool.

It would also make merge requests much more readable, because at the moment you can't quickly see how a stored procedure has changed.

mfridman commented 1 year ago

Ugh, this issue is starting to come up over and over (for me and the various projects I work on). Having to re-create views every time the underlying models change is quite annoying; I'd like to add a +1 to get this feature implemented.

mcgaw commented 1 year ago

This is probably only sensible if some kind of hash of the file content is stored in the goose table, unless the hashes are computed on the fly to compare what's in the DB with the file contents. From what I can see, MySQL stores the contents of the CREATE PROCEDURE statement verbatim, but I doubt this would be the same for every DB.
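For MySQL, the on-the-fly comparison could look roughly like this (information_schema.ROUTINES is a real table; the helper names and the whitespace-collapsing convention are assumptions):

package repeatables

import (
    "context"
    "crypto/sha256"
    "database/sql"
    "encoding/hex"
    "errors"
    "os"
    "strings"
)

// procedureChanged reports whether the on-disk definition of a stored
// procedure differs from what MySQL currently holds. Note that
// ROUTINE_DEFINITION contains only the routine body, so the file is
// assumed to contain just the body as well.
func procedureChanged(ctx context.Context, db *sql.DB, name, file string) (bool, error) {
    var stored sql.NullString
    err := db.QueryRowContext(ctx,
        `SELECT ROUTINE_DEFINITION FROM information_schema.ROUTINES
         WHERE ROUTINE_SCHEMA = DATABASE() AND ROUTINE_NAME = ? AND ROUTINE_TYPE = 'PROCEDURE'`,
        name).Scan(&stored)
    if errors.Is(err, sql.ErrNoRows) {
        return true, nil // not present at all
    }
    if err != nil {
        return false, err
    }
    onDisk, err := os.ReadFile(file)
    if err != nil {
        return false, err
    }
    return hashNormalized(stored.String) != hashNormalized(string(onDisk)), nil
}

// hashNormalized collapses all whitespace to single spaces before hashing,
// since the server-side text is not byte-identical to the file.
func hashNormalized(s string) string {
    sum := sha256.Sum256([]byte(strings.Join(strings.Fields(s), " ")))
    return hex.EncodeToString(sum[:])
}
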

mfridman commented 11 months ago

One of these approaches may be reasonable. I'm not crazy about the template approach because it hides some of the details; I'd personally prefer to see "at a glance" which migrations are repeatable and have special meaning.

In either case, we need to inform goose that "these sets of files" are handled differently, and should:

  1. Always be applied (after migrations have run)
  2. In a particular order
  3. Not be versioned

For now, let's ignore the checksum and assume these will always be applied.


Option 1 - special file name

Allow a repeat_ prefix preceding the numeric component.

migrations/
├── 00001_users_table.sql
├── 00002_posts_table.sql
├── 00003_comments_table.sql
├── 00004_insert_data.sql
└── repeat_00001_posts_view.sql <---

The downside of this is that upstream tools that read the goose migrations (like sqlc-dev/sqlc) wouldn't know to treat these differently. Granted, these typically wouldn't contain schema changes, so it should be okay.
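Classifying filenames under Option 1 could be as simple as this sketch (the repeat_ prefix is only the proposed convention, not something goose recognizes today):

package migrate

import (
    "path/filepath"
    "sort"
    "strings"
)

// partition splits migration filenames into versioned and repeatable sets.
func partition(names []string) (versioned, repeatable []string) {
    for _, n := range names {
        if strings.HasPrefix(filepath.Base(n), "repeat_") {
            repeatable = append(repeatable, n)
            continue
        }
        versioned = append(versioned, n)
    }
    sort.Strings(repeatable) // applied in lexical (numeric-prefix) order
    return versioned, repeatable
}
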

Option 2 - special directory

Allow the user to create a separate directory with repeatable migrations.

migrations/
├── 00001_users_table.sql
├── 00002_posts_table.sql
├── 00003_comments_table.sql
├── 00004_insert_data.sql
└── repeat
    └── 00001_posts_view.sql <---

The downside of this approach is it'd require the user to supply 2 directories, e.g.,

goose up -dir=migrations -repeat-dir=migrations/repeat

Also, this would make constructing the provider more involved, e.g.,

goose.NewProvider(
  goose.DialectPostgres,
  db,
  os.DirFS("migrations"),
  goose.WithRepeat(os.DirFS("migrations/repeat")),
)

Option 3 - other?

Happy to hear suggestions for another implementation.

Other thoughts

We also need to think about how Go migrations could be made repeatable. I suspect we could make a Migration with a repeat bool to indicate it should fall into the repeatable bucket.

And then extend the GoFunc struct to also support repeat bool. E.g.,

NewGoMigration(
    1,
    &GoFunc{RunTx: runTx, Repeat: true}, // UP
    nil,                                 // DOWN
),
rquadling commented 5 months ago

I've been pointed here by https://github.com/pressly/goose/issues/725#issuecomment-2079906141.

FYI: I don't use Goose. My background is PHP, but the team I work with is actively migrating various projects from PHP to Go and so also moving from Phinx to Goose.

A quick background on what Phinx has and hasn't got.

Firstly, migrations. There are several questions that came out of this, and we used to have a "renumbering" process to realign the execution order and rollback order (until I fixed Phinx by adding an option for rollback to be reverse execution order rather than reverse creation order).

Secondly, seeds. Effectively ad-hoc tasks that don't fit any automatable timeframe. The devs put whatever they want/need in them and run them when they need to.

All of that is pretty much matching Goose.

The main thing here is that, in the main, migrations are being used to talk to an RDBMS (just to note, it is entirely feasible to use migrations and seeds and not have an RDBMS in sight! But generally ... RDBMS). Mainly to create the schema using terms like "CREATE TABLE", "ALTER TABLE", etc. Each statement is capable of altering the schema in a consistent, incremental, reversible, and repeatable manner. From my team, there is a convention that a migration must allow the existing deployed application to run without hindrance, and code requiring the migration is expected to be deployed after the completion of the migration. So much so, we made up a term! ProMiReD. You can Google that!

Unfortunately, to my knowledge, not all aspects of an RDBMS can be altered in the same manner.

Specifically, this involves views, triggers, stored procedures, user-defined functions, and events. There may be other types on other RDBMS systems, so do not take this as an exhaustive list.

If you are a sole developer, and are not using branches in your development, then this may all seem sort of wrong. As a sole developer with no branches, you have full control of all aspects of the RDBMS and so if you want to add a column to a view, you add the column to the view, run the appropriate SQL to replace the view, and carry on. That's fine.

But I, and many others, do not work like that.

Even with just 2 branches, a migration system is not the right tool to use. The main reason is that the RDBMS does not allow for the incremental adjustment of a view, stored procedure, etc. Instead, you have to fully replace the schema element.

By the time you've got 10 branches of code in development, each wanting to modify a view in some way, migrations have no mechanism for handling this.

And so I created a system called Repeatables.

This is how it works.

We treat views, stored procedures, etc., as version-controlled code in our VCS. Just like any other code, really. A directory called repeatables has sub-directories for each type of element (views, triggers, stored procedures, etc.), and it contains SQL files for each individual element, each one being the full SQL statement to effectively replace the element (well, sort of ... there is a utility that knows some things that mean we don't need EVERY statement).

So, just like any PHP/Go code needing to be reviewed, there is now also SQL.

Whereas a migration effectively says "add this line of code to the create-table schema for a table" (and executes the requirement), like a patch in a version control system, repeatables move the version control into git.

The reason is that if you want to alter a trigger on mysql, you cannot alter it. You can only drop and create. This is not the best way to have them, but that's what it is. It is also a good reason to not use triggers, and to have a decent framework (or at least an agreed-upon standard for the team) to allow the trigger to be moved from the RDBMS to app code in some way. But that's an entirely different discussion!

So, just like your project in git, repeatables need to be "released".

So we have a utility that does the following:

  1. It identifies, reads, and hashes (with all whitespace reduced to a single space) all the repeatables in the release checkout.
  2. It identifies, reads, and hashes (with all whitespace reduced to a single space) all the repeatables in the target RDBMS.
  3. It examines a table in the RDBMS that is used to record all the elements and their hashes.
  4. It compares all the hashes to identify what is inconsistent.
  5. For those that do not exist in the RDBMS, they are added. This involves simply executing the statement in the repeatables SQL file. For views, in mysql, the statement starts with CREATE OR REPLACE as mysql allows a view to exist/not exist and to be created/replaced in 1 call. Internally, does mysql do all the necessary locking, etc.? Well, that's not my area of expertise, but I think it is an atomic action and so zero downtime.
  6. The hashes are logged in the repeatables table in the RDBMS.
  7. For those that no longer exist in the checkout, the repeatable is destroyed/dropped/deleted from within the RDBMS.
  8. The hash is removed from the repeatables table in the RDBMS.
  9. And so we're now left with those that exist in the checkout AND the RDBMS.
  10. If the hashes do not agree, then the replacement of the repeatable is required; but as not all of the repeatables have the SQL syntax of CREATE OR REPLACE, only CREATE, the utility will first drop the repeatable and then run the SQL statement that's in the checkout. As DDL cannot be run inside a transaction, there is the potential that a trigger may not exist whilst it is being replaced, and so the potential for data inconsistency exists.
  11. Once the repeatable is replaced, the hash is updated in the repeatables table in the RDBMS.
  12. BUT ... there is an issue.
  13. What happens if one of the repeatables refers to another repeatable that is not yet present?
  14. Each execution is tracked and any error (in PHP, you can talk to the RDBMS and have an exception generated for failures/errors) is examined and if it relates to a "missing" element, that repeatable is added to a retry set.
  15. When the initial set has run through to completion, if there are any entries to retry, then the process is repeated.
  16. Each repeat keeps track of any required retries.
  17. If the retries remain the same, then they will never complete and so the only option is to quit and report back to the human responsible (normally this is handled by a human and so "on screen" is enough).
  18. If the process returns no retries required, then all the repeatables defined in the checkout are now in play. All old repeatables have been removed from the RDBMS, and any that existed in the checkout AND in the RDBMS are now updated to match the checkout.
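Steps 5 through 17 compress to a reconciliation loop along these lines. This is a sketch under assumptions: hashes are precomputed whitespace-collapsed digests keyed by element name (as in the earlier hashing sketch), and DROP VIEW stands in for type-aware drop logic:

package repeatables

import (
    "context"
    "database/sql"
    "fmt"
)

// reconcile creates what's missing, drops what's gone, replaces what
// changed, and retries statements that failed because a dependency did
// not exist yet. checkout and logged map element name to its hash;
// statements maps element name to its full SQL.
func reconcile(ctx context.Context, db *sql.DB, checkout, logged, statements map[string]string) error {
    var pending []string
    for name, h := range checkout {
        if logged[name] != h {
            pending = append(pending, name) // new or changed
        }
    }
    for name := range logged {
        if _, ok := checkout[name]; !ok {
            // Step 7: gone from the checkout, so drop from the DB.
            if _, err := db.ExecContext(ctx, "DROP VIEW IF EXISTS "+name); err != nil {
                return err
            }
        }
    }
    // Steps 13-17: retry until the pending set stops shrinking.
    for len(pending) > 0 {
        var retry []string
        for _, name := range pending {
            // Real code would drop first when only CREATE is supported (step 10),
            // and update the tracking table on success (step 11).
            if _, err := db.ExecContext(ctx, statements[name]); err != nil {
                retry = append(retry, name) // possibly a missing dependency
            }
        }
        if len(retry) == len(pending) {
            return fmt.Errorf("repeatables never converged: %v", retry)
        }
        pending = retry
    }
    return nil
}
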

You may be wondering why the removal of whitespace, the hashes, and a repeatables table in the RDBMS.

  1. The SQL you have in your checkout is not binary identical to the version that you can retrieve from the RDBMS. Reduction of the white space seems to have been consistent for mysql. Other platforms may require a LOT more work.
  2. Considering that the dropping and creating of triggers, stored procedures, and user-defined functions cannot be done atomically, the difference in the logged hashes provides a clear set of values to examine, so that only the altered repeatables are replaced (dropped and created), reducing the unnecessary replacement of unchanged repeatables and thus the potential for data loss.
  3. As it is possible for a human to run code on servers, logged hashes that do not match help identify whether someone has altered something directly; that inconsistency should stop any further adjustment, including a reset, as that may make things worse.
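For concreteness, the tracking table could be as small as the following (a sketch; the table and column names are made up, and the DDL is MySQL-flavored):

const createRepeatablesTable = `
CREATE TABLE IF NOT EXISTS repeatables (
    element_name VARCHAR(255) NOT NULL,  -- view/trigger/procedure name
    element_type VARCHAR(32)  NOT NULL,  -- 'view', 'trigger', ...
    content_hash CHAR(64)     NOT NULL,  -- whitespace-collapsed SHA-256
    applied_at   TIMESTAMP    NOT NULL DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (element_name, element_type)
);`
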

Our repeatables code also does a LOT of analysis and verification before the repeatable can be released. We have a "verify only" approach so that the commit pipeline can examine everything and ensure it is what the team wants. Whilst this is subjective, the consistency is a nice thing to have.

In reading through the other comments on this issue, I'd say the checksum table is pretty essential, simply because there's no atomic way to execute the drop and create of a repeatable without that gap; doing that for EVERY repeatable, when potentially none of them have been amended, is not the way to go. You will get inconsistent data in your database if you are using triggers. You'll get runtime errors if the stored procedure or user-defined function is not present.

I do not think that a migration is repeatable. Not in the sense of how Phinx and, I'm assuming, Goose currently operate.

Effectively, Phinx and Goose are VCS for RDBMS schema. Small "commits" to the schema. No different to PHP/Go code in git in this regard.

Repeatables simply do not fit into that paradigm. And so, whilst the utility known as Goose may eventually support repeatables, they should not be confused with migrations. Repeatables exist primarily because they have to exist to allow amendments to the RDBMS, given the lack of incremental/diff patching. The schema is "gittable" using migrations. The repeatables within an RDBMS haven't progressed much since FTP-era deployments of very, very old code onto running hardware, and you still have all the same issues as you'd have had back then.

I hope this all makes SOME sense! I know it may come over as a bit opinionated. If that's the case, just ignore me! I'm opinionated!

If there are any questions then please ask!