Function source normalization (pg_dump)

The Codd representation of a PostgreSQL function/procedure includes the MD5 hash of the source code of the function body when the function is implemented in SQL or PL/pgSQL (reference). This function body source code is a string that includes the exact whitespace used during creation. This whitespace is collapsed when the database is dumped using pg_dump, however, causing a difference in representation.

This issue can be avoided by not using pg_dump schema dumps. For example, dumps are often used when upgrading PostgreSQL. An alternative is to initialize the schema on the new version from the migrations (using Codd) and to only use pg_dump to migrate the data. Such schema dumps are the de facto standard way to save a snapshot of a database (schema), however, so being able to compare them would be nice.

This issue could be resolved by normalizing the function body source code. This requires parsing the source code. (Doing this correctly requires parsing of nested dollar-quoted strings using identifiers, using an FSM augmented with an identifier stack.) I have not tried implementing this, but my current idea of how to do so it to change the representation to include the full source code. An option could then be used to determine how that source code is compared (equality by default, normalized when requested). Would including the full source code in the representation be problematic?

FYI: I am currently working on a checkpoint migration, and I hacked some options into the codd verify-schema command. The code is pushed to the verify-schema-options branch of the BeFunctional/codd fork. Please see the commit message for details.

As detailed in the commit message, a primary goal of this commit was to minimize change of existing code. It is not intended to be merged, since a much more elegant solution could be made if existing code is refactored.

The --ignore-fun-def option is a temporary measure that is not needed if we can resolve this issue. The --ignore-col-ord would be useful for verifying checkpoints in general, though. Would you be interested in adding support fur such verify options?

I am not able to reproduce the issue with whitespace in functions. I created this function:

CREATE FUNCTION test_function_with_whitespace() RETURNS INT AS $$
    BEGIN
        SELECT 1;

        -- White space

              -- More white space

    END
$$ LANGUAGE plpgsql;

Then I added this as a migration, ran pg_dump and copied the CREATE FUNCTION statement from it, manually dropped and recreated the function with that statement from pg_dump and ran codd verify-schema. It passed. In fact, I've diffed the function bodies from pg_dump and the original migration and they're exactly the same.

I'm using postgresql 15.2 and up-to-date psql.

The --ignore-col-ord would be useful for verifying checkpoints in general, though

From the name of the option, am I right to think this is the same as setting the env var CODD_SCHEMA_ALGO=ignore-column-order (docs here)?

This issue could be resolved by normalizing the function body source code. This requires parsing the source code. (Doing this correctly requires parsing of nested dollar-quoted strings using identifiers, using an FSM augmented with an identifier stack.)

Codd has a SQL parser that understand quotes and dollar quotes (it is necessary mostly due to no-txn migrations). It is a "simple" parser, in the sense that it only understands a few trivial statements, COPY FROM STDIN, white space, comments and other than that it only knows statement boundaries.

So maybe we could fetch entire SQL bodies and hash them after stripping them of white space with codd's sql parser. But I'd be nice for me to reproduce the issue first.

I am not able to reproduce the issue with whitespace in functions.

Interesting! I confirmed this using PostgreSQL 15.2, and I am happy to see that this was changed!

I am currently using PostgreSQL 11, which has the problematic behavior that I described.

From the name of the option, am I right to think this is the same as setting the env var CODD_SCHEMA_ALGO=ignore-column-order (docs here)?

That environment variable setting changes the representations, while my --ignore-col-ord hack instead changes how representations are compared. I want to verify column order in general but be able to disregard it (only) when verifying checkpoint migrations that optimize alignment.

BTW, the linked documentation says "in some cases that is just too hard to do right for everyone." This is so true! :smile:

I wrote:

(Doing this correctly requires parsing of nested dollar-quoted strings using identifiers, using an FSM augmented with an identifier stack.)

Quick correction! I have since implemented this and found that nested dollar-quoted strings are not checked; PostgreSQL just searches for the closing tag that matches the opening tag without parsing the tags of any nested dollar-quoted strings. A stack is therefore not required.

Codd has a SQL parser that understand quotes and dollar quotes (it is necessary mostly due to no-txn migrations). It is a "simple" parser, in the sense that it only understands a few trivial statements, COPY FROM STDIN, white space, comments and other than that it only knows statement boundaries.

IMHO, it is really unfortunate that we have to parse SQL. Covering the simple cases is easy, but there are many non-simple cases. I understand that no-txn migrations are required to create a database, drop a database, or alter system configuration, but these are all things that I prefer to do externally. Are there other cases when no-txn is necessary?

So maybe we could fetch entire SQL bodies and hash them after stripping them of white space with codd's sql parser. But I'd be nice for me to reproduce the issue first.

PostgreSQL 11 should be able to reproduce the issue.

Would storing the full source be problematic? Is a hash used just to keep the size down?

I am currently using PostgreSQL 11, which has the problematic behavior that I described.

I see. I'll try to find the most recent version of postgres where this still happens.

That environment variable setting changes the representations, while my --ignore-col-ord hack instead changes how representations are compared. I want to verify column order in general but be able to disregard it (only) when verifying checkpoint migrations that optimize alignment.

Ah, interesting! I wonder if we can think of some sensible code-like structure that specifies how things should be compared? I fear the proliferation of options. Although.. having some file in some format to specify this also feels icky; maybe having a ton of options that customize the comparison algo is better?

IMHO, it is really unfortunate that we have to parse SQL. Covering the simple cases is easy, but there are many non-simple cases. I understand that no-txn migrations are required to create a database, drop a database, or alter system configuration, but these are all things that I prefer to do externally. Are there other cases when no-txn is necessary?

It really is unfortunate, and I actually made a gross simplification saying it's only due to no-txn migrations. It is used for much more, including the ability to apply COPY FROM STDIN as it's not just a regular statement - it's a different postgresql-simple function and a different part of the protocol with postgres IIUC. It's also used to read migrations from disk in streaming fashion, so that codd can handle arbitrarily large sql migrations with bounded memory usage. Plus, no-txn migrations are necessary for some more "trivial" things like adding values to an enum and using those new values in statements.

There's a note in the code about it from a while ago: https://github.com/mzabani/codd/blob/0b31d04ad2b0a04ef5d664bc6608b1dfdbe8f3d0/src/Codd/Internal/MultiQueryStatement.hs#L35-L44

Would storing the full source be problematic? Is a hash used just to keep the size down?

Yeah, my thinking is the has is just used to keep the size down. It wouldn't be really problematic to store the full contents (I think in some places where I forgot to sprinkle md5(value) it still happens), but seems like it's unnecessary? IIUC we can fetch the full contents from the database, parse then in memory and strip whitespace off, then hash in memory. We can in fact do this only for affected versions of postgresql, so that we can keep working versions intact.

Ah, interesting! I wonder if we can think of some sensible code-like structure that specifies how things should be compared? I fear the proliferation of options. Although.. having some file in some format to specify this also feels icky; maybe having a ton of options that customize the comparison algo is better?

Hmmm, I do not foresee too many options, but perhaps I am being naïve.

Assuming that no comparison options will need to be parameterized, one UI technique is to represent them using strings and not expose them as separate optparse-applicative options. A generic --options/-o OPTION option can be used to parse multiple options or (alternatively) a single option could be used to represent a list (using CSV syntax for example). Documentation could be included in --help output while small, and it can be moved to a separate command if it gets large.

$ codd show-comparison-options
...

ignore-column-order
  Ignore the order of columns in tables.  This is useful when optimizing
  alignment.

...
$ codd verify-schema -o ignore-column-order,ingore-function-source
error: unknown comparison option: ingore-function-source
See `codd show-comparison-options` for a list of supported comparison options.

In my hack, I implemented the functionality by first comparing representations using equality and then filtering the result using an equality predicate that is created based on options, when specified. The diffDbRep function compares using equality, and note that the differences are annotated with the ObjectRep. The filterDiff function filters using an equality predicate. The combinePredicates function combines any number of equality predicates into one, making it easy to create an equality predicate based on options. (There are additional notes in the commit message.)

It really is unfortunate, and I actually made a gross simplification saying it's only due to no-txn migrations. It is used for much more, including the ability to apply COPY FROM STDIN as it's not just a regular statement - it's a different postgresql-simple function and a different part of the protocol with postgres IIUC. It's also used to read migrations from disk in streaming fashion, so that codd can handle arbitrarily large sql migrations with bounded memory usage. Plus, no-txn migrations are necessary for some more "trivial" things like adding values to an enum and using those new values in statements.

That is more complicated that I had thought... Thank you for the details!

This transaction handling is also unfortunate because of rollback. When a migration is split into multiple transactions and a non-first transaction fails, then the resulting state represents partial application of the migration. I imagine that recovering from such an issue could require writing special migrations. It would be very dangerous if this happens in production if the resulting state is incompatible with the clients! I do not think that this is even viable in critical environments. In that case, one should split such migrations up into separate steps so that each step runs in a single transaction. The necessary inverses can then be developed and tested for all failure cases.

This also complicates the schema consistency check. If that check is done as part of the last transaction of a series of migrations, and consistency fails, then rollback does not necessary rollback the whole series of migrations, as described above.

Thinking aloud about my use case (not necessarily making suggestions for Codd):

I do not use COPY commands in migrations, so I do not need to worry about them.
Without COPY data, perhaps I do not need to worry about streaming migration source...
If each migration (step) is run in a separate transaction, then perhaps things like adding enum values can simply be done in a separate migration/step so that they can be used in subsequent migrations/steps.
If I can enforce that each migration (step) is run within a single transaction, then I can avoid the mid-migration state issue.

There's a note in the code about it from a while ago: ...

Thank you for pointing that out! I did not know that some PostgreSQL statements (such as CREATE RULE) use semicolons. My sqlCollapseWhitespace function collapses whitespace after statements to a newline, so I updated the implementation to take this into account.

I have not worked through the Codd parser, but here are a few things that I noticed are not supported when briefly looking over it:

C-style comments can be nested, so counting is required for that as well. (Standard SQL)

/* C-style comment */

/* /* Nested */ C-style comment */

/* Non-terminated /* C-style comment */

Escape strings are not supported. (PostgreSQL extension)
```
SELECT E'''20s' = E'\'20s';
```

Yeah, my thinking is the has is just used to keep the size down. It wouldn't be really problematic to store the full contents (I think in some places where I forgot to sprinkle md5(value) it still happens), but seems like it's unnecessary?

The benefit of storing the full source is that the software would then be able to show exactly how two functions differ, not just that they differ. It is just a convenience, to save users the need to do so manually.

IIUC we can fetch the full contents from the database, parse then in memory and strip whitespace off, then hash in memory.

Yes, the Value loaded from the database can be transformed before it is used.

My current implementation is in Collapse SQL Whitespace: Part 2, and additional notes (and an approximate state machine diagram) can be found the previous version in Collapse SQL Whitespace. Note that I use String in this implementation because I plan to offer to implement the sql quasiquoter using it, after any issues are resolved.

We can in fact do this only for affected versions of postgresql, so that we can keep working versions intact.

In this case, upgrading PostgreSQL would make it look like the function source differs even when it does not. This is not a significant issue, though.

Hmmm, I do not foresee too many options, but perhaps I am being naïve.

I ran into a curious case this week at work, where I could conceivably want different index fillfactors for different environments to fine-tune the balance between having a small test database that still allows index keys of relatively close values to be stored on different index pages (this is required to test invariants related to the concurrency of serializable transactions), but having a Production environment where indexes aren't too spread out. For simplicity, I still want to keep a single folder with the expected schema of Production, and just ignore this setting when creating the test database with codd. I guess this is the kind of convoluted requirement that appears in projects as they grow, but driven by (what seems to be) a real necessity.

But I agree with you these things sound hard to come by. Still, I think there's a way to be flexible enough to accommodate this kind of thing, which is what I address next:

Assuming that no comparison options will need to be parameterized, one UI technique is to represent them using strings and not expose them as separate optparse-applicative options. A generic --options/-o OPTION option can be used to parse multiple options or (alternatively) a single option could be used to represent a list (using CSV syntax for example). Documentation could be included in --help output while small, and it can be moved to a separate command if it gets large.

I've thought of something that might be sufficiently generic but still relatively simple. The downside is codd must keep its disk layout and representations stable.

To ignore column order, for instance, It'd something like codd [up|verify-schema] --ignore "schemas/*/tables/*/cols/*:.order. The part before the colon references one or more files and the part after it is a jq-esque filter to choose what to ignore. Like you suggested, this option could be used multiple times (the comma separation might be tricky for this one).

This transaction handling is also unfortunate because of rollback

I'm probably confused, but I think I don't understand how transaction handling or the parser relate to that? For what it's worth, somewhere in codd's docs it's written to be very careful with no-txn migrations. I think the implicit ordering of migrations that codd facilitates (i.e. since each developer runs codd add their-migration.sql separately and "forgets" about how this will be handled) is to blame here: in the presence of no-txn migrations, thinking about order of application becomes very important.

Your idea of a DAG of schema states forces users to think of that, but there is still the question of "which migrations are pending for environment Staging/Prod/X?", which I think can only be answered by a good CI pipeline that knows this and triggers warnings in the presence of no-txn migrations, or something along those lines.

I have not worked through the Codd parser, but here are a few things that I noticed are not supported when briefly looking over it:

Yeah, and ouch, and thanks for reminding me of those things. I should probably make them my top priority.

I did not know that some PostgreSQL statements (such as CREATE RULE) use semicolons

Oh my, neither did I! Once again, thanks!

Longer term, it'd probably be better to take that note I wrote seriously and use the upstream parser rules somehow.

The benefit of storing the full source is that the software would then be able to show exactly how two functions differ, not just that they differ. It is just a convenience, to save users the need to do so manually.

Indeed, very good point. This reminds me codd should really be showing a JSON diff when schemas mismatch, for which I'll create an issue.

In this case, upgrading PostgreSQL would make it look like the function source differs even when it does not. This is not a significant issue, though.

I just ran into an interesting case where postgresql replaces trim with btrim when upgrading between major versions with dumps. I agree with you this is not significant, and I'd rather not make dump+restore=identity an invariant across major versions, but definitely would be nice to make it for the same version of postgresql (which this GH issue relates to).

I forgot to mention, codd has a provision for when it errors out parsing SQL: you can add -- codd: no-parse to the top of your migrations (the parser will still be used to detect such comments), but once that's detected the entire migration is read from disk into memory, and sent to the server in a single libpq statement. This is not documented explicitly but appears in an error message if the parser ever errors out.

This is of course different from codd's parser parsing statement boundaries incorrectly, which I hope to address in a newly created issue, https://github.com/mzabani/codd/issues/153

I have pushed a draft to ensure that pg_dump -> restore is an identity wrt codd schemas, and the test function with empty spaces is there and passes, even with postgres 11 (it won't pass in CI because I haven't populated the Nix cache, but it does pass locally). Here's the tested migration:

https://github.com/mzabani/codd/pull/157/files#diff-afc1e9a53917d6ecd225f177a07bb1f8391101bc3958246e32ea3f1fb301e872R146-R158

I wonder if the problem is with some specific minor version of postgresql and/or pg_dump? The one used in codd's Nix environment is postgresql-11.20.

I merged the PR with the pg_dump + restore = identity test given the utility of such a test, but I'm not marking this as fixed until we get to the bottom of the issue and find a way to make things work nicely.

I'm taking the liberty of closing this issue for a few reasons:

There is https://github.com/mzabani/codd/issues/167 with a solution to the problem of relaxing consistency equality checks, and it solves all the cases I could find in this thread
Some of the parsing bugs pointed out here have been fixed a while ago
I just merged a PR that drops support for PG 11 and earlier versions, where the pg_dump bug here was reproduced.

But please reopen it (or open a new one, up to you!) if you can reproduce it in pg >= v12.

mzabani / codd

Function source normalization (pg_dump) #139