pgautoupgrade / docker-pgautoupgrade

A PostgreSQL Docker container that automatically upgrades your database
https://hub.docker.com/r/pgautoupgrade/pgautoupgrade
MIT License

Automatically do post upgrade steps (vacuum analyze, database reindexing) #50

Closed · justinclift closed this 1 day ago

justinclift commented 4 days ago

This is initial code for automatically doing post upgrade tasks (vacuum analyze to refresh query planner stats, reindex the databases) after a major version upgrade.

It's passing the initial testing I've been running locally, though I haven't tried anything too funky yet. The code itself is pretty straightforward, so it shouldn't be all that fragile. Hopefully. :smile:

justinclift commented 4 days ago

@andyundso It seems like we could potentially reindex the databases in a fairly simple way after upgrades using an approach like this.

What do you reckon? The code has been significantly reworked since originally writing this comment. :wink:

justinclift commented 4 days ago

Note that I'm unsure (so far) how to properly test this. Might need to ask on an appropriate PostgreSQL developers mailing list for ideas. :smile:

andyundso commented 4 days ago

I have a couple of thoughts; sorry in advance for the wall of text :sweat_smile:


How do we know what we need to do after an upgrade? I recently manually upgraded Postgres from v13 to v16 (the application is not containerized, so we also did not bother to containerize Postgres), and there pg_upgrade printed something like this:

Upgrade Complete
----------------
Optimizer statistics are not transferred by pg_upgrade.
Once you start the new server, consider running:
    /home/greg/pg/16/bin/vacuumdb --all --analyze-in-stages
Running this script will delete the old cluster's data files:
    ./delete_old_cluster.sh

I also read through https://www.crunchydata.com/blog/examining-postgres-upgrades-with-pg_upgrade and there they suggested that extensions sometimes also require their own upgrade.

Reindexing seems to be necessary only in certain cases, like you mentioned here.


I'm kind of wondering if we might want to take a different approach, and have the reindex operation run concurrently (i.e. REINDEX DATABASE CONCURRENTLY) after actually starting the database properly, so applications can use it in the meantime.

I would suggest we run CONCURRENTLY just to speed things up and prevent unnecessary blocking on the tables. I would implement it as follows:

  1. Run pg_upgrade if necessary, as currently implemented.
  2. Launch the PostgreSQL server.
  3. In a separate process, launch the reindex operation.

I know it goes against the Docker convention of having only one "main" process per container. I think it would be good to have the server up as soon as possible, but still run the reindex operation, hence the suggestion of a separate process.
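
To make the idea concrete, here is a minimal sketch of what that background step could look like (assuming the standard PostgreSQL client tools are on the PATH and POSTGRES_USER is set; the script name and details are illustrative, not the actual implementation in this PR):

#!/usr/bin/env bash
# Sketch only: wait for the freshly started server, then run the
# post-upgrade tasks without blocking normal connections.

until pg_isready --username "${POSTGRES_USER:-postgres}" >/dev/null 2>&1; do
    sleep 1
done

# Refresh query planner stats, then rebuild indexes concurrently
vacuumdb --all --analyze-in-stages --username "${POSTGRES_USER:-postgres}"
reindexdb --all --concurrently --username "${POSTGRES_USER:-postgres}"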


I am wondering if we should drop PGAUTO_REINDEX and just always execute these "post upgrade actions", or have a variable to opt out of this behaviour. I personally would like these post-upgrade actions to be executed automatically, as I do not want to bother with that, but other people might have different preferences.


Note that I'm unsure (so far) how to properly test this.

as an idea: maybe we can query PG from our test.sh about tasks that have run recently. we could define a timeout of 1 minute where we would expect that the REINDEX operation has run.

justinclift commented 4 days ago

Ugh, I totally forgot about the whole "vacuum analyse" thing afterwards. :facepalm:

Yeah, we'd better do that too.


... extensions sometimes also require their own upgrade.

:grimacing:

Well, I'm calling "edge case" on that and reckon we can look at PG extension upgrades at a later point. Let's get the fundamentals working ok first. :smile:


Reindexing seems to be necessary only in certain cases, like you mentioned https://github.com/pgautoupgrade/docker-pgautoupgrade/issues/10#issue-1845476330.

Yeah. While we could create a big multi-dimensional matrix to track the version upgrades where it is and isn't needed... I reckon that's more an optimisation for a (maybe) later date. :grin:


I would suggest we run CONCURRENTLY

Heh Heh Heh, I was afraid of that. :grin:

I guess we'd better do it properly rather than cheap+easy. :wink:

We'll probably need to figure out a reliable way of forking a background process to do the reindexing in the background when the main database process is started.

We can probably just launch some kind of bash script in the background that waits for the database to report it's available, then we run the post upgrade tasks and log the progress/errors somewhere from that.


I am wondering if we should drop PGAUTO_REINDEX and just always execute these "post upgrade actions".

After going through the implementation process a bit, I reckon you're right. That variable has been removed. :smile:


maybe we can query PG from our test.sh about tasks that have run recently.

Oh, that's a good idea. PostgreSQL also apparently has a progress counter for reindexes, available via a system view. At least, that's what the docs mention, though I've not looked at the details:

Each backend running REINDEX will report its progress in the pg_stat_progress_create_index
view. See Section 27.4.4.

This is the Section 27.4.4 it refers to.
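
As a quick illustration (a sketch, not part of this PR), something like this would show any reindex currently in flight:

psql --username "${POSTGRES_USER:-postgres}" -c \
    "SELECT pid, datname, command, phase, blocks_done, blocks_total
       FROM pg_stat_progress_create_index;"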

we could define a timeout of 1 minute where we would expect that the REINDEX operation has run.

Nah, probably not the right approach. Large databases can take ages to reindex (even hours), especially if it's done concurrently.

andyundso commented 4 days ago

We can probably just launch some kind of bash script in the background that waits for the database to report it's available, then we run the reindexing operations and log the progress/errors somewhere from that.

had the same thought. we can probably just do something like that in the entrypoint script.

bash /opt/reindex.sh &
exec "$@"

we could define a timeout of 1 minute where we would expect that the REINDEX operation has run.

Nah, probably not the right approach. Large databases can take ages to reindex (even hours), especially if it's done concurrently.

I meant a timeout of 1 minute when we test the container with test.sh. I agree that for real-world use cases we should not define a timeout.
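
Something along these lines could work in test.sh (a sketch; the check here only waits until no reindex is reported as in progress, so a real test would also want to confirm it actually ran, e.g. via the log output):

# Give the background post-upgrade tasks up to ~60 seconds in the test run
for i in $(seq 1 60); do
    REMAINING=$(psql --username postgres --tuples-only --no-align \
        -c "SELECT count(*) FROM pg_stat_progress_create_index;")
    if [ "${REMAINING}" = "0" ]; then
        echo "No reindex still in progress"
        break
    fi
    sleep 1
done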

justinclift commented 4 days ago

Cool, yeah good thoughts. I'll probably have a go at adding some of those pieces over the next few days if no-one else gets to it first.

That being said, if anyone else wants to try implementing bits first then please do. :grin:

justinclift commented 2 days ago

k, I think the post upgrade scripting is mostly done. Or at least to the point where it's suitable for looking over for any obvious problems. :smile:

One thing it doesn't yet handle is upgrading any extensions. We'll probably want to have a look at automating that in this PR too if it seems fairly straightforward (I haven't checked).
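
If it does turn out to be simple, the SQL building blocks are small. A sketch of what an automated check could run (the hstore name below is just an example):

psql --username "${POSTGRES_USER:-postgres}" <<'SQL'
-- Extensions whose installed version lags behind the packaged version
SELECT name, installed_version, default_version
  FROM pg_available_extensions
 WHERE installed_version IS NOT NULL
   AND installed_version <> default_version;
-- Upgrading one would then just be, for example:
-- ALTER EXTENSION hstore UPDATE;
SQL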

justinclift commented 2 days ago

As a data point, I'm manually testing this on my local development computer by first manually creating a PG 16 database directory locally:

$ mkdir postgres-data
$ docker run --rm --name pg -it \
  --mount type=bind,source=$(pwd)/postgres-data,target=/var/lib/postgresql/data \
  -e POSTGRES_PASSWORD=abc123 \
  postgres:16-alpine

Then running the locally built pgautoupgrade/pgautoupgrade:dev container on that same PG data directory:

$ docker run --rm --name pgauto -it \
  --mount type=bind,source=$(pwd)/postgres-data,target=/var/lib/postgresql/data \
  -e POSTGRES_PASSWORD=abc123 \
  pgautoupgrade/pgautoupgrade:dev

When the post upgrade script runs, its stdout and stderr output is captured into a file in the $PGDATA directory. For example, 2024.10.02-03.28-pgautoupgrade.log in the directory listing below:

$ sudo ls -la postgres-data
total 46
drwx------ 19 70 jc    27 Oct  2 13:30 .
drwxrwxr-x  7 jc jc    19 Oct  2 13:27 ..
-rw-r--r--  1 70 70 94211 Oct  2 13:28 2024.10.02-03.28-pgautoupgrade.log
drwx------  5 70 70     5 Oct  2 13:28 base
-rwx------  1 70 70    49 Oct  2 13:28 delete_old_cluster.sh
drwx------  2 70 70    65 Oct  2 13:30 global
drwx------  2 70 70     2 Oct  2 13:28 pg_commit_ts
drwx------  2 70 70     2 Oct  2 13:28 pg_dynshmem
-rw-------  1 70 70  5743 Oct  2 13:28 pg_hba.conf
[etc]

and:

$ sudo head postgres-data/2024.10.02-03.28-pgautoupgrade.log
----------------------------
Updating query planner stats
----------------------------
INFO:  vacuuming "postgres.pg_catalog.pg_statistic"
INFO:  finished vacuuming "postgres.pg_catalog.pg_statistic": index scans: 0
pages: 0 removed, 19 remain, 19 scanned (100.00% of total)
tuples: 0 removed, 410 remain, 0 are dead but not yet removable
removable cutoff: 753, which was 0 XIDs old when operation ended
new relfrozenxid: 753, which is 22 XIDs ahead of previous value
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
[etc]
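
For anyone curious, the capture itself is conceptually just a timestamped redirect, something like this sketch (not the exact script in the repo):

LOGFILE="${PGDATA}/$(date +'%Y.%m.%d-%H.%M')-pgautoupgrade.log"
{
    echo '----------------------------'
    echo 'Updating query planner stats'
    echo '----------------------------'
    vacuumdb --all --analyze-in-stages --verbose
    # ... reindexing and any other post-upgrade tasks ...
} > "${LOGFILE}" 2>&1
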
justinclift commented 2 days ago

If anyone's interested in testing this PR on their database, I've just pushed an image (x86_64 only) built from this PR to Docker Hub under the dev tag, i.e.:

image: pgautoupgrade/pgautoupgrade:dev

I'd expect it to work without issue. There shouldn't be any errors or anything along those lines. In theory. :smile:

justinclift commented 1 day ago

The one-shot testing took a bit more effort than I was expecting, but it should be workable for now.

Still need to properly look into the upgrading of extensions. Probably a tomorrow thing, as it's not something for tonight.

justinclift commented 1 day ago

I suppose we can close #10 after merging this?

Almost. Let's merge this PR for now, but investigate (and maybe implement?) the extension upgrade piece before closing #10.

I mean, it might turn out to be mostly straight forward. Hopefully. :pray:

justinclift commented 1 day ago

Uh oh. Right after merging this, I've realised that hard-coding the database name postgres in the post-upgrade script could break things for people who use (only) a different database name:

https://github.com/pgautoupgrade/docker-pgautoupgrade/blob/93c5fd86951fa2ccb57116bde01e6d5ce61aaf17/pgautoupgrade-postupgrade.sh#L40

I'll test locally whether using one of the template databases will work instead (pretty sure it will), then I'll create a PR to fix that.

justinclift commented 1 day ago

Instead of using a template database, I've just passed through the name of the database being used by the docker entrypoint script and re-used that. Theoretically (!) it should be fine. :grin:
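
In other words, rather than a hard-coded postgres, the post-upgrade script now gets told which database to connect to. A rough sketch of the idea, using the variable names from the official postgres image entrypoint:

# POSTGRES_DB falls back to POSTGRES_USER, then to "postgres", the same
# defaulting the official docker-entrypoint.sh uses
TARGET_DB="${POSTGRES_DB:-${POSTGRES_USER:-postgres}}"
psql --username "${POSTGRES_USER:-postgres}" --dbname "${TARGET_DB}" -c 'SELECT 1;'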