pgautoupgrade / docker-pgautoupgrade

A PostgreSQL Docker container that automatically upgrades your database
https://hub.docker.com/r/pgautoupgrade/pgautoupgrade
MIT License

DB fails to start after upgrade is interrupted #42

Closed miguelhar closed 5 days ago

miguelhar commented 1 week ago

We are using Image pgautoupgrade/pgautoupgrade:14-dev

We are seeing an issue on larger DBs where the init container running this image takes longer than the default thresholds and gets terminated. After the restart, the DB fails to start with:

postgresql 11:42:50.%2N INFO  ==> ** Starting PostgreSQL **
2024-08-15 11:42:50.466 GMT [1] LOG:  skipping missing configuration file "/bitnami/postgresql/data/postgresql.auto.conf"
2024-08-15 11:42:50.466 GMT [1] FATAL:  "/bitnami/postgresql/data" is not a valid data directory
2024-08-15 11:42:50.466 GMT [1] DETAIL:  File "/bitnami/postgresql/data/PG_VERSION" is missing.

It seems that after the upgrade is interrupted it is not retried, leaving the DB in a broken state.

Aside from increasing the liveness probe timeout/failure threshold for the init container running this image, is there something that could be set so that the upgrade is retried?

justinclift commented 1 week ago

@miguelhar How big is your database, and how long is the default timeout on the platform you're using?

Thinking things through a bit, my initial thought is that retrying the upgrade without increasing the timeout value is pretty unlikely to succeed. The upgrade scripting itself is mostly a case of:

  1. Move things into a directory structure that pg_upgrade (the official PostgreSQL cli tool for upgrading) can work with, then
  2. Run the pg_upgrade cli utility

While there are some sanity checks done first (i.e. here), the actual upgrade piece is done using pg_upgrade.
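
For anyone picturing that step, here's a rough, hypothetical sketch of the kind of pg_upgrade call involved. The paths, version numbers, and the `--link` choice are placeholders for illustration, not the image's actual upgrade script:

```sh
# Illustrative only -- not the container's real upgrade script.
# Binary and data directory paths below are placeholders.
pg_upgrade \
  --old-bindir=/usr/lib/postgresql/13/bin \
  --new-bindir=/usr/lib/postgresql/14/bin \
  --old-datadir=/var/lib/postgresql/13/data \
  --new-datadir=/var/lib/postgresql/14/data \
  --link
```

(`--link` hard-links the old cluster's files into the new one instead of copying them, which is faster but means the old and new data directories share files.)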

If pg_upgrade has been terminated part way through... I kind of doubt it'll be able to happily continue on and complete what wasn't done in a previous run.


Long story short, I reckon you'll need to restore your database files back to how they were before the upgrade, then try the upgrade again with a longer timeout threshold.

Sorry I don't have better news nor suggestions @miguelhar. :frowning:

miguelhar commented 6 days ago

@justinclift thank you for your response. The size varies, but increasing the livenessProbe/failureThreshold to 20 seems to do the trick.
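
For anyone else landing here: failureThreshold is a standard field on Kubernetes container probes. A hedged sketch of bumping it on an existing workload follows; the StatefulSet name, container index, and exact JSON path are placeholders and will depend on the chart/manifests in use:

```sh
# Hypothetical example -- adjust the workload name and the JSON path to
# wherever the probe is actually defined in your rendered manifests.
kubectl patch statefulset my-postgresql --type=json -p '[
  {"op": "replace",
   "path": "/spec/template/spec/containers/0/livenessProbe/failureThreshold",
   "value": 20}
]'
```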

justinclift commented 5 days ago

Awesome. I bet that took a bunch of time and effort to work through and get happening.

Those two terms (livenessProbe, failureThreshold) show up as Kubernetes-related things. Are you managing Kubernetes yourself, or are you using the offering from one of the big vendors?

Asking because it might be useful for future people to directly mention the relevant vendor here, so others that hit the same issue can see whether it relates to them. :smile: