konveyor / tackle-pathfinder

Tackle Pathfinder application
Apache License 2.0

Pathfinder prevents the PostgreSQL container from shutting down gracefully. #323

Open jmontleon opened 1 year ago

jmontleon commented 1 year ago

If the postgres container needs to be shut down for any reason (for example, a node being drained for maintenance or an upgrade), Pathfinder appears to prevent postgres from shutting down gracefully. We always run up to the grace period timeout on the container and postgres gets killed, which risks corruption or data loss.

After some investigation, it's my understanding that the container runtime for Kubernetes/OpenShift sends SIGTERM to stop the process: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-termination

When it receives SIGTERM, PostgreSQL waits indefinitely for all connections to terminate before shutting down: https://www.postgresql.org/docs/current/server-shutdown.html

We're not setting a grace period on the PostgreSQL pod, so we're getting the default, terminationGracePeriodSeconds: 30.

Meanwhile the quarkus.datasource.jdbc.idle-removal-interval default is 5m, so idle pooled connections can stay open far longer than the 30-second grace period.

I raised the grace period to about 120s and reduced the idle-removal-interval to 60 seconds using the environment variable, and the DB has so far stopped cleanly after about 80-90 seconds each time. I think we need to agree on a reasonable pair of values, with the grace period a fair bit larger than the removal interval. I'm looking for input on what would be acceptable on the JDBC side.
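For reference, the two settings tried above could be sketched roughly as follows. This is only an illustration: the helper `to_env_name` and the dictionary names are hypothetical, the name mapping assumes the usual MicroProfile Config rule (non-alphanumeric characters become underscores, then the name is uppercased), and the values are simply the ones from this comment, not a recommendation.

```python
import re


def to_env_name(prop: str) -> str:
    """Map a config property name to its environment variable form.

    Assumes the MicroProfile Config convention: every character that is
    not a letter, digit, or underscore becomes '_', then uppercase.
    """
    return re.sub(r"[^A-Za-z0-9_]", "_", prop).upper()


# Environment for the Pathfinder container (hypothetical fragment).
# "PT60S" is the ISO-8601 duration for 60 seconds.
pathfinder_env = {
    to_env_name("quarkus.datasource.jdbc.idle-removal-interval"): "PT60S",
}

# Pod-level setting for the PostgreSQL pod (hypothetical fragment).
postgres_pod_spec = {
    "terminationGracePeriodSeconds": 120,
}

if __name__ == "__main__":
    print(pathfinder_env)
    print(postgres_pod_spec)
```

The two values live in different workloads: the idle removal interval is set on Pathfinder, while the grace period belongs to the PostgreSQL pod.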

I'm also curious whether we have any means to keep the use of any single connection in the pool fairly short, as I think the grace period needs to exceed non-idle time + idle time with some reasonable duration to spare.
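The timing constraint described above can be written down as a small check. This is a simplified sketch: the function names and the 15-second default margin are hypothetical, and it assumes a connection is closed no later than one idle removal interval after it stops being used, which may not match the pool's exact behaviour.

```python
def min_grace_period(max_busy_s: float, idle_removal_s: float,
                     margin_s: float = 15.0) -> float:
    """Smallest grace period (seconds) that covers the longest time a
    connection is in use, plus the idle removal interval, plus a margin."""
    return max_busy_s + idle_removal_s + margin_s


def grace_period_is_sufficient(grace_s: float, max_busy_s: float,
                               idle_removal_s: float,
                               margin_s: float = 15.0) -> bool:
    """True if the grace period exceeds non-idle time + idle time + margin."""
    return grace_s >= min_grace_period(max_busy_s, idle_removal_s, margin_s)


if __name__ == "__main__":
    # Defaults from this issue: 30s grace period, 5m idle removal interval.
    print(grace_period_is_sufficient(30, max_busy_s=10, idle_removal_s=300))
    # Values tried in this issue: 120s grace period, 60s idle removal interval.
    print(grace_period_is_sufficient(120, max_busy_s=10, idle_removal_s=60))
```

The 10-second busy time is a placeholder. The real bound depends on how long Pathfinder holds a connection for a single request.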

jmontleon commented 1 year ago

Any other approach that allows the DB to shut down cleanly would also be welcome.

PhilipCattanach commented 1 year ago

I have asked @m-brophy to look into this. Thanks.