Problems when upgrading cluster

rcknr commented 5 months ago

Last week I did a major version upgrade with the playbook and encountered a few issues which I want to share.

1. During the upgrade, the maintenance_enable and maintenance_disable roles are used, but they are not symmetric. The enable role disables confd and deploys a temporary haproxy configuration that disables health checks, but it also stops the patroni cluster (it handles vip-manager tasks as well, but I don't use that). The disable role only deals with confd/haproxy/vip-manager, not patroni. These tasks are executed on the database nodes, while confd/haproxy are deployed to the balancers host group from the inventory. Therefore, when you run an upgrade, the playbook tries to stop confd on cluster nodes that don't have it, and fails.
2. After I initially deployed my cluster, I tried various settings to adjust my setup and at one point set an invalid value for the log_timezone parameter. I fixed that long ago, but during the upgrade patroni picked up this old config from somewhere and tried to start the new postgres version with the incorrect value, which caused a failure loop. I couldn't figure out where it was coming from for a while, but then I found a patroni.dynamic.json file in my data directory, which was used to generate the settings for the new version. I think the best course of action would be to start from the latest DCS config rather than from that file, which was somehow persisted in the data directory.

So item 1 definitely looks like a bug to me, while item 2 is mostly my own mistake and lack of understanding of patroni configuration but I think it should be highlighted so others are aware of that during the upgrade.
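
A stale value like the one in item 2 can be spotted before an upgrade by comparing the file persisted in the data directory with the configuration currently stored in the DCS. The following is only a minimal sketch, not part of the playbook: the function names are hypothetical, and it assumes that both configurations keep PostgreSQL settings under `postgresql.parameters` and that the DCS configuration has already been obtained as a dictionary (for example from the output of `patronictl show-config`).

```python
import json
from pathlib import Path


def load_persisted_config(data_dir):
    """Load the dynamic config that patroni persisted in the data directory."""
    path = Path(data_dir) / "patroni.dynamic.json"
    with path.open(encoding="utf-8") as fh:
        return json.load(fh)


def postgres_parameters(config):
    """Return the postgresql.parameters mapping of a config (empty if absent)."""
    return (config.get("postgresql") or {}).get("parameters") or {}


def diff_parameters(persisted, dcs):
    """Map each differing parameter to a (persisted_value, dcs_value) pair.

    None means the parameter is missing on that side.
    """
    old = postgres_parameters(persisted)
    new = postgres_parameters(dcs)
    return {
        name: (old.get(name), new.get(name))
        for name in sorted(set(old) | set(new))
        if old.get(name) != new.get(name)
    }


# Illustrative data only: a leftover invalid value versus the fixed DCS value.
persisted = {"postgresql": {"parameters": {"log_timezone": "Invalid/Zone"}}}
dcs = {"postgresql": {"parameters": {"log_timezone": "UTC"}}}
print(diff_parameters(persisted, dcs))
# prints {'log_timezone': ('Invalid/Zone', 'UTC')}
```

On a real node, `load_persisted_config` would be pointed at the PostgreSQL data directory. Any non-empty result means the persisted file differs from what the DCS currently holds and is worth reviewing before the upgrade.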

rcknr commented 5 months ago

Does it make sense to make the following changes?

1. Move stopping the patroni cluster from the maintenance_enable role to the stop_services role, leaving maintenance_enable and maintenance_disable to take care of confd/haproxy/vip-manager only.
2. Extract the maintenance_enable and maintenance_disable tasks from the "(5/6) UPGRADE: Upgrade PostgreSQL" group, so that they are executed before and after it, respectively, on the balancers hosts.

If that's fine, I can produce a PR.

vitabaks commented 5 months ago

If the tasks are performed on the wrong nodes, then this is a mistake; they must be performed on the appropriate host groups, or use delegate_to.

I'll take a look at it later.

vitabaks commented 3 months ago

Fixed here https://github.com/vitabaks/postgresql_cluster/pull/699

vitabaks / postgresql_cluster

Problems when upgrading cluster #666