vitabaks / postgresql_cluster

PostgreSQL High-Availability Cluster (based on "Patroni" and DCS "etcd" or "consul"). Automating with Ansible.
MIT License

Problems when upgrading cluster #666

Closed: rcknr closed this issue 1 week ago

rcknr commented 2 months ago

Last week I did a major version upgrade with the playbook and encountered a few issues which I want to share.

  1. During the upgrade, the maintenance_enable and maintenance_disable roles are used, but they are not symmetric. The enable role disables confd and deploys a temporary haproxy configuration that turns off health checks, and it also stops the patroni cluster (it handles vip-manager tasks as well, but I don't use that). The disable role only deals with confd/haproxy/vip-manager, not patroni. These tasks are executed on the database nodes, while confd/haproxy are deployed to the balancers host group from the inventory. As a result, when you run an upgrade, the playbook tries to stop confd on cluster nodes that don't have it, and fails.
  2. After I initially deployed my cluster, I tried various settings to adjust my setup and at one point set an invalid value for the log_timezone parameter. I fixed that long ago, but during the upgrade patroni picked up this old config from somewhere and tried to start the new postgres version with the incorrect value, which caused a failure loop. For a while I couldn't figure out where it was coming from, but then I found the patroni.dynamic.json file in my data directory, which was used to generate the settings for the new version. I think the best course of action would be to start from the latest DCS config rather than from that file, which had somehow persisted in the data directory.
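To spot the stale value from item 2 before an upgrade, the cached file can be compared with what the DCS currently holds. The tasks below are only a minimal sketch, not part of the playbook: the `postgresql_data_dir` variable and the `/etc/patroni/patroni.yml` path are assumptions about the local setup, and the file name assumes Patroni's usual `patroni.dynamic.json` cache.

```yaml
# Hypothetical pre-upgrade check; variable names and paths are assumptions.
- name: Read Patroni's cached dynamic configuration from the data directory
  ansible.builtin.slurp:
    src: "{{ postgresql_data_dir }}/patroni.dynamic.json"
  register: patroni_dynamic_cache

- name: Read the dynamic configuration currently stored in the DCS
  ansible.builtin.command: patronictl -c /etc/patroni/patroni.yml show-config
  register: patroni_dcs_config
  changed_when: false  # read-only command, never reports a change

- name: Show both so a stale parameter (e.g. log_timezone) can be spotted
  ansible.builtin.debug:
    msg:
      # slurp returns base64; decode and parse before picking the parameters
      cached_parameters: "{{ (patroni_dynamic_cache.content | b64decode | from_json).get('postgresql', {}).get('parameters', {}) }}"
      dcs_config: "{{ patroni_dcs_config.stdout | from_yaml }}"
```

If the two outputs disagree, the cached file is the one to be suspicious of.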

So item 1 definitely looks like a bug to me, while item 2 is mostly my own mistake and lack of understanding of patroni configuration, but I think it should be highlighted so that others are aware of it during the upgrade.

rcknr commented 2 months ago

Does it make sense to make the following changes?

If that's fine, I can produce a PR.

vitabaks commented 2 months ago

If the tasks are performed on the wrong nodes, then this is a mistake: they must be performed on the appropriate host groups, or use delegate_to.
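The two options mentioned above could look roughly like the following. This is a hedged sketch only, not the change that was eventually merged; the task names are made up, and the `balancers` group name is taken from the inventory description earlier in this thread.

```yaml
# Option A: keep the task in the role, but skip hosts outside the balancers group.
- name: Stop confd on balancer hosts only
  ansible.builtin.service:
    name: confd
    state: stopped
  when: inventory_hostname in (groups['balancers'] | default([]))

# Option B: run once from any host and delegate the action to each balancer.
- name: Stop confd on each balancer via delegation
  ansible.builtin.service:
    name: confd
    state: stopped
  delegate_to: "{{ item }}"
  run_once: true
  loop: "{{ groups['balancers'] | default([]) }}"
```

Option A fits when the balancers are also part of the play's hosts; Option B fits when the play only targets the database nodes.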

I'll take a look at it later.

vitabaks commented 1 week ago

Fixed here https://github.com/vitabaks/postgresql_cluster/pull/699