patroni / patroni

A template for PostgreSQL High Availability with Etcd, Consul, ZooKeeper, or Kubernetes
MIT License

Patroni Error in process_unhealthy_cluster #1969

Closed ponvenkates closed 3 years ago

ponvenkates commented 3 years ago

Describe the bug We are using an in-house Docker image built from the Bitnami PostgreSQL base image plus a pip3 installation of Patroni (pip3 install patroni==2.0.2). These are the same versions that were working as expected in previous builds (even last week). Today I rebuilt the same image as-is (without any change) and it broke the cluster completely; when I roll back to my previous tag (which technically should be identical), everything works fine. I am not sure this is really an issue with Patroni, but I am raising it here hoping to get some pointers. Also, since this is an unhandled exception, maybe I can be of some help to the developers in handling it in a future release.

Patroni encounters an internal error from PostgreSQL when trying to get the timeline (cannot execute SQL commands in WAL sender for physical replication), and because of that (I guess) a TypeError is raised. In the end, the cluster is left with no leader and all nodes report the same errors repeatedly.

Screenshots: (image attached)

Environment

Patroni configuration file

# base settings
bootstrap:
  initdb:
  - encoding: UTF8
  - data-checksums

  post_init: /post_init.sh

  pg_hba:
  - host all all 0.0.0.0/0 md5
  - local all all    md5

postgresql:
  _comment: postgres settings

log:
  _comment: Log configuration
  level: DEBUG
  traceback_level: ERROR

patronictl show-config

check_timeline: true
log:
  level: DEBUG
loop_wait: 4
master_start_timeout: 60
maximum_lag_on_failover: 1048576
postgresql:
  parameters:
    max_connections: 50
    max_replication_slots: 5
    max_wal_senders: 20
    max_worker_processes: 9
    unix_socket_directories: /tmp
    wal_keep_segments: 512
    wal_level: logical
  pg_hba:
  - local all all    md5
  - host all all 0.0.0.0/0 md5
  - host replication +service_replicators,postgres_repl_user 0.0.0.0/0 md5
  remove_data_directory_on_diverged_timelines: true
  remove_data_directory_on_rewind_failure: true
  use_pg_rewind: true
  use_slots: true
retry_timeout: 8
ttl: 20

Have you checked Patroni logs?

2021-06-17 08:44:12.644 GMT [1] LOG: starting patroni
2021-06-17 08:44:14.741 GMT [91] INFO: No PostgreSQL configuration items changed, nothing to reload.
2021-06-17 08:44:14.757 GMT [91] INFO: Deregister service postgres/pg-one-photon
2021-06-17 08:44:14.763 GMT [91] WARNING: Postgresql is not running.
2021-06-17 08:44:14.763 GMT [91] INFO: Lock owner: None; I am pg-one-photon
2021-06-17 08:44:14.767 GMT [91] INFO: pg_controldata:
  pg_control version number: 1100
  Catalog version number: 201809051
  Database system identifier: 6968785271539003528
  Database cluster state: shut down in recovery
  pg_control last modified: Thu Jun 17 08:43:25 2021
  Latest checkpoint location: 0/6006710
  Latest checkpoint's REDO location: 0/60066D8
  Latest checkpoint's REDO WAL file: 000000110000000000000006
  Latest checkpoint's TimeLineID: 17
  Latest checkpoint's PrevTimeLineID: 17
  Latest checkpoint's full_page_writes: on
  Latest checkpoint's NextXID: 0:586
  Latest checkpoint's NextOID: 40976
  Latest checkpoint's NextMultiXactId: 1
  Latest checkpoint's NextMultiOffset: 0
  Latest checkpoint's oldestXID: 562
  Latest checkpoint's oldestXID's DB: 1
  Latest checkpoint's oldestActiveXID: 585
  Latest checkpoint's oldestMultiXid: 1
  Latest checkpoint's oldestMulti's DB: 1
  Latest checkpoint's oldestCommitTsXid: 0
  Latest checkpoint's newestCommitTsXid: 0
  Time of latest checkpoint: Tue Jun 15 05:29:34 2021
  Fake LSN counter for unlogged rels: 0/1
  Minimum recovery ending location: 0/6016738
  Min recovery ending loc's timeline: 17
  Backup start location: 0/0
  Backup end location: 0/0
  End-of-backup record required: no
  wal_level setting: logical
  wal_log_hints setting: on
  max_connections setting: 200
  max_worker_processes setting: 8
  max_prepared_xacts setting: 0
  max_locks_per_xact setting: 64
  track_commit_timestamp setting: off
  Maximum data alignment: 8
  Database block size: 8192
  Blocks per segment of large relation: 131072
  WAL block size: 8192
  Bytes per WAL segment: 16777216
  Maximum length of identifiers: 64
  Maximum columns in an index: 32
  Maximum size of a TOAST chunk: 1996
  Size of a large-object chunk: 2048
  Date/time type storage: 64-bit integers
  Float4 argument passing: by value
  Float8 argument passing: by value
  Data page checksum version: 1
  Mock authentication nonce: c6e9a54bf2e0d8483ff8e0b3ed1eccf34b97434d2d605636083d982c38af9dd7

2021-06-17 08:44:14.835 GMT [91] INFO: Lock owner: None; I am pg-one-photon
2021-06-17 08:44:14.835 GMT [91] INFO: Lock owner: None; I am pg-one-photon
2021-06-17 08:44:14.836 GMT [91] INFO: starting as a secondary
2021-06-17 08:44:14.839 GMT [91] INFO: Lock owner: None; I am pg-one-photon
2021-06-17 08:44:14.839 GMT [91] INFO: not healthy enough for leader race
2021-06-17 08:44:14.853 GMT [91] INFO: restarting after failure in progress
2021-06-17 08:44:15.940 GMT [91] INFO: postmaster pid=117
2021-06-17 08:44:16.238 GMT [117] LOG:  listening on IPv4 address "0.0.0.0", port 5432
2021-06-17 08:44:16.242 GMT [117] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5432"
2021-06-17 08:44:16.634 GMT [117] LOG:  could not open usermap file "/opt/bitnami/postgresql/conf/pg_ident.conf": No such file or directory
2021-06-17 08:44:16.639 GMT [120] LOG:  database system was shut down in recovery at 2021-06-17 08:43:25 GMT
2021-06-17 08:44:16.640 GMT [120] WARNING:  recovery command file "recovery.conf" specified neither primary_conninfo nor restore_command
2021-06-17 08:44:16.640 GMT [120] HINT:  The database server will regularly poll the pg_wal subdirectory to check for files placed there.
2021-06-17 08:44:16.640 GMT [120] LOG:  entering standby mode
2021-06-17 08:44:16.647 GMT [120] LOG:  redo starts at 0/60066D8
2021-06-17 08:44:16.737 GMT [120] LOG:  consistent recovery state reached at 0/6016738
2021-06-17 08:44:16.738 GMT [117] LOG:  database system is ready to accept read only connections
2021-06-17 08:44:16.739 GMT [120] LOG:  invalid record length at 0/6016738: wanted 24, got 0
2021-06-17 08:44:17.539 GMT [140] LOG:  incomplete startup packet
2021-06-17 08:44:17.644 GMT [91] INFO: establishing a new patroni connection to the postgres cluster
2021-06-17 08:44:17.938 GMT [142] ERROR:  cannot execute SQL commands in WAL sender for physical replication
2021-06-17 08:44:17.939 GMT [143] ERROR:  cannot execute SQL commands in WAL sender for physical replication
2021-06-17 08:44:17.939 GMT [91] ERROR: Can not fetch local timeline and lsn from replication connection
Traceback (most recent call last):
  File "/usr/lib/python3.7/site-packages/patroni/postgresql/__init__.py", line 738, in get_replica_timeline
    cur.execute('IDENTIFY_SYSTEM')
psycopg2.errors.InternalError_: cannot execute SQL commands in WAL sender for physical replication

2021-06-17 08:44:17.941 GMT [91] ERROR: Can not fetch local timeline and lsn from replication connection
Traceback (most recent call last):
  File "/usr/lib/python3.7/site-packages/patroni/postgresql/__init__.py", line 738, in get_replica_timeline
    cur.execute('IDENTIFY_SYSTEM')
psycopg2.errors.InternalError_: cannot execute SQL commands in WAL sender for physical replication

2021-06-17 08:44:18.036 GMT [144] ERROR:  cannot execute SQL commands in WAL sender for physical replication
2021-06-17 08:44:18.37 GMT [91] ERROR: Can not fetch local timeline and lsn from replication connection
Traceback (most recent call last):
  File "/usr/lib/python3.7/site-packages/patroni/ha.py", line 1427, in _run_cycle
    return self.process_unhealthy_cluster()
  File "/usr/lib/python3.7/site-packages/patroni/ha.py", line 944, in process_unhealthy_cluster
    if self.is_healthiest_node():
  File "/usr/lib/python3.7/site-packages/patroni/ha.py", line 803, in is_healthiest_node
    return self._is_healthiest_node(members.values())
  File "/usr/lib/python3.7/site-packages/patroni/ha.py", line 671, in _is_healthiest_node
    if my_timeline < cluster_timeline:
TypeError: '<' not supported between instances of 'NoneType' and 'int'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.7/site-packages/patroni/postgresql/__init__.py", line 738, in get_replica_timeline
    cur.execute('IDENTIFY_SYSTEM')
psycopg2.errors.InternalError_: cannot execute SQL commands in WAL sender for physical replication

2021-06-17 08:44:18.134 GMT [91] INFO: Register service postgres, params {'service_id': 'postgres/pg-one-photon', 'address': '192.168.50.91', 'port': 5432, 'check': {'http': 'http://192.168.50.91:28008/replica', 'interval': '5s', 'DeregisterCriticalServiceAfter': '100.0s'}, 'tags': ['replica']}
2021-06-17 08:44:18.143 GMT [91] ERROR: Unexpected exception
Traceback (most recent call last):
  File "/usr/lib/python3.7/site-packages/patroni/ha.py", line 1457, in run_cycle
    info = self._run_cycle()
  File "/usr/lib/python3.7/site-packages/patroni/ha.py", line 1427, in _run_cycle
    return self.process_unhealthy_cluster()
  File "/usr/lib/python3.7/site-packages/patroni/ha.py", line 944, in process_unhealthy_cluster
    if self.is_healthiest_node():
  File "/usr/lib/python3.7/site-packages/patroni/ha.py", line 803, in is_healthiest_node
    return self._is_healthiest_node(members.values())
  File "/usr/lib/python3.7/site-packages/patroni/ha.py", line 671, in _is_healthiest_node
    if my_timeline < cluster_timeline:
TypeError: '<' not supported between instances of 'NoneType' and 'int'
2021-06-17 08:44:18.143 GMT [91] INFO: Unexpected exception raised, please report it as a BUG

Have you checked PostgreSQL logs? Yes, they are included in the logs above.

Have you tried to use GitHub issue search? Yes, no luck.

CyberDem0n commented 3 years ago

Yesterday a new version of psycopg2 was released; it seems that autocommit = True is now ignored inside the with connection block: https://github.com/psycopg/psycopg2/commit/e5ad0ab2d91310ac8e3fff534311a135502fff61

I suggest sticking with psycopg2==2.8.6 for now.
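
For anyone hitting this, here is a minimal sketch of the failing pattern (not Patroni's actual code; the DSN and credentials are placeholders, and a physical replication connection is assumed):

import psycopg2

# Placeholder DSN for a physical replication (walsender) connection; such a
# connection only accepts replication commands (IDENTIFY_SYSTEM,
# START_REPLICATION, ...) and cannot run statements inside a transaction.
conn = psycopg2.connect('host=127.0.0.1 port=5432 user=replicator '
                        'dbname=postgres replication=true')
conn.autocommit = True

# psycopg2 <= 2.8.6: the `with` block is effectively a no-op for an
# autocommit connection, so IDENTIFY_SYSTEM is sent without an implicit
# BEGIN and succeeds.
# psycopg2 2.9: autocommit is ignored inside `with connection`, an implicit
# transaction is opened, and the walsender rejects it with
# "cannot execute SQL commands in WAL sender for physical replication".
with conn:
    with conn.cursor() as cur:
        cur.execute('IDENTIFY_SYSTEM')
        print(cur.fetchone())

Pinning the driver alongside Patroni in the image, e.g. pip3 install patroni==2.0.2 psycopg2==2.8.6, keeps the build reproducible and avoids picking up the new release.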

ponvenkates commented 3 years ago

@CyberDem0n Spot on. That solves my problem. Thanks a lot, much appreciated. I will close the bug.

dvarrazzo commented 3 years ago

Hello. with connection is documented as a way to start a transaction. On autocommit connections it used to be a no-op (the connection is not closed at the end of the with block), but that was a design error.

If you need an autocommit transaction, you shouldn't use with.
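
A sketch of that advice, assuming the same hypothetical replication connection as in the earlier example: keep autocommit and drop the with connection block (closing can be handled explicitly or via contextlib.closing), so no transaction is started around the replication command:

from contextlib import closing

import psycopg2

# Same placeholder walsender DSN as above; autocommit is required because a
# physical replication connection cannot execute SQL inside a transaction.
with closing(psycopg2.connect('host=127.0.0.1 port=5432 user=replicator '
                              'dbname=postgres replication=true')) as conn:
    conn.autocommit = True
    with conn.cursor() as cur:
        # The cursor's `with` block only closes the cursor; without
        # `with conn:` no implicit transaction is started, regardless of
        # the psycopg2 version.
        cur.execute('IDENTIFY_SYSTEM')
        print(cur.fetchone())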

honglei commented 3 years ago

https://github.com/psycopg/psycopg2/issues/941