timescale / helm-charts

Configuration and Documentation to run TimescaleDB in your Kubernetes cluster
Apache License 2.0

pgbackrest info missing stanza path when BOOTSTRAP_FROM_BACKUP=1 #628

Open lazzarello opened 8 months ago

lazzarello commented 8 months ago

Patroni fails to bootstrap from backup in the restore_or_initdb method. The script cannot run pgbackrest info successfully when the environment variable BOOTSTRAP_FROM_BACKUP=1 is set.

I would like to bootstrap a new deployment from S3 backups, which are functioning correctly. It appears that pgbackrest needs postgres to be running to create a stanza, which is required to bootstrap postgres from a backup. Postgres can't start yet because it doesn't have the backup from which to start, which means the pgbackrest stanza cannot be created.

Feels like a race condition.

Output from an interactive container session

patroni

postgres@postgres-bootstrap-restore-0:~$ patroni /etc/timescaledb/patroni.yaml
2023-11-03 23:00:08,358 WARNING: Retry got exception: 'connection problems'
/var/run/postgresql:5432 - no response
2023-11-03 23:00:08,364 WARNING: Failed to determine PostgreSQL state from the connection, falling back to cached role
Sourcing /home/postgres/.pod_environment
2023-11-03 23:00:08 - restore_or_initdb - Attempting restore from backup
2023-11-03 23:00:08 - restore_or_initdb - Listing available backup information
WARN: environment contains invalid option 'backup-enabled'
WARN: configuration file contains invalid option 'repo1Path'
stanza: poddb
    status: error (missing stanza path)
WARN: environment contains invalid option 'backup-enabled'
WARN: configuration file contains invalid option 'repo1Path'
2023-11-03 23:00:08.390 P00   INFO: restore command begin 2.44: --config=/etc/pgbackrest/pgbackrest.conf --exec-id=410-54be1d5a --link-all --log-level-console=detail --pg1-path=/var/lib/postgresql/data --process-max=4 --repo1-cipher-type=none --repo1-path=/default/postgres-timescale --spool-path=/var/run/postgresql --stanza=poddb
WARN: repo1: [FileMissingError] unable to load info file '/default/postgres-timescale/backup/poddb/backup.info' or '/default/postgres-timescale/backup/poddb/backup.info.copy':
      FileMissingError: unable to open missing file '/default/postgres-timescale/backup/poddb/backup.info' for read
      FileMissingError: unable to open missing file '/default/postgres-timescale/backup/poddb/backup.info.copy' for read
      HINT: backup.info cannot be opened and is required to perform a backup.
      HINT: has a stanza-create been performed?
ERROR: [075]: no backup set found to restore
2023-11-03 23:00:08.390 P00   INFO: restore command end: aborted with exception [075]
2023-11-03 23:00:08 - restore_or_initdb - Bootstrap from backup failed
2023-11-03 23:00:08,721 WARNING: Retry got exception: 'connection problems'
/var/run/postgresql:5432 - no response
2023-11-03 23:00:08,727 WARNING: Failed to determine PostgreSQL state from the connection, falling back to cached role
Traceback (most recent call last):
  File "/usr/bin/patroni", line 33, in <module>
    sys.exit(load_entry_point('patroni==2.1.4', 'console_scripts', 'patroni')())
  File "/usr/lib/python3/dist-packages/patroni/__main__.py", line 144, in main
    return patroni_main()
  File "/usr/lib/python3/dist-packages/patroni/__main__.py", line 136, in patroni_main
    abstract_main(Patroni, schema)
  File "/usr/lib/python3/dist-packages/patroni/daemon.py", line 108, in abstract_main
    controller.run()
  File "/usr/lib/python3/dist-packages/patroni/__main__.py", line 106, in run
    super(Patroni, self).run()
  File "/usr/lib/python3/dist-packages/patroni/daemon.py", line 65, in run
    self._run_cycle()
  File "/usr/lib/python3/dist-packages/patroni/__main__.py", line 109, in _run_cycle
    logger.info(self.ha.run_cycle())
  File "/usr/lib/python3/dist-packages/patroni/ha.py", line 1710, in run_cycle
    info = self._run_cycle()
  File "/usr/lib/python3/dist-packages/patroni/ha.py", line 1548, in _run_cycle
    return self.post_bootstrap()
  File "/usr/lib/python3/dist-packages/patroni/ha.py", line 1440, in post_bootstrap
    self.cancel_initialization()
  File "/usr/lib/python3/dist-packages/patroni/ha.py", line 1433, in cancel_initialization
    raise PatroniFatalException('Failed to bootstrap cluster')
patroni.exceptions.PatroniFatalException: 'Failed to bootstrap cluster'
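
One detail in the log above is worth checking: the `restore command begin` line lists the options pgBackRest resolved, and it shows `--repo1-path` but no `--repo1-type` or `--repo1-s3-*` option, which suggests the repository was being read as a local path rather than from S3. A minimal shell sketch for inspecting such a line follows. `resolved_options` and `has_option` are hypothetical helper names, not part of the chart or of pgBackRest, and the sketch assumes option values contain no spaces.

```shell
# Hypothetical helpers for inspecting a pgBackRest "command begin" log line.

# Print each --option[=value] found in the log line on stdin, one per line.
# Assumes option values contain no spaces.
resolved_options() {
  tr ' ' '\n' | grep -- '^--'
}

# Succeed if the log line on stdin contains the named option.
# $1 is the option name without the leading dashes, e.g. repo1-type.
has_option() {
  resolved_options | grep -Eq -- "^--$1(=|\$)"
}
```

Piping the `restore command begin` line into `has_option repo1-type` would be expected to fail for the log shown here, since no such option appears in it.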

pgbackrest

postgres@postgres-bootstrap-restore-0:~$ pgbackrest info
WARN: environment contains invalid option 'backup-enabled'
WARN: configuration file contains invalid option 'repo1Path'
stanza: poddb
    status: error (missing stanza path)
postgres@postgres-bootstrap-restore-0:~$ pgbackrest --stanza=poddb stanza-create --log-level-stderr=info || exit 1
WARN: environment contains invalid option 'backup-enabled'
WARN: configuration file contains invalid option 'repo1Path'
INFO: stanza-create command begin 2.44: --config=/etc/pgbackrest/pgbackrest.conf --exec-id=465-374b9859 --log-level-stderr=info --pg1-path=/var/lib/postgresql/data --pg1-port=5432 --pg1-socket-path=/var/run/postgresql --repo1-cipher-type=none --repo1-path=/default/postgres-timescale --stanza=poddb
WARN: unable to check pg1: [DbConnectError] unable to connect to 'dbname='postgres' port=5432 host='/var/run/postgresql'': connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: No such file or directory
        Is the server running locally and accepting connections on that socket?
ERROR: [056]: unable to find primary cluster - cannot proceed
       HINT: are all available clusters in recovery?
INFO: stanza-create command end: aborted with exception [056]
exit
command terminated with exit code 1
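
A preflight check along these lines could make the failure easier to spot before a restore is attempted. This is only a sketch: `stanza_status` and `stanza_ok` are hypothetical names, and it parses the text output shown above, which is not guaranteed to stay stable across pgBackRest versions (`pgbackrest info --output=json` is likely the more robust source).

```shell
# Hypothetical preflight helpers; the names are illustrative only.

# Print the first "status:" value from `pgbackrest info` text output on stdin.
stanza_status() {
  sed -n 's/^[[:space:]]*status:[[:space:]]*//p' | head -n 1
}

# Succeed only when that status is exactly "ok".
stanza_ok() {
  [ "$(stanza_status)" = "ok" ]
}
```

Usage would look something like `pgbackrest info --stanza=poddb | stanza_ok || echo "repository not usable for restore" >&2`.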

Reproduction instructions

helm upgrade --install --namespace troubleshooting -f values.yaml postgres-timescale-restore .

values.yaml

image:
  repository: timescale/timescaledb-ha
  tag: pg14.6-ts2.9.3-patroni-dcs-failsafe-p0
  pullPolicy: IfNotPresent
persistentVolumes:
  data:
    size: 10Gi
    storageClass: ebs-sc
  wal:
    size: 10Gi
    storageClass: ebs-sc
nodeSelector:
  eks.amazonaws.com/nodegroup: timescaledb-20231023223903624400000001
patroni:
  bootstrap:
    dcs:
      postgresql:
        parameters:
          max_worker_processes: 64  # Must be > timescaledb.max_background_workers + max_parallel_workers (default 8)
          max_parallel_workers: 32
          timescaledb.max_background_workers: 32
secrets:
  pgbackrest:
    PGBACKREST_REPO1_S3_REGION: "us-gov-west-1"
    PGBACKREST_REPO1_S3_KEY: "value"
    PGBACKREST_REPO1_S3_KEY_SECRET: "value"
    PGBACKREST_REPO1_S3_BUCKET: "timescaledb-wal-backups-dev"
    PGBACKREST_REPO1_S3_ENDPOINT: "s3.us-gov-west-1.amazonaws.com"
bootstrapFromBackup:
  enabled: True
  repo1-path: /default/postgres-timescale
backup:
  enabled: false
  pgBackRest:
    repo1-path: /default/postgres-timescale

Environment

Chart is a fork of 0.33.1 with this emptyDir PR merged

Client Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.4", GitCommit:"872a965c6c6526caa949f0c6ac028ef7aff3fb78", GitTreeState:"clean", BuildDate:"2022-11-09T13:36:36Z", GoVersion:"go1.19.3", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"23+", GitVersion:"v1.23.17-eks-be83b3c", GitCommit:"d6adb24671245f68ce3cd985f6b68f124953968d", GitTreeState:"clean", BuildDate:"2023-09-27T17:22:23Z", GoVersion:"go1.19.10", Compiler:"gc", Platform:"linux/amd64"}

Cluster type is EKS in AWS GovCloud created from Terraform. I'm working on a fix, so may have a PR ready in the coming week.

lazzarello commented 7 months ago

It appears my only problem was that the chart doesn't use the environment variable values from secrets.pgbackrest: {} to write /etc/pgbackrest/pgbackrest.conf.

So this was a configuration problem. I will update the documentation and perhaps fix the chart to use env vars for configuration.
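
For anyone hitting the same thing, one possible shape of such a fix is sketched below. It is an assumption about how the chart could behave, not what the chart does. `render_pgbackrest_conf` is a hypothetical helper that turns `PGBACKREST_*` environment variables into `[global]` options, following pgBackRest's naming rule: drop the prefix, lowercase, and replace underscores with dashes.

```shell
# Hypothetical helper, not taken from the chart: print a pgbackrest.conf
# [global] section built from the PGBACKREST_* variables in the environment.
# Assumes single-line values.
render_pgbackrest_conf() {
  echo "[global]"
  env | grep '^PGBACKREST_' | LC_ALL=C sort | while IFS='=' read -r name value; do
    # PGBACKREST_REPO1_S3_BUCKET -> repo1-s3-bucket
    option=$(printf '%s' "${name#PGBACKREST_}" | tr '[:upper:]' '[:lower:]' | tr '_' '-')
    printf '%s=%s\n' "$option" "$value"
  done
}
```

Two caveats: variables that do not map to real pgBackRest options (such as the `backup-enabled` one the warnings above complain about) would be copied through as-is, and writing S3 credentials into a file has different security implications than leaving them in the environment.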