timescale / helm-charts

Configuration and Documentation to run TimescaleDB in your Kubernetes cluster
Apache License 2.0

ERROR: Error creating replica using method pgbackrest #574

Closed · mindrunner closed this issue 1 year ago

mindrunner commented 1 year ago

What did you do?

When starting a pod with empty storage, it usually restores from the Azure backup. This works without issues for the first pod in the StatefulSet after installing the chart. However, on every subsequent pod I only see the error:

ERROR: Error creating replica using method pgbackrest: /etc/timescaledb/scripts/pgbackrest_restore.sh exited with code=1 

and patroni restores the database with

pg_basebackup --pgdata=/var/lib/postgresql/data -X stream --dbname=dbname=postgres user=standby host=10.244.2.8 port=5432 --waldir=/var/lib/postgresql/wal/pg_wa

which comes with several downsides (e.g. it is very slow compared to pgbackrest, error prone, and the WAL volume fills up).

This used to be different. Every new pod was restored by pgbackrest without any issues.

I am not sure if a config change on my side is the problem or if the chart might have a bug.
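
For context, Patroni picks pgbackrest over pg_basebackup via its create_replica_methods setting; the chart wires the restore script in roughly like this (a minimal sketch based on the chart defaults, not the exact rendered configuration):

patroni:
  postgresql:
    create_replica_methods:
      - pgbackrest
      - basebackup
    pgbackrest:
      # restore script shipped by the chart; Patroni falls back to basebackup if it exits non-zero
      command: /etc/timescaledb/scripts/pgbackrest_restore.sh
      keep_data: true
      no_params: true
    basebackup:
      - waldir: /var/lib/postgresql/wal/pg_wal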

Environment

timescaledb-single-0.33.1
timescale/timescaledb-ha:pg14.6-ts2.9.2-p0 
image:
  repository: timescale/timescaledb-ha
  tag: pg14.6-ts2.9.2-p0

replicaCount: 2

nodeSelector:
  hardware: highmem32

tolerations:
  - key: hardware
    operator: Equal
    value: highmem32
    effect: NoSchedule

prometheus:
  enabled: true

resources:
  requests:
    cpu: "3"
    memory: 24Gi
  limits:
    memory: 32Gi

persistentVolumes:
  data:
    enabled: true
    storageClass: managed-csi-premium
    size: 2048Gi
  wal:
    enabled: true
    storageClass: managed-csi-premium
    size: 128Gi

patroni:
  log:
    level: WARNING
  postgresql:
    ssl: 'off'
    listen: 0.0.0.0:5432
    pg_hba:
      - local  all             postgres                              peer
      - local  all             all                                   md5
      - host   all             all                all                md5
      - host   all             all                127.0.0.1/32       md5
      - host   all             all                ::1/128            md5
      - host   replication     standby            all                md5
      - host   all             all                all                md5
    use_unix_socket: true
  bootstrap:
    dcs:
      synchronous_mode: true
      master_start_timeout: 0
      postgresql:
        use_slots: false
        parameters:
          checkpoint_timeout: 310s
          temp_file_limit: -1
          synchronous_commit: remote_apply
          shared_buffers: 16GB
          effective_cache_size: 24GB
          maintenance_work_mem: 2047MB
          work_mem: 64MB
          timescaledb.max_background_workers: 4
          max_worker_processes: 27
          max_parallel_workers_per_gather: 4
          max_parallel_workers: 8
          wal_buffers: 16MB
          min_wal_size: 512MB
          default_statistics_target: 500
          random_page_cost: 1.1
          checkpoint_completion_target: 0.9
          max_connections: 250
          max_locks_per_transaction: 512
          autovacuum_max_workers: 10
          autovacuum_naptime: 10
          effective_io_concurrency: 256
          max_standby_archive_delay: 900s
          max_standby_streaming_delay: 900s
  restapi:
    listen: 0.0.0.0:8008

timescaledbTune:
  enabled: true
  args:
    max-conns: 250
    cpus: 4
    memory: 32GB

bootstrapFromBackup:
  enabled: true
  repo1-path: "/tsdb-backup"

backup:
  enabled: true
  pgBackRest:
    compress-type: lz4
    process-max: 4
    start-fast: "y"
    repo1-azure-account: tsdbbackupprod
    repo1-azure-container: tsdb
    repo1-cipher-type: "none"
    repo1-path: "/tsdb-backup"
    repo1-retention-diff: 2
    repo1-retention-full: 2
    repo1-type: azure
    archive-async: "y"
    archive-get-queue-max: 4GB
    spool-path: /var/lib/postgresql/wal/spool
  jobs:
    - name: full-weekly
      type: full
      schedule: "12 02 * * 0"
    - name: incremental-daily
      type: incr
      schedule: "12 02 * * 1-6"
  envFrom:
    - secretRef:
        name: tsdb-pgbackrest

secrets:
  pgbackrestSecretName: "tsdb-pgbackrest"
  certificateSecretName: "tsdb-certificate"
  credentialsSecretName: "tsdb-credentials"
Client Version: v1.26.1
Kustomize Version: v4.5.7
Server Version: v1.25.5
mindrunner commented 1 year ago

Manually running /etc/timescaledb/scripts/restore_or_initdb.sh or /etc/timescaledb/scripts/pgbackrest_restore.sh seems to work :/ (i.e. it starts the restore, but this is definitely not the way to do it; bad things happen afterwards...)

mindrunner commented 1 year ago

It seems like the exit code 1 comes from here:

[ "${PGBACKREST_BACKUP_ENABLED}" = "true" ] || exit 1

Commenting out this line results in the backup being restored from the archive on startup with a new volume.

Edit: this restore results in a non-working standby node. Idk what I am doing wrong.

Need help...
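
As a quick sanity check, it may help to confirm whether the variable is visible inside the running container at all (the pod and container names below are placeholders for whatever your release uses):

kubectl exec -it tsdb-timescaledb-1 -c timescaledb -- env | grep PGBACKREST_BACKUP_ENABLED

If nothing is printed, the guard above will always exit with code 1.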

mindrunner commented 1 year ago

@paulfantom I see a lot of backup/restore-related commits from you in December. Maybe something changed that I do not understand? Or maybe there is a regression?

timclifford commented 1 year ago

+1

mindrunner commented 1 year ago

+1

Funny, you're posting this just at the moment when I'm struggling again and have no clue what's going on. Idk how much time I have to investigate, because I am not really getting paid for this stuff anymore. Do you mind elaborating on your +1?

timclifford commented 1 year ago

Essentially I have the same issue: backups are not being triggered on the replicas even though PGBACKREST_BACKUP_ENABLED: true is set:

2023-06-02 08:24:52 - bootstrap - Waiting for PostgreSQL to become available
2023-06-02 08:25:05 - bootstrap - Starting pgBackrest api to listen for backup requests
2023-06-02 08:25:05,382 - INFO - backup - Starting loop waiting for backup events
2023-06-02 08:25:06,384 - INFO - history - Refreshing backup history using pgbackrest
2023-06-02 08:25:06,384 - DEBUG - backup - Waiting until backup triggered

The backup jobs themselves (to S3) are also continuously running (though I'm not sure that's an issue):

HTTP/1.0 202 Accepted
Server: BaseHTTP/0.6 Python/3.10.6
Date: Fri, 02 Jun 2023 00:18:02 GMT
Location: /backups/backup/20230602001801
Content-Type: application/json

{
    "age": 1.0,
    "duration": 1.0,
    "finished": null,
    "label": "20230602001801",
    "pgbackrest": {},
    "pid": 530,
    "returncode": null,
    "started": "2023-06-02T00:18:01+00:00",
    "status": "RUNNING"
}
mindrunner commented 1 year ago

I don't see how your issue is related. This ticket is about creating replicas, but you're talking about making backups.

Furthermore, it looks like your backup job is running. Backups always run on the master, never on replicas.
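
If you want to see which backups actually landed in the repository, pgbackrest info on the current master is the quickest check (pod and stanza names below are placeholders; adjust them to your release):

kubectl exec -it tsdb-timescaledb-0 -- pgbackrest info --stanza=poddb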

timclifford commented 1 year ago

According to the docs, every new replica will attempt to restore from an S3 backup (if available), but on creation of the pod I get this:

2023-06-02 06:54:43,219 ERROR: Error creating replica using method pgbackrest: /etc/timescaledb/scripts/pgbackrest_restore.sh exited with code=1
/var/run/postgresql:5432 - no response

So the replicas aren't being created using pgbackrest because /etc/timescaledb/scripts/pgbackrest_restore.sh exited with code=1. The replica is running and accepting connections, but I now assume there is a sync issue between replica and master.

mindrunner commented 1 year ago

Yes, that looks more like the problem. I can confirm this behaviour here as well. The replica creates itself from the master, or from another replica if available, as a fallback. This takes way longer (in my case), which can be a pain on large instances.

mindrunner commented 1 year ago

Turned out that I forgot to bump my chart while reverting my local changes to fix that issue. Testing with the most recent one now. So it is supposed to be fixed in 0.33.2. Which chart version are you using?

mindrunner commented 1 year ago

oof, they never released the 0.33.2.... :( This project seems really chaotic. Hang on. I will provide a workaround....

mindrunner commented 1 year ago

To use the most recent changes with chart 0.33.1, add the following to your values.yml:

patroni:
  postgresql:
    pgbackrest:
      command: /home/postgres/pgbackrest_restore.sh
    recovery_conf:
      restore_command: /home/postgres/pgbackrest_archive_get.sh %f "%p"

[..]

debug:
  execStartPre: curl -o /home/postgres/pgbackrest_restore.sh https://raw.githubusercontent.com/timescale/helm-charts/main/charts/timescaledb-single/scripts/pgbackrest_restore.sh && chmod +x /home/postgres/pgbackrest_restore.sh && curl -o /home/postgres/pgbackrest_archive_get.sh https://raw.githubusercontent.com/timescale/helm-charts/main/charts/timescaledb-single/scripts/pgbackrest_archive_get.sh && chmod +x /home/postgres/pgbackrest_archive_get.sh
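
Once the pods have restarted, you can verify that execStartPre actually fetched the scripts and left them executable (the pod name is a placeholder):

kubectl exec -it tsdb-timescaledb-0 -- ls -l /home/postgres/pgbackrest_restore.sh /home/postgres/pgbackrest_archive_get.sh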

@mathisve why so sloppy? :(

Also see https://github.com/timescale/helm-charts/issues/596

Don't leave the community behind....

timclifford commented 1 year ago

I'm using 0.33.1 and will test your changes now. Shame if these charts aren't being supported anymore, tbh.

timclifford commented 1 year ago
2023-06-02 10:49:30.732 P00   INFO: restore command begin 2.43: --config=/etc/pgbackrest/pgbackrest.conf --delta --exec-id=29-3b5be8d6 --force --link-all --log-level-console=detail --pg1-path=/var/lib/postgresql/data --process-max=4 --repo1-cipher-type=none --repo1-path=/xxxxxx/timescaledb/ --repo1-s3-bucket=timescaledb-xxxx-backups --repo1-s3-endpoint=s3.amazonaws.com --repo1-s3-key=<redacted> --repo1-s3-key-secret=<redacted> --repo1-s3-region=eu-west-1 --repo1-type=s3 --spool-path=/var/run/postgresql --stanza=poddb
WARN: --delta or --force specified but unable to find 'PG_VERSION' or 'backup.manifest' in '/var/lib/postgresql/data' to confirm that this is a valid $PGDATA directory.  --delta and --force have been disabled and if any files exist in the destination directories the restore will be aborted.
2023-06-02 10:49:30.868 P00   INFO: repo1: restore backup set 20230602-065746F, recovery will start at 2023-06-02 06:57:46
2023-06-02 10:49:30.868 P00 DETAIL: check '/var/lib/postgresql/data' exists
2023-06-02 10:49:30.869 P00 DETAIL: check '/var/lib/postgresql/wal/pg_wal' exists

...

2023-06-02 10:49:41.954 P00 DETAIL: sync path '/var/lib/postgresql/data/pg_wal/archive_status'
2023-06-02 10:49:41.954 P00 DETAIL: sync path '/var/lib/postgresql/data/pg_xact'
2023-06-02 10:49:41.955 P00   INFO: restore global/pg_control (performed last to ensure aborted restores cannot be started)
2023-06-02 10:49:41.956 P00 DETAIL: sync path '/var/lib/postgresql/data/global'
2023-06-02 10:49:41.957 P00   INFO: restore size = 39.8MB, file total = 1745
2023-06-02 10:49:41.958 P00 DETAIL: statistics: {"http.client":{"total":1},"http.request":{"total":2},"http.session":{"total":1},"socket.client":{"total":1},"socket.session":{"total":1},"tls.client":{"total":1},"tls.session":{"total":1}}
2023-06-02 10:49:41.958 P00   INFO: restore command end: completed successfully (11229ms)

That seems to have done it for me @mindrunner 👍

timclifford commented 1 year ago

It doesn't make sense to me why the new script works but the existing one (/etc/timescaledb/scripts/pgbackrest_restore.sh) doesn't, when the only change I can see is that the [ "${PGBACKREST_BACKUP_ENABLED}" = "true" ] || exit 1 condition has been moved to the top?

mindrunner commented 1 year ago

See the PR for an explanation:

https://github.com/timescale/helm-charts/pull/581

timclifford commented 1 year ago

Ah, I see: sourcing the env_file is required to access PGBACKREST_BACKUP_ENABLED, which must otherwise default to false.
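
So the fix essentially amounts to reading the environment before evaluating the guard; a rough sketch of the intended ordering (the env file path here is illustrative only, see the script in the PR for the real one):

# Load the pgbackrest environment first, otherwise PGBACKREST_BACKUP_ENABLED is unset.
# The path below is illustrative; the actual script knows where its env file lives.
[ -f /etc/pgbackrest/bootstrap/env ] && . /etc/pgbackrest/bootstrap/env

# Only now does the guard see the real value instead of an empty string.
[ "${PGBACKREST_BACKUP_ENABLED}" = "true" ] || exit 1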