timescale / helm-charts

Configuration and Documentation to run TimescaleDB in your Kubernetes cluster
Apache License 2.0

pgbackrest_restore.sh exits despite backup being enabled #631

Open theelderbeever opened 7 months ago

theelderbeever commented 7 months ago

Cross-posting from a Slack message.

What happened?

We have backup enabled for pgBackRest in our self-hosted chart. The corresponding PGBACKREST_BACKUP_ENABLED environment variable is set to true, as confirmed by exec-ing into the pod. When we run reinit or add a replica to our HA cluster, we see a message in our logs indicating that the pgBackRest restore exited with code 1.

pgbackrest_restore.sh should exit this way only when the environment variable is not true.
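
For readers unfamiliar with the behaviour being described, here is a minimal sketch of the kind of guard meant above. It is an illustration only and not the chart's actual pgbackrest_restore.sh; the function name `restore_guard` and its messages are hypothetical.

```shell
#!/bin/sh
# Hypothetical guard modelled on the behaviour described in this report.
# This is NOT the chart's actual pgbackrest_restore.sh.
restore_guard() {
    # $1: value of PGBACKREST_BACKUP_ENABLED as seen by the calling process
    if [ "$1" = "true" ]; then
        echo "backup enabled, restore may proceed"
        return 0
    fi
    echo "backup not enabled, refusing to restore" >&2
    return 1
}

# Report the status the guard would return in the current environment.
restore_guard "${PGBACKREST_BACKUP_ENABLED:-}" >/dev/null 2>&1
echo "guard status: $?"
```

With a guard like this, a non-zero status is expected only when the variable is unset or not exactly `true` in the process that launches the script. A shell opened by exec-ing into the pod is a separate process, so it may be worth confirming the value in both places.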

ERROR: Error creating replica using method pgbackrest: /etc/timescaledb/scripts/pgbackrest_restore.sh exited with code=1

This also seems to have the effect that, if we write to the primary while the replica is initializing, the replica starts failing to find WAL files and never manages to switch over to streaming replication from the primary. The only workaround thus far is to stop all writes to the primary while the replica is being created.

Did you expect to see something different?

The backup from pgbackrest should succeed.

How to reproduce it (as minimally and precisely as possible):

Environment

timescaledb-single:
  replicaCount: 3
  image:
    tag: pg15.4-ts2.12.2-all
  secrets:
    credentialsSecretName: "billing-platform-timescaledb-patroni"
    pgbackrestSecretName: "billing-platform-timescaledb-pgbackrest"

  podManagementPolicy: Parallel

  backup:
    enabled: true
    pgBackRest:
      compress-type: lz4
      process-max: 4
      start-fast: "y"
      repo1-retention-diff: 2
      repo1-retention-full: 2
      repo1-cipher-type: "none"
      repo1-type: s3
      repo1-s3-region: us-east-1
      repo1-s3-endpoint: s3.amazonaws.com

    pgBackRest:archive-push:
      process-max: 4
      archive-async: "y"

    pgBackRest:archive-get:
      process-max: 4
      archive-async: "y"
      archive-get-queue-max: 2GB

  patroni:
    log:
      level: WARNING
    # https://patroni.readthedocs.io/en/latest/replica_bootstrap.html#bootstrap
    bootstrap:
      dcs:
        synchronous_mode: true
        synchronous_node_count: 1
        master_start_timeout: 0
        postgresql:
          use_slots: false # https://github.com/timescale/helm-charts/blob/timescaledb-single-0.33.1/charts/timescaledb-single/examples/high_throughput.example.yaml-values.yaml
          parameters:
            max_wal_size: 16384
            wal_keep_size: 1024
            wal_segment_size: 67108864 # 64MB
            checkpoint_timeout: 300s
            temp_file_limit: '1024GB'
            max_connections: 1000
            synchronous_commit: remote_apply

  # Values for defining the primary & replica Kubernetes Services.
  service:
    primary:
      type: LoadBalancer
      port: 5432

    replica:
      type: LoadBalancer
      port: 5432

  persistentVolumes:
    data:
      enabled: true
      size: 3Ti
      storageClass: gp3-iops16k
    wal:
      enabled: false
      size: 100Gi
      storageClass: gp3-iops16k
  resources:
    limits:
      cpu: 16000m
      memory: 128Gi
    requests:
      cpu: 16000m
      memory: 128Gi

  sharedMemory:
    useMount: true

  pgBouncer:
    enabled: true
    port: 6432

  prometheus:
    enabled: true

AWS EKS

Anything else we need to know?: