timescale / helm-charts

Configuration and Documentation to run TimescaleDB in your Kubernetes cluster
Apache License 2.0

bug report: timescaledb-tune values are overwritten #198

Open mo-rieger opened 4 years ago

mo-rieger commented 4 years ago

Hi, we are running the timescaledb-single chart (v0.6.2) with the pg12.3-ts1.7.2-latest image on Kubernetes. We realized that all tuned values from /var/run/postgresql/timescaledb.conf are included in postgresql.conf. As I read in the Patroni docs, Patroni renames postgresql.conf to postgresql.base.conf.

When we look at the current postgresql.conf of a running pod we see that all tuned values are overwritten as the postgresql.base.conf file is included at the top of the file.

We have no clue where the values that override the tuned values come from. We did not specify them in the patroni.bootstrap.dcs.postgresql.parameters.

timescaledb.conf

shared_preload_libraries = 'timescaledb'
shared_buffers = 1125MB
effective_cache_size = 3375MB
maintenance_work_mem = 576000kB
work_mem = 7680kB
wal_buffers = 16MB
min_wal_size = 3GB
max_wal_size = 6GB
default_statistics_target = 500
random_page_cost = 1.1
checkpoint_completion_target = 0.9
max_connections = 75
max_locks_per_transaction = 64
autovacuum_max_workers = 10
autovacuum_naptime = 10
effective_io_concurrency = 200
timescaledb.last_tuned = '2020-08-13T11:45:53Z'
timescaledb.last_tuned_version = '0.9.0'
max_wal_size=6128MB
min_wal_size=3064MB

postgresql.conf

# Do not edit this file manually!
# It will be overwritten by Patroni!
include 'postgresql.base.conf'

archive_command = '/etc/timescaledb/scripts/pgbackrest_archive.sh %p'
archive_mode = 'on'
archive_timeout = '1800s'
autovacuum_analyze_scale_factor = '0.02'
autovacuum_max_workers = '10'
autovacuum_naptime = '10'
autovacuum_vacuum_cost_limit = '500'
autovacuum_vacuum_scale_factor = '0.05'
checkpoint_completion_target = '0.9'
checkpoint_timeout = '310s'
cluster_name = 'default-server'
default_statistics_target = '500'
effective_cache_size = '3GB'
effective_io_concurrency = '200'
hot_standby = 'on'
listen_addresses = '0.0.0.0'
log_autovacuum_min_duration = '1min'
log_checkpoints = 'on'
log_connections = 'on'
log_disconnections = 'on'
log_line_prefix = '%t [%p]: [%c-%l] %u@%d,app=%a [%e] '
log_lock_waits = 'on'
log_min_duration_statement = '1s'
log_statement = 'ddl'
maintenance_work_mem = '512MB'
max_connections = '100'
max_locks_per_transaction = '64'
max_prepared_transactions = '150'
max_replication_slots = '10'
max_wal_senders = '10'
max_wal_size = '6GB'
max_worker_processes = '8'
min_wal_size = '3GB'
port = '5432'
random_page_cost = '1.1'
shared_buffers = '1GB'
shared_preload_libraries = 'timescaledb,pg_stat_statements'
ssl = 'on'
ssl_cert_file = '/etc/certificate/tls.crt'
ssl_key_file = '/etc/certificate/tls.key'
synchronous_commit = 'remote_apply'
tcp_keepalives_idle = '900'
tcp_keepalives_interval = '100'
temp_file_limit = '2GB'
timescaledb.passfile = '../.pgpass'
track_commit_timestamp = 'off'
unix_socket_directories = '/var/run/postgresql'
unix_socket_permissions = '0750'
wal_buffers = '16MB'
wal_keep_segments = '8'
wal_level = 'replica'
wal_log_hints = 'on'
work_mem = '10485kB'
hba_file = '/var/lib/postgresql/data/pg_hba.conf'
ident_file = '/var/lib/postgresql/data/pg_ident.conf'

# recovery.conf
recovery_target = ''
recovery_target_lsn = ''
recovery_target_name = ''
recovery_target_time = ''
recovery_target_timeline = 'latest'
recovery_target_xid = ''
restore_command = '/etc/timescaledb/scripts/pgbackrest_archive_get.sh %f "%p"'

patroni.yaml

bootstrap:
  dcs:
    loop_wait: 10
    master_start_timeout: 0
    maximum_lag_on_failover: 33554432
    postgresql:
      parameters:
        archive_command: /etc/timescaledb/scripts/pgbackrest_archive.sh %p
        archive_mode: "on"
        archive_timeout: 1800s
        autovacuum_analyze_scale_factor: 0.02
        autovacuum_max_workers: 10
        autovacuum_naptime: 5s
        autovacuum_vacuum_cost_limit: 500
        autovacuum_vacuum_scale_factor: 0.05
        checkpoint_timeout: 310s
        hot_standby: "on"
        log_autovacuum_min_duration: 1min
        log_checkpoints: "on"
        log_connections: "on"
        log_disconnections: "on"
        log_line_prefix: '%t [%p]: [%c-%l] %u@%d,app=%a [%e] '
        log_lock_waits: "on"
        log_min_duration_statement: 1s
        log_statement: ddl
        max_connections: 100
        max_prepared_transactions: 150
        shared_preload_libraries: timescaledb,pg_stat_statements
        ssl: "on"
        ssl_cert_file: /etc/certificate/tls.crt
        ssl_key_file: /etc/certificate/tls.key
        synchronous_commit: remote_apply
        tcp_keepalives_idle: 900
        tcp_keepalives_interval: 100
        temp_file_limit: 2GB
        timescaledb.passfile: ../.pgpass
        unix_socket_directories: /var/run/postgresql
        unix_socket_permissions: "0750"
        wal_level: hot_standby
        wal_log_hints: "on"
      use_pg_rewind: true
      use_slots: false
    retry_timeout: 10
    synchronous_mode: true
    ttl: 30
  method: restore_or_initdb
  post_init: /etc/timescaledb/scripts/post_init.sh
  restore_or_initdb:
    command: |
      /etc/timescaledb/scripts/restore_or_initdb.sh --encoding=UTF8 --locale=C.UTF-8
    keep_existing_recovery_conf: true
kubernetes:
  ports:
  - name: postgresql
    port: 5432
    targetPort: 5432
  role_label: role
  scope_label: cluster-name
  use_endpoints: true
log:
  level: WARNING
postgresql:
  authentication:
    replication:
      username: standby
    superuser:
      username: postgres
  basebackup:
  - waldir: /var/lib/postgresql/wal/pg_wal
  callbacks:
    on_reload: /etc/timescaledb/scripts/patroni_callback.sh
    on_restart: /etc/timescaledb/scripts/patroni_callback.sh
    on_role_change: /etc/timescaledb/scripts/patroni_callback.sh
    on_start: /etc/timescaledb/scripts/patroni_callback.sh
    on_stop: /etc/timescaledb/scripts/patroni_callback.sh
  create_replica_methods:
  - pgbackrest
  - basebackup
  listen: 0.0.0.0:5432
  pg_hba:
  - hostnossl all,replication all                all                reject
  - local     all             all                                   peer
  - hostssl   all             all                127.0.0.1/32       md5
  - hostssl   all             all                ::1/128            md5
  - hostssl   replication     standby            all                md5
  - hostssl   all             all                all                md5
  pgbackrest:
    command: /etc/timescaledb/scripts/pgbackrest_restore.sh
    keep_data: true
    no_master: 1
    no_params: true
  recovery_conf:
    restore_command: /etc/timescaledb/scripts/pgbackrest_archive_get.sh %f "%p"
  use_unix_socket: true
restapi:

Even when we specify values explicitly in patroni.bootstrap.dcs.postgresql.parameters, for example max_connections, the patroni.yaml reflects these values, but they still get overwritten by some value set in postgresql.conf.

We restarted the pod and the whole Helm release, but this had no effect on the PostgreSQL config.

Any idea what we are missing here?

feikesteenbergen commented 4 years ago

You might be running into something that is not taken care of (yet) in these Helm charts:

Some of the PostgreSQL parameters must hold the same values on the master and the replicas. For those, values set either in the local patroni configuration files or via the environment variables take no effect. To alter or set their values one must change the shared configuration in the DCS. Below is the actual list of such parameters together with the default values:

https://patroni.readthedocs.io/en/latest/dynamic_configuration.html

This includes max_connections.

For Patroni to change these values, a certain sequence of events needs to take place: the value needs to be changed in the DCS (which in these Helm Charts is the -config Kubernetes endpoint).

If you do a helm upgrade --install RELEASE -f myvalues.yaml, there's a Job scheduled that will trigger the DCS change.
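
To see what is currently stored in the DCS for your deployment, you can inspect the config annotation on that endpoint. A minimal sketch, assuming your release is named default-server and jq is available:

# The DCS used by these charts is the <release>-config Kubernetes endpoint;
# its "config" annotation holds the Patroni dynamic configuration as JSON.
kubectl get ep/default-server-config \
  -o jsonpath="{.metadata.annotations.config}" | jq .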

feikesteenbergen commented 4 years ago

One more thing:

We did not specify them in the patroni.bootstrap.dcs.postgresql.parameters.

But the default values.yaml does have a list of defaults. If you do not specify any values, the defaults from values.yaml are being used.
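
For reference, you can inspect those defaults with helm show values. A sketch, assuming you registered the chart repository under the alias timescaledb (adjust the alias and version to your setup):

# Print the chart's default values, including patroni.bootstrap.dcs.postgresql.parameters
helm repo add timescaledb https://raw.githubusercontent.com/timescale/timescaledb-kubernetes/master/charts/repo/
helm show values timescaledb/timescaledb-single --version 0.6.2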

Could you share your values.yaml, and the Chart version you're using?

mo-rieger commented 4 years ago

Thanks for your fast response @feikesteenbergen

  1. Did you change the value in the ConfigMap manually? -> no
  2. Did you do a helm upgrade after changing the values? -> yes
  3. What does patronictl list show you inside one of the pods?

Since this is a dev cluster we run only one pod. Here is the output of patronictl list

postgres@default-server-0:~$ patronictl -c /etc/timescaledb/patroni.yaml list
+ Cluster: default-server (6854805892958351409) ----+----+-----------+-----------------+
|      Member      |     Host    |  Role  |  State  | TL | Lag in MB | Pending restart |
+------------------+-------------+--------+---------+----+-----------+-----------------+
| default-server-0 | 10.244.4.55 | Leader | running | 37 |           |        *        |

The pending restart flag made me think. I ran patronictl -c /etc/timescaledb/patroni.yaml restart default-server and voilà, the config finally gets applied as I specified it in the values below.

Now I started experimenting, and observed the following:

If you remove some values from patroni.bootstrap.dcs.postgresql.parameters that you had set before, these values are not reset: they will still be used and will overwrite the ones the tstune initContainer set in timescaledb.conf. I confess max_connections was a bad example, as it will always be overwritten by the chart default value, but this also happens to other values which have no default, like shared_buffers.

If you change a value in patroni.bootstrap.dcs.postgresql.parameters from the list below, patronictl list shows the pending restart flag, and you have to perform the patronictl restart manually. But shouldn't that be done by the patroni-job which runs after the helm upgrade?

values.yaml

apiVersion: helm.fluxcd.io/v1
kind: HelmRelease
metadata:
  name: default-server
  namespace: wpb-database
  annotations:
    fluxcd.io/ignore: 'false'
    test/ns: 'true'
spec:
  releaseName: default-server
  helmVersion: v3
  chart:
    repository: https://raw.githubusercontent.com/timescale/timescaledb-kubernetes/master/charts/repo/
    name: timescaledb-single
    version: 0.6.2
  values:
    replicaCount: 1
    nameOverride: default-server

    image:
      repository: timescaledev/timescaledb-ha
      tag: pg12.3-ts1.7.2-latest
      pullPolicy: IfNotPresent

    secretNames:
      credentials: database-credentials
      certificate: database-certificate
      pgbackrest: pgbackrest-credentials

    backup:
      enabled: true
      pgBackRest:
        # https://pgbackrest.org/configuration.html
        repo1-s3-bucket: timescaledb-backups
        repo1-s3-endpoint: minio.wpb-system
        repo1-s3-host: minio.wpb-system
        repo1-s3-port: 9000
        repo1-s3-verify-tls: 'n'

      # For High Throughput clusters, the archiver will likely not be able to keep up if running synchronously.
      # Therefore, we'll need to enable asynchronous archiving.
      #
      # Not using asynchronous archiving may cause failures far in the future, for example, on a
      # cluster with a WAL Volume of 500GB, and an archiver lagging 200MB/minute, the primary pod will only
      # fail after 2500 minutes, or almost 2 days.
      pgBackRest:archive-push:
        process-max: 4
        archive-async: "y"

      pgBackRest:archive-get:
        process-max: 4
        archive-async: "y"
        archive-get-queue-max: 2GB

    patroni:
      bootstrap:
        dcs:
          # For High Throughput clusters it is likely that the bottleneck will be the recovery
          # of the WAL on the replica's. If the high throughput is continuous, this means
          # the replica's will never catch up.
          # Enabling synchronous_mode will ensure that there is at least 1 synchronous replica,
          # which will ensure that the throughput will be limited to whatever the replica process
          # of the synchronous replica can process.
          #
          # https://patroni.readthedocs.io/en/latest/replication_modes.html#synchronous-mode
          synchronous_mode: true
          master_start_timeout: 0
          postgresql:
            # Using replication slots in high throughput situations has a significant danger:
            # The master may need to retain more WAL than the size of the WAL Volume allows,
            # if a replica is catching up.
            #
            # With backup enabled: while the replica is catching up it will be fetching archives
            # from s3, until s3 is exhausted, after which the replica will switch to streaming
            # from the master. While the replica is not yet streaming (this may take hours),
            # the replication slot reserved for it is stale and will cause WAL to be retained by
            # the master.
            # If we have a backup, we don't need replication slots, as s3 will have the WAL, therefore,
            # if the backup is enabled, we can safely disable the use of replication slots.
            use_slots: false
            parameters:
              # For good performance we want to have a long interval between checkpoints,
              # however for continuous high throughput, the replica's will have very limited
              # capacity to catch up after a crash/restart.
              #
              # For example, if the recovery process of the replica requires on average
              # 90% of the CPU, it could recover 1.11 seconds of WAL per second.
              # If after a crash it is delayed by 5 minutes, it will only catch up in:
              #
              #   300/(1.11 - 1) = 2727 seconds (roughly 45 minutes)
              #
              # We therefore want to pick a middle ground between recoverability of instances
              # and performance, and for now come up with 5 minutes.
              checkpoint_timeout: 310s
              temp_file_limit: 2GB
              # remote_apply will throttle the replica's, see also the comment at patroni.bootstrap.synchronous_mode
              synchronous_commit: remote_apply

              # timescaledb-tune recommendations
              shared_buffers: 1125MB
              effective_cache_size: 3375MB
              maintenance_work_mem: 576000kB
              work_mem: 7680kB
              wal_buffers: 16MB
              min_wal_size: 3GB
              max_wal_size: 6GB
              default_statistics_target: 500
              random_page_cost: 1.1
              checkpoint_completion_target: 0.9
              max_connections: 75
              max_locks_per_transaction: 64
              autovacuum_max_workers: 10
              autovacuum_naptime: 10
              effective_io_concurrency: 200

    loadBalancer:
      enabled: false

    persistentVolumes:
      data:
        size: 20Gi
        storageClass: "-"
      wal:
        size: 10Gi
        storageClass: "-"

    timescaledbTune:
      args:
        wal-disk-size: 10GB

    resources:
      requests:
        cpu: 1000m
        memory: 4500Mi

    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: agentpool
              operator: In
              values:
              - timescaledb
      podAntiAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            topologyKey: "kubernetes.io/hostname"
            labelSelector:
              matchLabels:
                app: default-server
                release: default-server
                cluster-name: default-server
        - weight: 50
          podAffinityTerm:
            topologyKey: failure-domain.beta.kubernetes.io/zone
            labelSelector:
              matchLabels:
                app: default-server
                release: default-server
                cluster-name: default-server
feikesteenbergen commented 4 years ago

but shouldn't that be done by the patroni-job which runs after the helm upgrade?

It depends. In this Helm Chart we chose "no" as an answer. The patroni-job that runs after the upgrade only patches the Patroni dynamic (DCS) configuration with the parameters from your values, by calling the Patroni REST API on one of the pods.

And that's it.

If we were to restart pods during the upgrade process, we would be forcing downtime on you during the helm upgrade command, which would defeat a lot of the purpose of these Helm Charts.

So, if you're going to change an option that requires an instance restart, you will have to do a rolling restart of all the pods (see the Kubernetes Reference), or use another mechanism to trigger Patroni to restart members.
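
A minimal sketch of both approaches, assuming the StatefulSet and Patroni cluster are named after the release (default-server here) and that your Patroni version supports patronictl restart --pending:

# Option 1: rolling restart of all pods via Kubernetes
kubectl rollout restart statefulset/default-server

# Option 2: have Patroni restart only the members flagged with "Pending restart"
kubectl exec -ti pod/default-server-0 -c timescaledb -- \
  patronictl -c /etc/timescaledb/patroni.yaml restart default-server --pending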

Happy for any thoughts though on how this should or could work with these Charts!

mo-rieger commented 4 years ago

Thanks for the explanation, now this makes sense to me.

But I still wonder why the values from tstune never take effect, even the ones that don't require a restart and are not overwritten by the chart defaults, like work_mem or shared_buffers. I suppose it has something to do with the fact that I set them explicitly before and they are not reset properly, so the values from timescaledb.conf are always overwritten with the old explicitly specified values. Does that make sense?

feikesteenbergen commented 4 years ago

In your example you do set work_mem:

apiVersion: helm.fluxcd.io/v1
kind: HelmRelease
spec:
  values:
    patroni:
      bootstrap:
        dcs:
          postgresql:
            parameters:
              work_mem: 7680kB

I can't find shared_buffers though.

To troubleshoot, you could do:

$ kubectl exec -ti pod/<your pod> -c timescaledb -- bash
$ grep shared_buffers $PGDATA/*.conf /var/run/postgresql/timescaledb.conf

For example, this is the output that I get (my shared_buffers are actually set to 7904MB, so I cannot reproduce your issue yet):

/var/lib/postgresql/data/postgresql.base.conf:shared_buffers = 128MB            # min 128kB
/var/lib/postgresql/data/postgresql.base.conf:#wal_buffers = -1         # min 32kB, -1 sets based on shared_buffers
/var/run/postgresql/timescaledb.conf:shared_buffers = 7904MB

$ tr ' ' '\n' < $PGDATA/postmaster.opts

"-D"
"/var/lib/postgresql/data"
"--config-file=/var/lib/postgresql/data/postgresql.conf"
"--listen_addresses=0.0.0.0"
"--port=5432"
"--cluster_name=wal02"
"--wal_level=replica"
"--hot_standby=on"
"--max_connections=100"
"--max_wal_senders=10"
"--max_prepared_transactions=150"
"--max_locks_per_transaction=64"
"--track_commit_timestamp=off"
"--max_replication_slots=10"
"--max_worker_processes=8"
"--wal_log_hints=on"
mo-rieger commented 4 years ago

Sorry for the confusion, when I said:

But I still wonder why the values from tstune never take effect, even the ones that don't require a restart and are not overwritten by the chart defaults, like work_mem or shared_buffers. I suppose it has something to do with the fact that I set them explicitly before and they are not reset properly, so the values from timescaledb.conf are always overwritten with the old explicitly specified values. Does that make sense?

I meant I removed all the values, so my patroni config looked like this:

patroni:
  bootstrap:
    dcs:
      synchronous_mode: true
      master_start_timeout: 0
      postgresql:
        use_slots: false
        parameters:
          checkpoint_timeout: 310s
          temp_file_limit: 2GB
          synchronous_commit: remote_apply

IMO all the tuned values should then come from timescaledb.conf. But as you can see, the values from timescaledb.conf are overwritten by the ones from postgresql.conf, even though I no longer set them in the values.

postgres@default-server-0:~$ grep shared_buffers $PGDATA/*.conf /var/run/postgresql/timescaledb.conf
/var/lib/postgresql/data/postgresql.base.conf:shared_buffers = 128MB                    # min 128kB
/var/lib/postgresql/data/postgresql.base.conf:#wal_buffers = -1                 # min 32kB, -1 sets based on shared_buffers
/var/lib/postgresql/data/postgresql.conf:shared_buffers = '1125MB'
/var/run/postgresql/timescaledb.conf:shared_buffers = 1GB
postgres@default-server-0:~$ tr ' ' '\n' < $PGDATA/postmaster.opts
/usr/lib/postgresql/12/bin/postgres
"-D"
"/var/lib/postgresql/data"
"--config-file=/var/lib/postgresql/data/postgresql.conf"
"--listen_addresses=0.0.0.0"
"--port=5432"
"--cluster_name=default-server"
"--wal_level=replica"
"--hot_standby=on"
"--max_connections=75"
"--max_wal_senders=10"
"--max_prepared_transactions=150"
"--max_locks_per_transaction=64"
"--track_commit_timestamp=off"
"--max_replication_slots=10"
"--max_worker_processes=8"
"--wal_log_hints=on"

This causes timescaledb-tune to have no effect if the values have been set before: those values are never reset and will always overwrite the ones from timescaledb.conf.
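
For reference, a quick way to see which source actually wins for a given parameter, as a sketch to run against the pod (assuming peer authentication on the local socket works for the postgres user):

# pg_settings reports the effective value and where it came from
# ('configuration file', 'command line', 'default', ...)
kubectl exec -ti pod/default-server-0 -c timescaledb -- \
  psql -c "SELECT name, setting, source, sourcefile, sourceline
           FROM pg_settings
           WHERE name IN ('shared_buffers', 'work_mem', 'max_connections');"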

feikesteenbergen commented 4 years ago

Hmm, that seems wrong. Could you provide the output of: kubectl get ep/$RELEASENAME-config -o jsonpath="{.metadata.annotations.config}"

feikesteenbergen commented 4 years ago

My guess is that you have to explicitly clear shared_buffers in the patroni configuration:

If you want to remove (reset) some setting just patch it with null:

https://patroni.readthedocs.io/en/latest/rest_api.html
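
For example, resetting a single stale parameter through that API could look like the sketch below (run from inside the pod; shared_buffers is just an example):

# Patch the dynamic configuration with null to remove the stale setting,
# so the value from timescaledb.conf can take effect again.
curl -s -XPATCH \
  -d '{"postgresql": {"parameters": {"shared_buffers": null}}}' \
  http://localhost:8008/config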

So the way you got into this situation is: you set these parameters explicitly at some point, the Job patched them into the Patroni (DCS) configuration, and when you later removed them from your values the Job never reset them, so the old values remained in the DCS.

I think that is not nice, or even worse, a bug. I think we can also solve that by having the Patroni Config Job that is triggered by Helm not patch a Patroni endpoint (curl -X POST http://pod:8008/config), but actually change the annotation on that endpoint directly, followed by a reload request to Patroni.

But, before we do, I'd really like to see the endpoint config annotation.

Workaround for you for now:

kubectl edit ep/$RELEASENAME-config and remove the shared_buffers section

mo-rieger commented 4 years ago

Thanks @feikesteenbergen !

We already worked around this issue and just set all parameters from timescaledb-tune explicitly in dcs.postgresql.parameters followed by a patronictl restart.

But I set the cluster back to the state I described in the previous post to retrieve the desired endpoint config annotation.

Output of kubectl get ep/$RELEASENAME-config -o jsonpath="{.metadata.annotations.config}"

{
  "loop_wait": 10,
  "master_start_timeout": 0,
  "maximum_lag_on_failover": 33554432,
  "postgresql": {
    "parameters": {
      "archive_command": "/etc/timescaledb/scripts/pgbackrest_archive.sh %p",
      "archive_mode": "on",
      "archive_timeout": "1800s",
      "autovacuum_analyze_scale_factor": 0.02,
      "autovacuum_max_workers": 10,
      "autovacuum_naptime": "5s",
      "autovacuum_vacuum_cost_limit": 500,
      "autovacuum_vacuum_scale_factor": 0.05,
      "checkpoint_timeout": "310s",
      "hot_standby": "on",
      "log_autovacuum_min_duration": "1min",
      "log_checkpoints": "on",
      "log_connections": "on",
      "log_disconnections": "on",
      "log_line_prefix": "%t [%p]: [%c-%l] %u@%d,app=%a [%e] ",
      "log_lock_waits": "on",
      "log_min_duration_statement": "1s",
      "log_statement": "ddl",
      "max_connections": 100,
      "max_prepared_transactions": 150,
      "shared_preload_libraries": "timescaledb,pg_stat_statements",
      "ssl": "on",
      "ssl_cert_file": "/etc/certificate/tls.crt",
      "ssl_key_file": "/etc/certificate/tls.key",
      "synchronous_commit": "remote_apply",
      "tcp_keepalives_idle": 900,
      "tcp_keepalives_interval": 100,
      "temp_file_limit": "2GB",
      "timescaledb.passfile": "../.pgpass",
      "unix_socket_directories": "/var/run/postgresql",
      "unix_socket_permissions": "0750",
      "wal_level": "hot_standby",
      "wal_log_hints": "on",
      "checkpoint_completion_target": 0.9,
      "default_statistics_target": 500,
      "effective_cache_size": "3375MB",
      "effective_io_concurrency": 200,
      "maintenance_work_mem": "576000kB",
      "max_locks_per_transaction": 64,
      "max_wal_size": "6GB",
      "min_wal_size": "3GB",
      "random_page_cost": 1.1,
      "shared_buffers": "1125MB",
      "wal_buffers": "16MB",
      "work_mem": "7680kB"
    },
    "use_pg_rewind": true,
    "use_slots": false
  },
  "retry_timeout": 10,
  "synchronous_mode": true,
  "ttl": 30
}

As you can see, your guess was right: the old values I removed from dcs.postgresql.parameters are still there.

JialuZhang commented 3 years ago

In the configuration file, e.g., postgresql.conf, if you have:

wal_level = 'replica'
track_commit_timestamp = 'off'

The default value of "wal_level" is "replica". In that case, setting "track_commit_timestamp = 'off'" will not influence any system behavior. To make "track_commit_timestamp" take effect, "wal_level" must be changed to something other than the default "replica".

Moreover, if the "wal_level" parameter and the "track_commit_timestamp" parameter are separated in a large postgresql.conf (e.g., with a lot of other parameters in between), this can confuse users.

Looking at the code snippet in PostgreSQL (see below), you can see that the use of "track_commit_timestamp" is guarded by the wal_level check.

if (wal_level != ControlFile->wal_level ...)
    xlrec.track_commit_timestamp = track_commit_timestamp;