Open mo-rieger opened 4 years ago
You might be running into something that is not taken care of (yet) in these Helm charts:
Some of the PostgreSQL parameters must hold the same values on the master and the replicas. For those, values set either in the local patroni configuration files or via the environment variables take no effect. To alter or set their values one must change the shared configuration in the DCS. Below is the actual list of such parameters together with the default values:
https://patroni.readthedocs.io/en/latest/dynamic_configuration.html
This includes max_connections
.
For Patroni to change these values a certain sequence of events need to take place, the value needs to be changed in the DCS (which in these Helm Charts is the -config
Kubernetes endpoint).
If you do a helm upgrade --install RELEASE -f myvalues.yaml
, there's a Job scheduled that will trigger the DCS change.
helm upgrade
after changing the values?patronictl list
show you inside one of the pods?One more thing:
We did not specify them in the patroni.bootstrap.dcs.postgresql.parameters.
But the default values.yaml
does have a list of defaults. If you do not specify any values, the defaults from values.yaml
are being used.
Could you share your values.yaml
, and the Chart version you're using?
Thanks for your fast response @feikesteenbergen
Since this is a dev cluster we run only one pod. Here is the output of patronictl list
postgres@default-server-0:~$ patronictl -c /etc/timescaledb/patroni.yaml list
+ Cluster: default-server (6854805892958351409) ----+----+-----------+-----------------+
| Member | Host | Role | State | TL | Lag in MB | Pending restart |
+------------------+-------------+--------+---------+----+-----------+-----------------+
| default-server-0 | 10.244.4.55 | Leader | running | 37 | | * |
The pending restart flag made me think. I run patronictl -c /etc/timescaledb/patroni.yaml restart default-server
and voilá the config finally gets applied like I specified it in the chart below.
Now I started experimenting, and observed the following:
If you remove some values from the patroni.bootstrap.dcs.postgresql.parameters
you set before. These values are not reset and will still be used and they will overwrite the ones the tstune initContainer set in timescaledb.conf. I confess max_connections
was a bad example, this will always be overwritten by the chart default value, but this also happens to other values which have no default like shared_buffers
.
If you change a value from patroni.bootstrap.dcs.postgresql.parameters
of the list below. patronictl list
shows the pending restart flag, and you have to perform the patroniclt restart manually, but shouldn't that be done by the patroni-job which is running after the helm-upgrade?
values.yaml
apiVersion: helm.fluxcd.io/v1
kind: HelmRelease
metadata:
name: default-server
namespace: wpb-database
annotations:
fluxcd.io/ignore: 'false'
test/ns: 'true'
spec:
releaseName: default-server
helmVersion: v3
chart:
repository: https://raw.githubusercontent.com/timescale/timescaledb-kubernetes/master/charts/repo/
name: timescaledb-single
version: 0.6.2
values:
replicaCount: 1
nameOverride: default-server
image:
repository: timescaledev/timescaledb-ha
tag: pg12.3-ts1.7.2-latest
pullPolicy: IfNotPresent
secretNames:
credentials: database-credentials
certificate: database-certificate
pgbackrest: pgbackrest-credentials
backup:
enabled: true
pgBackRest:
# https://pgbackrest.org/configuration.html
repo1-s3-bucket: timescaledb-backups
repo1-s3-endpoint: minio.wpb-system
repo1-s3-host: minio.wpb-system
repo1-s3-port: 9000
repo1-s3-verify-tls: 'n'
# For High Throughput clusters, the archiver will likely not be able to keep up if running synchronously.
# Therefore, we'll need to enable asynchronous archiving.
#
# Not using asynchronous archiving may cause failures far in the future, for example, on a
# cluster with a WAL Volume of 500GB, and an archiver lagging 200MB/minute, the primary pod will only
# fail after 2500 minutes, or almost 2 days.
pgBackRest:archive-push:
process-max: 4
archive-async: "y"
pgBackRest:archive-get:
process-max: 4
archive-async: "y"
archive-get-queue-max: 2GB
patroni:
bootstrap:
dcs:
# For High Throughput clusters it is likely that the bottleneck will be the recovery
# of the WAL on the replica's. If the high throughput is continuous, this means
# the replica's will never catch up.
# Enabling synchronous_mode will ensure that there is at least 1 synchronous replica,
# which will ensure that the throughput will be limited to whatever the replica process
# of the synchronous replica can process.
#
# https://patroni.readthedocs.io/en/latest/replication_modes.html#synchronous-mode
synchronous_mode: true
master_start_timeout: 0
postgresql:
# Using replication slots in high throughput situations has a significant danger:
# The master may need to retain more WAL than the size of the WAL Volume allows,
# if a replica is catching up.
#
# With backup enabled: while the replica is catching up it will be fetching archives
# from s3, until s3 is exhausted, after which the replica will switch to streaming
# from the master. While the replica is not yet streaming (this may take hours),
# the replication slot reserved for it is stale and will cause WAL to be retained by
# the master.
# If we have a backup, we don't need replication slots, as s3 will have the WAL, therefore,
# if the backup is enabled, we can safely disable the use of replication slots.
use_slots: false
parameters:
# For good performance we want to have a long interval between checkpoints,
# however for continuous high throughput, the replica's will have very limited
# capacity to catch up after a crash/restart.
#
# For example, if the recovery process of the replica requires on average
# 90% of the CPU, it could recover 1.11 seconds of WAL per second.
# If after a crash it is delayed by 5 minutes, it will only catch up in:
#
# 300/(1.11 - 1) = 2727 seconds (roughly 45 minutes)
#
# We therefore want to pick a middle ground between recoverability of instances
# and performance, and for now come up with 5 minutes.
checkpoint_timeout: 310s
temp_file_limit: 2GB
# remote_apply will throttle the replica's, see also the comment at patroni.bootstrap.synchronous_mode
synchronous_commit: remote_apply
# timescaledb-tune recommendations
shared_buffers: 1125MB
effective_cache_size: 3375MB
maintenance_work_mem: 576000kB
work_mem: 7680kB
wal_buffers: 16MB
min_wal_size: 3GB
max_wal_size: 6GB
default_statistics_target: 500
random_page_cost: 1.1
checkpoint_completion_target: 0.9
max_connections: 75
max_locks_per_transaction: 64
autovacuum_max_workers: 10
autovacuum_naptime: 10
effective_io_concurrency: 200
loadBalancer:
enabled: false
persistentVolumes:
data:
size: 20Gi
storageClass: "-"
wal:
size: 10Gi
storageClass: "-"
timescaledbTune:
args:
wal-disk-size: 10GB
resources:
requests:
cpu: 1000m
memory: 4500Mi
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: agentpool
operator: In
values:
- timescaledb
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
topologyKey: "kubernetes.io/hostname"
labelSelector:
matchLabels:
app: default-server
release: default-server
cluster-name: default-server
- weight: 50
podAffinityTerm:
topologyKey: failure-domain.beta.kubernetes.io/zone
labelSelector:
matchLabels:
app: default-server
release: default-server
cluster-name: default-server
but shouldn't that be done by the patroni-job which is running after the helm-upgrade?
It depends. In this Helm Chart we choose no as an answer. The patroni-job that is running does the following:
work_mem
)And that's it.
As soon as we would restart pods during the upgrade process, we would be enforcing a downtime on you during the helm upgrade
command, which would defeat a lot of the purpose of these Helm Charts.
So, if you're going to change an option that requires a instance restart, you will have to do a rolling restart of all the pods, see the Kubernetes Reference, or use another mechanism to trigger Patroni to restart members.
Happy for any thoughts though on how this should or could work with these Charts!
Thanks for the explanation, now this makes sense to me.
But I still wonder why the values from tstune never take effect, even the ones that don't require a restart and are not overwritten by the chart defaults, like work_mem
or shared_buffers
. I suppose it has something to do with the fact that I set them explicitly before and they are not reset properly, so the values from timescaledb.config are always overwritten with the old explicitly specified values. Does that make sense?
In your example you do set work_mem
:
apiVersion: helm.fluxcd.io/v1
kind: HelmRelease
spec:
values:
patroni:
bootstrap:
dcs:
postgresql:
parameters:
work_mem: 7680kB
I can't find shared_buffers
though.
To troubleshoot, you could do:
$ kubectl exec -ti pod/<your pod> -c timescaledb -- bash
$ grep shared_buffers $PGDATA/*.conf /var/run/postgresql/timescaledb.conf
For example, this is the output that I get, (my shared_buffers
are actually set to 7904MB
, so I cannot reproduce your issue yet):
/var/lib/postgresql/data/postgresql.base.conf:shared_buffers = 128MB # min 128kB
/var/lib/postgresql/data/postgresql.base.conf:#wal_buffers = -1 # min 32kB, -1 sets based on shared_buffers
/var/run/postgresql/timescaledb.conf:shared_buffers = 7904MB
$ tr ' ' '\n' < $PGDATA/postmaster.opts
"-D"
"/var/lib/postgresql/data"
"--config-file=/var/lib/postgresql/data/postgresql.conf"
"--listen_addresses=0.0.0.0"
"--port=5432"
"--cluster_name=wal02"
"--wal_level=replica"
"--hot_standby=on"
"--max_connections=100"
"--max_wal_senders=10"
"--max_prepared_transactions=150"
"--max_locks_per_transaction=64"
"--track_commit_timestamp=off"
"--max_replication_slots=10"
"--max_worker_processes=8"
"--wal_log_hints=on"
Sorry for the confusion, when I said:
But I still wonder why the values from tstune never take effect, even the ones that don't require a restart and are not overwritten by the chart defaults, like work_mem or shared_buffers. I suppose it has something to do with the fact that I set them explicitly before and they are not reset properly, so the values from timescaledb.config are always overwritten with the old explicitly specified values. Does that make sense?
I meant I removed all the values, so my patroni config looked like this:
patroni:
bootstrap:
dcs:
synchronous_mode: true
master_start_timeout: 0
postgresql:
use_slots: false
parameters:
checkpoint_timeout: 310s
temp_file_limit: 2GB
synchronous_commit: remote_apply
Imo all the tuned values should then come from timescaledb.conf
.
But as you can see the values of timescaledb.conf
are getting overwritten by the ones from postgresql.conf
even though I did not set them in the values anymore.
postgres@default-server-0:~$ grep shared_buffers $PGDATA/*.conf /var/run/postgresql/timescaledb.conf
postgres@default-server-0:~$ grep shared_buffers $PGDATA/*.conf /var/run/postgresql/timescaledb.conf
/var/lib/postgresql/data/postgresql.base.conf:shared_buffers = 128MB # min 128kB
/var/lib/postgresql/data/postgresql.base.conf:#wal_buffers = -1 # min 32kB, -1 sets based on shared_buffers
/var/lib/postgresql/data/postgresql.conf:shared_buffers = '1125MB'
/var/run/postgresql/timescaledb.conf:shared_buffers = 1GB
postgres@default-server-0:~$ tr ' ' '\n' < $PGDATA/postmaster.opts
/usr/lib/postgresql/12/bin/postgres
"-D"
"/var/lib/postgresql/data"
"--config-file=/var/lib/postgresql/data/postgresql.conf"
"--listen_addresses=0.0.0.0"
"--port=5432"
"--cluster_name=default-server"
"--wal_level=replica"
"--hot_standby=on"
"--max_connections=75"
"--max_wal_senders=10"
"--max_prepared_transactions=150"
"--max_locks_per_transaction=64"
"--track_commit_timestamp=off"
"--max_replication_slots=10"
"--max_worker_processes=8"
"--wal_log_hints=on"
This causes timescaledb-tune to have no effect if the values have been set before, because these values are not reset somewhere and will always overwrite those of timesaledb.conf.
Hmm, that seems wrong,could you provide:
kubectl get ep/$RELEASENAME-config -o jsonpath="{.metadata.annotations.config}"
My guess is that you have to explicitly clear shared_buffers
in the patroni configuration:
If you want to remove (reset) some setting just patch it with null:
https://patroni.readthedocs.io/en/latest/rest_api.html
So the way you got into this situation is:
shared_buffers
configured in the dcs.postgresql.parameters
config
annotation of the config
endpointshared_buffers
is still set to that original valueI think that is not nice, or even worse a bug. I think we can also solve that, by not having the Patroni Config Job that is triggered by Helm to patch a Patroni endpoint (curl -X POST http://pod:8008/config
), but by actually changing the annotation on that endpoint directly, followed by a reload request to Patroni.
But, before we do, I'd really like to see the endpoint config
annotation.
Workaround for you for now:
kubectl edit ep/$RELEASENAME-config
and remove the shared_buffers
section
Thanks @feikesteenbergen !
We already worked around this issue and just set all parameters from timescaledb-tune explicitly in dcs.postgresql.parameters
followed by a patronictl restart.
But I set the cluster back in the state I described in the previous post to retrieve the desired endpoint config
annotation.
Output of kubectl get ep/$RELEASENAME-config -o jsonpath="{.metadata.annotations.config}"
{
"loop_wait": 10,
"master_start_timeout": 0,
"maximum_lag_on_failover": 33554432,
"postgresql": {
"parameters": {
"archive_command": "/etc/timescaledb/scripts/pgbackrest_archive.sh %p",
"archive_mode": "on",
"archive_timeout": "1800s",
"autovacuum_analyze_scale_factor": 0.02,
"autovacuum_max_workers": 10,
"autovacuum_naptime": "5s",
"autovacuum_vacuum_cost_limit": 500,
"autovacuum_vacuum_scale_factor": 0.05,
"checkpoint_timeout": "310s",
"hot_standby": "on",
"log_autovacuum_min_duration": "1min",
"log_checkpoints": "on",
"log_connections": "on",
"log_disconnections": "on",
"log_line_prefix": "%t [%p]: [%c-%l] %u@%d,app=%a [%e] ",
"log_lock_waits": "on",
"log_min_duration_statement": "1s",
"log_statement": "ddl",
"max_connections": 100,
"max_prepared_transactions": 150,
"shared_preload_libraries": "timescaledb,pg_stat_statements",
"ssl": "on",
"ssl_cert_file": "/etc/certificate/tls.crt",
"ssl_key_file": "/etc/certificate/tls.key",
"synchronous_commit": "remote_apply",
"tcp_keepalives_idle": 900,
"tcp_keepalives_interval": 100,
"temp_file_limit": "2GB",
"timescaledb.passfile": "../.pgpass",
"unix_socket_directories": "/var/run/postgresql",
"unix_socket_permissions": "0750",
"wal_level": "hot_standby",
"wal_log_hints": "on",
"checkpoint_completion_target": 0.9,
"default_statistics_target": 500,
"effective_cache_size": "3375MB",
"effective_io_concurrency": 200,
"maintenance_work_mem": "576000kB",
"max_locks_per_transaction": 64,
"max_wal_size": "6GB",
"min_wal_size": "3GB",
"random_page_cost": 1.1,
"shared_buffers": "1125MB",
"wal_buffers": "16MB",
"work_mem": "7680kB"
},
"use_pg_rewind": true,
"use_slots": false
},
"retry_timeout": 10,
"synchronous_mode": true,
"ttl": 30
}%
As you can see your guess was right, there are still the old values I removed from the dcs.postgresql.parameters
.
In the configuration file, e.g., postgresql.conf, if you have:
wal_level = 'replica' track_commit_timestamp = 'off'
The default value of “wal_level” is “replica”. Then, having setting “ track_commit_timestamp = 'off' “ will not influence any system behavior. To enable configuration “track_commit_timestamp” the “wal_level” must be changed other than the default “'replica'”.
Moreover, if the “wal_level” parameter and the “track_commit_timestamp” parameter are seperated in a large postgresql.conf (e.g., a lot of other parameters are in between), this will bring confusion to users.
Looking at the code snippet in PostgreSQL (see below), you can find that usage of “track_commit_timestamp” is "disabled".
if (wal_level != ControlFile-> wal_level...) Xlrec.track_commit_timestamp = track_commit_timestamp;
Hi, we are running the timescaledb-single chart (v0.6.2) with pg12.3-ts1.7.2-latest image on Kubernetes. We realized that all tuned values from
/var/run/postgresql/timescaledb.conf
are included in postgresql.conf. As I read in the patroni docs, patroni renames the postgresql.conf to postgresql.base.conf.When we look at the current postgresql.conf of a running pod we see that all tuned values are overwritten as the postgresql.base.conf file is included at the top of the file.
We have no clue where the values that override the tuned values come from. We did not specify them in the
patroni.bootstrap.dcs.postgresql.parameters
.timescaledb.conf
postgres.conf
patroni.yaml
Even when we specify values, for example,
max_connections
explicitly inpatroni.bootstrap.dcs.postgresql.parameters
thepatroni.yaml
reflect these values but they still get overwritten by some value set in postgresql.conf.We restarted the pod and the whole helm release but this had no effect on the postgresql config.
Any Idea what we are missing here?