timescale / helm-charts

Configuration and Documentation to run TimescaleDB in your Kubernetes cluster

ERROR: ObjectCache.run ProtocolError caused Postgres instance restart #561

Closed: ebrodie closed this issue 1 year ago

ebrodie commented 1 year ago

**What did you do?** No action was taken; the DB cluster had been running as normal overnight, with no unusual activity before the crash/restart.

**Did you expect to see something different?** Yes: the instance restarted due to a crash.

**Environment**: Production.

    replicaLoadBalancer:
      enabled: True
      annotations:
        service.beta.kubernetes.io/aws-load-balancer-internal: "true"
        service.beta.kubernetes.io/aws-load-balancer-connection-idle-timeout: "4000"

    loadBalancer:
      annotations:
        service.beta.kubernetes.io/aws-load-balancer-internal: "true"
        service.beta.kubernetes.io/aws-load-balancer-connection-idle-timeout: "4000"

    patroni:
      bootstrap:
        method: restore_or_initdb
        restore_or_initdb:
          command: >
            /etc/timescaledb/scripts/restore_or_initdb.sh
            --encoding=UTF8
            --locale=C.UTF-8
            --wal-segsize=256
        dcs:
          synchronous_mode: true
          master_start_timeout: 0
          postgresql:
            use_slots: false
            parameters:
              archive_timeout: 1h
              checkpoint_timeout: 600s
              temp_file_limit: '200GB'
              synchronous_commit: remote_apply
              synchronous_commit: local
              #synchronous_commit: remote_write
              max_connections: '${max_connections}'
              wal_keep_segments: '${wal_keep_segments}'
              min_wal_size: '150GB'
              max_standby_archive_delay: 1200000
              max_standby_streaming_delay: 1200000
              # Add logging for testing issue with Lambdas
              log_statement: 'all'
              log_directory: 'pg_log'
              log_filename: 'postgresql.%H%M.log'
              logging_collector: on
              log_min_error_statement: error
              log_truncate_on_rotation: on
              log_rotation_age: 60
              log_rotation_size: 1000000
              #memory set by ts_tune
              #work_mem: 8192kB
              #shared_buffers: 6GB
              #maintenance_work_mem: 1GB
              #effective_cache_size: 16GB

    persistentVolumes:
      data:
        size: '${ebs_vol_size}'
      wal:
        enabled: True
        size: '${wal_vol_size}'

    timescaledbTune:
      enabled: true

    sharedMemory:
      useMount: false

    backup:
      enabled: true
      pgBackRest:archive-push:
        process-max: 4
        archive-async: "y"
      pgBackRest:archive-get:
        process-max: 4
        archive-async: "y"
        archive-get-queue-max: 2GB
      jobs:
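For reference, values like the ones above are normally applied by pointing `helm upgrade` at the `timescaledb-single` chart from this repository. A minimal sketch follows, assuming the excerpt is saved as `values.yaml`; the release name, namespace, and chart repo URL are placeholders/assumptions rather than details taken from this report:

    # Register the Timescale chart repository (URL assumed) and apply the values
    # file to a new or existing release. "my-release" and "my-namespace" are
    # placeholder names for illustration only.
    helm repo add timescale https://charts.timescale.com
    helm repo update
    helm upgrade --install my-release timescale/timescaledb-single \
      --namespace my-namespace \
      --values values.yaml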


* Kubernetes version information:

  `kubectl version`

      Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.2", GitCommit:"f5743093fd1c663cb0cbc89748f730662345d44d", GitTreeState:"clean", BuildDate:"2020-09-16T13:41:02Z", GoVersion:"go1.15", Compiler:"gc", Platform:"darwin/amd64"}
      Server Version: version.Info{Major:"1", Minor:"21+", GitVersion:"v1.21.14-eks-ffeb93d", GitCommit:"f76e2b475d1433cdb6bd546e9e8f129fde938fb7", GitTreeState:"clean", BuildDate:"2022-11-29T18:41:00Z", GoVersion:"go1.16.15", Compiler:"gc", Platform:"linux/amd64"}


* Kubernetes cluster kind:

Kubernetes cluster created in AWS EKS. It has been live for years.

**Anything else we need to know?**:
The full error we saw before the instance restarted was:
`2023-01-26 02:25:43,972 ERROR: ObjectCache.run ProtocolError('Connection broken: IncompleteRead(0 bytes read)', IncompleteRead(0 bytes read))`
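For anyone debugging a similar restart, here is a hedged sketch of commands that can pull more context from the pod around that timestamp; the release, pod, and container names are placeholders, not values taken from this cluster:

    # Placeholder pod/container names; adjust to the actual release.
    # PostgreSQL/Patroni log output around the time of the error:
    kubectl logs my-release-timescaledb-0 -c timescaledb --since-time="2023-01-26T02:20:00Z"
    # Recent pod events (restarts, probe failures, evictions):
    kubectl describe pod my-release-timescaledb-0
    # Cluster topology and member state as Patroni sees it:
    kubectl exec -it my-release-timescaledb-0 -c timescaledb -- patronictl list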
nhudson commented 1 year ago

Can you explain how this is potentially a helm chart issue? Which service did that error come from, TimescaleDB or Patroni?

ebrodie commented 1 year ago

TimescaleDB logs.

I'm not fully sure it's a helm chart issue. I could not find any information on this error anywhere else, so I thought I'd give it a shot.

nhudson commented 1 year ago

It might be best to open an issue on the Patroni or TimescaleDB tracker instead. This seems like either a Patroni or TimescaleDB error, not an issue with the helm chart itself.

ebrodie commented 1 year ago

ok, will do, thanks!