timescale / timescaledb

An open-source time-series SQL database optimized for fast ingest and complex queries. Packaged as a PostgreSQL extension.
https://www.timescale.com/

[Bug]: server crash: DETAIL: Failed process was running: autovacuum: ANALYZE _timescaledb_internal._hyper_1553_1989_chunk #6780

Closed fpattis closed 7 months ago

fpattis commented 7 months ago

What type of bug is this?

Crash

What subsystems and features are affected?

Other

What happened?

The TimescaleDB server crashes roughly 1-2 times an hour. The writing application only gets a "connection closed" error, while the TimescaleDB server shows the attached logs.

TimescaleDB version affected

2.9.1

PostgreSQL version used

14.6

What operating system did you use?

helm chart with image: "timescale/timescaledb-ha:pg14.6-ts2.9.1-p1"

What installation method did you use?

Other

What platform did you run on?

Microsoft Azure Cloud

Relevant log output and stack trace

2024-03-15 07:46:35 UTC [30]: [65f2f0bb.1e-416] @,app= [00000] LOG:  server process (PID 47440) was terminated by signal 9: Killed
2024-03-15 07:46:35 UTC [30]: [65f2f0bb.1e-417] @,app= [00000] DETAIL:  Failed process was running: autovacuum: ANALYZE _timescaledb_internal._hyper_1553_1989_chunk
2024-03-15 07:46:35 UTC [30]: [65f2f0bb.1e-418] @,app= [00000] LOG:  terminating any other active server processes
2024-03-15 07:46:35 UTC [30]: [65f2f0bb.1e-419] @,app= [00000] LOG:  all server processes terminated; reinitializing
2024-03-15 07:46:36 UTC [47470]: [65f3fcdc.b96e-1] @,app= [00000] LOG:  database system was interrupted; last known up at 2024-03-15 07:38:46 UTC
2024-03-15 07:46:36 UTC [47470]: [65f3fcdc.b96e-2] @,app= [00000] LOG:  database system was not properly shut down; automatic recovery in progress
2024-03-15 07:46:36 UTC [47470]: [65f3fcdc.b96e-3] @,app= [00000] LOG:  redo starts at 79/AE322860
2024-03-15 07:46:37 UTC [47470]: [65f3fcdc.b96e-4] @,app= [00000] LOG:  invalid record length at 79/C08650F8: wanted 24, got 0
2024-03-15 07:46:37 UTC [47470]: [65f3fcdc.b96e-5] @,app= [00000] LOG:  redo done at 79/C0863620 system usage: CPU: user: 0.86 s, system: 0.23 s, elapsed: 1.10 s
2024-03-15 07:46:37 UTC [47470]: [65f3fcdc.b96e-6] @,app= [00000] LOG:  checkpoint starting: end-of-recovery immediate
2024-03-15 07:46:37,886 ERROR: Exception when called state_handler.last_operation()
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/patroni/ha.py", line 157, in update_lock
    last_lsn = self.state_handler.last_operation()
  File "/usr/lib/python3/dist-packages/patroni/postgresql/__init__.py", line 993, in last_operation
    return self._wal_position(self.is_leader(), self._cluster_info_state_get('wal_position'),
  File "/usr/lib/python3/dist-packages/patroni/postgresql/__init__.py", line 378, in is_leader
    return bool(self._cluster_info_state_get('timeline'))
  File "/usr/lib/python3/dist-packages/patroni/postgresql/__init__.py", line 340, in _cluster_info_state_get
    result = self._is_leader_retry(self._query, self.cluster_info_query).fetchone()
  File "/usr/lib/python3/dist-packages/patroni/utils.py", line 334, in __call__
    return func(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/patroni/postgresql/__init__.py", line 278, in _query
    raise e
  File "/usr/lib/python3/dist-packages/patroni/postgresql/__init__.py", line 267, in _query
    cursor.execute(sql, params or None)
psycopg2.DatabaseError: server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.

How can we reproduce the bug?

No idea. We are running two TimescaleDB instances; this instance contains tables with lots of columns (up to 5k).

mkindahl commented 7 months ago

Hello @fpattis and thank you for the bug report.

If you look in the log, the backend was terminated by SIGKILL (signal 9), so either the Linux out-of-memory killer reaped the backend, or something killed it explicitly for some other reason.
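
To spot this pattern without reading the log by hand, the postmaster lines can be scanned for backends terminated by a signal. The following is only a minimal sketch: the function name `find_signal_kills` and the two regular expressions are illustrative assumptions based on the log lines quoted in this report, not part of PostgreSQL, Patroni, or TimescaleDB.

```python
import re

# Assumed patterns, derived from the log excerpt in this report.
TERMINATED = re.compile(
    r"server process \(PID (?P<pid>\d+)\) was terminated by signal (?P<signal>\d+)"
)
DETAIL = re.compile(r"DETAIL:\s+Failed process was running: (?P<query>.*)$")


def find_signal_kills(lines):
    """Return one dict (pid, signal, query) per signal-terminated backend."""
    events = []
    for line in lines:
        terminated = TERMINATED.search(line)
        if terminated:
            events.append({
                "pid": int(terminated["pid"]),
                "signal": int(terminated["signal"]),
                "query": None,  # filled in by the DETAIL line that follows, if any
            })
            continue
        detail = DETAIL.search(line)
        if detail and events and events[-1]["query"] is None:
            events[-1]["query"] = detail["query"].strip()
    return events


# Two lines copied from the log output in this report.
sample = [
    "2024-03-15 07:46:35 UTC [30]: [65f2f0bb.1e-416] @,app= [00000] LOG:  "
    "server process (PID 47440) was terminated by signal 9: Killed",
    "2024-03-15 07:46:35 UTC [30]: [65f2f0bb.1e-417] @,app= [00000] DETAIL:  "
    "Failed process was running: autovacuum: ANALYZE "
    "_timescaledb_internal._hyper_1553_1989_chunk",
]

for event in find_signal_kills(sample):
    print(event)
```

An event with signal 9 only says that the backend was killed; whether the out-of-memory killer did it still has to be confirmed from the kernel log or the container's memory-limit events.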

fpattis commented 7 months ago

Thank you very much for your help. I doubled the memory and so far the issue has not appeared again 🤗