timescale / timescaledb

An open-source time-series SQL database optimized for fast ingest and complex queries. Packaged as a PostgreSQL extension.
https://www.timescale.com/

[Bug]: Segmentation fault when querying compressed hypertable with join after updating timescaledb from 2.9.3 to 2.10.0 #5434

Closed mickael-choisnet closed 1 year ago

mickael-choisnet commented 1 year ago

What type of bug is this?

Crash

What subsystems and features are affected?

Compression

What happened?

I updated the timescaledb extension from 2.9.3 to 2.10.0 on my VM with apt install and then updated the extension on my database with the ALTER EXTENSION timescaledb UPDATE command while the compression jobs were in progress (I didn't check beforehand and didn't know they were running). The jobs crashed with a FATAL trace due to the upgrade. They resumed their runs after a new session was created to use the new version of the timescaledb extension. I have traces of successful queries targeting hypertables with joins during the process.

After the jobs finished, almost all queries targeting a hypertable with compressed chunks and joining another table crash the database with a 'segmentation fault' error and put it into recovery mode.

I couldn't reproduce this behavior with a new database and some fake data, but I only found out this morning that the update was done while the jobs were in progress, so I'll try again.
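
For reference, one way to check whether TimescaleDB background jobs (such as the compression policies) are still running before issuing ALTER EXTENSION ... UPDATE is to look at pg_stat_activity. A minimal sketch, assuming the policy workers are identifiable from backend_type (the exact strings may vary by version):

-- list background workers that look like TimescaleDB policy jobs
SELECT pid, backend_type, state, query
FROM pg_stat_activity
WHERE backend_type ILIKE '%policy%'
   OR backend_type ILIKE '%timescaledb%';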

Here is a backtrace of the segfault error when I try a simple select with a join on a hypertable:

Continuing.

Program received signal SIGSEGV, Segmentation fault.
0x000055e76670b52e in CheckVarSlotCompatibility (attnum=1, vartype=25, slot=<optimized out>) at ./build/../src/backend/executor/execExprInterp.c:1904
1904    ./build/../src/backend/executor/execExprInterp.c: No such file or directory.
#0  0x000055e76670b52e in CheckVarSlotCompatibility (attnum=1, vartype=25, slot=<optimized out>) at ./build/../src/backend/executor/execExprInterp.c:1904
#1  0x000055e76670bbc4 in CheckVarSlotCompatibility (slot=0x0, vartype=<optimized out>, attnum=<optimized out>) at ./build/../src/backend/executor/execExprInterp.c:1902
#2  CheckExprStillValid (state=state@entry=0x55e76735d7d8, econtext=econtext@entry=0x55e76735e0a8) at ./build/../src/backend/executor/execExprInterp.c:1868
#3  0x000055e76670bbeb in ExecInterpExprStillValid (state=0x55e76735d7d8, econtext=0x55e76735e0a8, isNull=0x7ffd497afe57) at ./build/../src/backend/executor/execExprInterp.c:1818
#4  0x000055e76673686c in ExecEvalExpr (isNull=0x7ffd497afe57, econtext=0x55e76735e0a8, state=<optimized out>) at ./build/../src/include/executor/executor.h:320
#5  ExecIndexEvalRuntimeKeys (econtext=econtext@entry=0x55e76735e0a8, runtimeKeys=<optimized out>, numRuntimeKeys=<optimized out>) at ./build/../src/backend/executor/nodeIndexscan.c:634
#6  0x000055e766736986 in ExecReScanIndexScan (node=node@entry=0x55e767027388) at ./build/../src/backend/executor/nodeIndexscan.c:568
#7  0x000055e766705c38 in ExecReScan (node=0x55e767027388) at ./build/../src/backend/executor/execAmi.c:183
#8  0x000055e766736215 in ExecIndexScan (pstate=0x55e767027388) at ./build/../src/backend/executor/nodeIndexscan.c:530
#9  0x00007f36b77325e0 in ExecProcNode (node=0x55e767027388) at /usr/include/postgresql/14/server/executor/executor.h:257
#10 decompress_chunk_create_tuple (state=0x55e76701ac68) at ./tsl/src/nodes/decompress_chunk/exec.c:426
#11 decompress_chunk_exec (node=0x55e76701ac68) at ./tsl/src/nodes/decompress_chunk/exec.c:373
#12 0x000055e7667292fa in ExecProcNode (node=0x55e76701ac68) at ./build/../src/include/executor/executor.h:257
#13 ExecAppend (pstate=0x55e76701a038) at ./build/../src/backend/executor/nodeAppend.c:360
#14 0x000055e766740485 in ExecProcNode (node=0x55e76701a038) at ./build/../src/include/executor/executor.h:257
#15 ExecNestLoop (pstate=0x55e767019e98) at ./build/../src/backend/executor/nodeNestloop.c:109
#16 0x000055e766737fc9 in ExecProcNode (node=0x55e767019e98) at ./build/../src/include/executor/executor.h:257
#17 ExecLimit (pstate=0x55e767019ba8) at ./build/../src/backend/executor/nodeLimit.c:96
#18 0x000055e766713333 in ExecProcNode (node=0x55e767019ba8) at ./build/../src/include/executor/executor.h:257
#19 ExecutePlan (execute_once=<optimized out>, dest=0x55e767325628, direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x55e767019ba8, estate=0x55e7670198d8) at ./build/../src/backend/executor/execMain.c:1551
#20 standard_ExecutorRun (queryDesc=0x55e7670389e8, direction=<optimized out>, count=0, execute_once=<optimized out>) at ./build/../src/backend/executor/execMain.c:361
#21 0x000055e76688d38c in PortalRunSelect (portal=0x55e766f21458, forward=<optimized out>, count=0, dest=<optimized out>) at ./build/../src/backend/tcop/pquery.c:921
#22 0x000055e76688e71a in PortalRun (portal=0x55e766f21458, count=9223372036854775807, isTopLevel=<optimized out>, run_once=<optimized out>, dest=0x55e767325628, altdest=0x55e767325628, qc=0x7ffd497b0340) at ./build/../src/backend/tcop/pquery.c:765
#23 0x000055e76688a9a5 in exec_simple_query () at ./build/../src/backend/tcop/postgres.c:1213
#24 0x000055e76688c23b in PostgresMain () at ./build/../src/backend/tcop/postgres.c:4508
#25 0x000055e766809db6 in BackendRun (port=<optimized out>, port=<optimized out>) at ./build/../src/backend/postmaster/postmaster.c:4537
#26 BackendStartup (port=<optimized out>) at ./build/../src/backend/postmaster/postmaster.c:4259
#27 ServerLoop () at ./build/../src/backend/postmaster/postmaster.c:1745
#28 0x000055e76680ac44 in PostmasterMain (argc=5, argv=<optimized out>) at ./build/../src/backend/postmaster/postmaster.c:1417
#29 0x000055e76657c762 in main (argc=5, argv=0x55e766e6ec00) at ./build/../src/backend/main/main.c:209
Continuing.

Program terminated with signal SIGSEGV, Segmentation fault.
The program no longer exists.

I'll leave the Postgres logs showing the jobs crashing below.

TimescaleDB version affected

2.10.0

PostgreSQL version used

PostgreSQL 14.5 (Debian 14.5-1.pgdg100+1)

What operating system did you use?

Debian GNU/Linux 10 (buster) (10.13)

What installation method did you use?

Deb/Apt

What platform did you run on?

Microsoft Azure Cloud

Relevant log output and stack trace

2023-02-27 09:19:41 UTC [11043-1] [unknown]@[unknown],client=[local],app=[unknown] LOG:  connection received: host=[local]
2023-02-27 09:19:41 UTC [11043-2] postgres@rec-db,client=[local],app=[unknown] LOG:  connection authenticated: identity="postgres" method=peer (/etc/postgresql/14/main/pg_hba.conf:93)
2023-02-27 09:19:41 UTC [11043-3] postgres@rec-db,client=[local],app=[unknown] LOG:  connection authorized: user=postgres database=rec-db application_name=psql
2023-02-27 09:19:41 UTC [11044-1] [unknown]@[unknown],client=127.0.0.1,app=[unknown] LOG:  connection received: host=127.0.0.1 port=52736
2023-02-27 09:19:43 UTC [818-5288] LOG:  checkpoint complete: wrote 88 buffers (0.0%); 0 WAL file(s) added, 2 removed, 31 recycled; write=3.446 s, sync=0.974 s, total=4.703 s; sync files=16, longest=0.555 s, average=0.061 s; distance=547253 kB, estimate=563166 kB
2023-02-27 09:19:43 UTC [818-5289] LOG:  checkpoints are occurring too frequently (5 seconds apart)
2023-02-27 09:19:43 UTC [818-5290] HINT:  Consider increasing the configuration parameter "max_wal_size".
2023-02-27 09:19:43 UTC [818-5291] LOG:  checkpoint starting: wal
2023-02-27 09:19:46 UTC [11050-1] [unknown]@[unknown],client=127.0.0.1,app=[unknown] LOG:  connection received: host=127.0.0.1 port=51145
2023-02-27 09:19:47 UTC [11051-1] [unknown]@[unknown],client=127.0.0.1,app=[unknown] LOG:  connection received: host=127.0.0.1 port=52829
2023-02-27 09:19:48 UTC [826-4] FATAL:  terminating background worker "TimescaleDB Background Worker Scheduler" due to administrator command
2023-02-27 09:19:48 UTC [830-263] FATAL:  terminating background worker "Compression Policy [1053]" due to administrator command
2023-02-27 09:19:48 UTC [830-264] CONTEXT:  SQL statement "SELECT public.compress_chunk(chunk_rec.oid)"
    PL/pgSQL function _timescaledb_internal.policy_compression_execute(integer,integer,anyelement,integer,boolean,boolean) line 60 at PERFORM
    SQL statement "CALL _timescaledb_internal.policy_compression_execute(
            job_id, htid, lag_value::INTERVAL,
            maxchunks, verbose_log, recompress_enabled
          )"
    PL/pgSQL function _timescaledb_internal.policy_compression(integer,jsonb) line 51 at CALL
2023-02-27 09:19:48 UTC [831-225] FATAL:  terminating background worker "Compression Policy [1055]" due to administrator command
2023-02-27 09:19:48 UTC [831-226] CONTEXT:  SQL statement "SELECT public.decompress_chunk(chunk_rec.oid, if_compressed => true)"
    PL/pgSQL function _timescaledb_internal.policy_compression_execute(integer,integer,anyelement,integer,boolean,boolean) line 49 at PERFORM
    SQL statement "CALL _timescaledb_internal.policy_compression_execute(
            job_id, htid, lag_value::INTERVAL,
            maxchunks, verbose_log, recompress_enabled
          )"
    PL/pgSQL function _timescaledb_internal.policy_compression(integer,jsonb) line 51 at CALL
2023-02-27 09:19:48 UTC [919-293] FATAL:  terminating background worker "Compression Policy [1052]" due to administrator command
2023-02-27 09:19:48 UTC [919-294] CONTEXT:  SQL statement "SELECT public.decompress_chunk(chunk_rec.oid, if_compressed => true)"
    PL/pgSQL function _timescaledb_internal.policy_compression_execute(integer,integer,anyelement,integer,boolean,boolean) line 49 at PERFORM
    SQL statement "CALL _timescaledb_internal.policy_compression_execute(
            job_id, htid, lag_value::INTERVAL,
            maxchunks, verbose_log, recompress_enabled
          )"
    PL/pgSQL function _timescaledb_internal.policy_compression(integer,jsonb) line 51 at CALL
2023-02-27 09:19:48 UTC [830-265] LOG:  temporary file: path "base/pgsql_tmp/pgsql_tmp830.131", size 0
2023-02-27 09:19:48 UTC [779-6] LOG:  background worker "TimescaleDB Background Worker Scheduler" (PID 826) exited with exit code 1
2023-02-27 09:19:48 UTC [818-5292] LOG:  checkpoint complete: wrote 65 buffers (0.0%); 0 WAL file(s) added, 3 removed, 31 recycled; write=3.368 s, sync=0.952 s, total=4.672 s; sync files=5, longest=0.554 s, average=0.191 s; distance=556399 kB, estimate=562489 kB
2023-02-27 09:19:48 UTC [779-7] LOG:  background worker "Compression Policy [1055]" (PID 831) exited with exit code 1
2023-02-27 09:19:48 UTC [779-8] LOG:  background worker "Compression Policy [1052]" (PID 919) exited with exit code 1
2023-02-27 09:19:48 UTC [779-9] LOG:  background worker "Compression Policy [1053]" (PID 830) exited with exit code 1
2023-02-27 09:19:48 UTC [11052-1] FATAL:  terminating background worker "TimescaleDB Background Worker Scheduler" due to administrator command
2023-02-27 09:19:48 UTC [779-10] LOG:  background worker "TimescaleDB Background Worker Scheduler" (PID 11052) exited with exit code 1
2023-02-27 09:19:49 UTC [11043-4] postgres@rec-db,client=[local],app=psql LOG:  duration: 1474.348 ms  statement: ALTER EXTENSION timescaledb UPDATE;

How can we reproduce the bug?

Not determined yet...
lkshminarayanan commented 1 year ago

Hi @mickael-choisnet,

Thank you for reporting the issue. I tried emulating a similar scenario, upgrading while a compression job was running, but was never able to reproduce the job crash. Were you able to reproduce it again?

Can you also provide the exact query that is causing the crash and the related table definitions if possible?

mickael-choisnet commented 1 year ago

Hi @lkshminarayanan,

I did not successfully reproduce my initial bug, but I managed to crash the job when updating the timescaledb extension.

2023-03-15 08:38:43 UTC [7882-1] [unknown]@[unknown],client=[local],app=[unknown] LOG:  connection received: host=[local]
2023-03-15 08:38:43 UTC [7882-2] postgres@test-crash-db,client=[local],app=[unknown] LOG:  connection authenticated: identity="postgres" method=peer (/etc/postgresql/14/main/pg_hba.conf:90)
2023-03-15 08:38:43 UTC [7882-3] postgres@test-crash-db,client=[local],app=[unknown] LOG:  connection authorized: user=postgres database=test-crash-db application_name=psql
2023-03-15 08:38:45 UTC [7890-1] [unknown]@[unknown],client=127.0.0.1,app=[unknown] LOG:  connection received: host=127.0.0.1 port=56835
2023-03-15 08:38:46 UTC [7892-1] [unknown]@[unknown],client=127.0.0.1,app=[unknown] LOG:  connection received: host=127.0.0.1 port=62508
2023-03-15 08:38:51 UTC [7895-1] [unknown]@[unknown],client=127.0.0.1,app=[unknown] LOG:  connection received: host=127.0.0.1 port=57019
2023-03-15 08:38:52 UTC [7897-1] [unknown]@[unknown],client=127.0.0.1,app=[unknown] LOG:  connection received: host=127.0.0.1 port=62623
2023-03-15 08:38:53 UTC [7219-1] FATAL:  terminating background worker "TimescaleDB Background Worker Scheduler" due to administrator command
2023-03-15 08:38:53 UTC [7208-6] LOG:  background worker "TimescaleDB Background Worker Scheduler" (PID 7219) exited with exit code 1
2023-03-15 08:38:53 UTC [7866-1] FATAL:  terminating background worker "Compression Policy [1000]" due to administrator command
2023-03-15 08:38:53 UTC [7866-2] CONTEXT:  SQL statement "SELECT public.compress_chunk(chunk_rec.oid)"
    PL/pgSQL function _timescaledb_internal.policy_compression_execute(integer,integer,anyelement,integer,boolean,boolean) line 60 at PERFORM
    SQL statement "CALL _timescaledb_internal.policy_compression_execute(
            job_id, htid, lag_value::INTERVAL,
            maxchunks, verbose_log, recompress_enabled
          )"
    PL/pgSQL function _timescaledb_internal.policy_compression(integer,jsonb) line 51 at CALL
2023-03-15 08:38:53 UTC [7208-7] LOG:  background worker "Compression Policy [1000]" (PID 7866) exited with exit code 1
2023-03-15 08:38:53 UTC [7898-1] FATAL:  terminating background worker "TimescaleDB Background Worker Scheduler" due to administrator command
2023-03-15 08:38:53 UTC [7208-8] LOG:  background worker "TimescaleDB Background Worker Scheduler" (PID 7898) exited with exit code 1
2023-03-15 08:38:53 UTC [7882-4] postgres@test-crash-db,client=[local],app=psql LOG:  duration: 936.297 ms  statement: ALTER EXTENSION timescaledb UPDATE;
2023-03-15 08:38:54 UTC [7232-5] postgres@test-crash-db,client=172.16.160.135,app=pgAdmin 4 - DB:test-crash-db FATAL:  "timescaledb" already loaded with a different version
2023-03-15 08:38:54 UTC [7232-6] postgres@test-crash-db,client=172.16.160.135,app=pgAdmin 4 - DB:test-crash-db DETAIL:  The new version is "2.10.0", this session is using version "2.9.3". The session will be restarted.

But afterwards it did not resume on its own and appears to be broken. When I did the update on my original database, it restarted on its own. I couldn't get it to restart and eventually dropped the compression policy and recreated it... The new job ran and did not reproduce the state of this problem, but maybe if you try it like I did, you'll end up with my initial issue!

The queries causing the crash are the same as the ones I give at the end of the steps below, and the table structures in my database are almost the same as the ones in my example (I just lightened them up and used shorter names).

Reproduce steps

Here is the way to reproduce this behavior:

  1. create a new database with timescaledb 2.9.3

    CREATE DATABASE "test-crash-db";
    -- connect to the new database (e.g. \c test-crash-db), then:
    CREATE EXTENSION timescaledb VERSION '2.9.3';
  2. create tables

    CREATE TABLE public.measures (
       measure_date TIMESTAMP NOT NULL,
       received_date TIMESTAMP NOT NULL,
       value DOUBLE PRECISION,
       meter_id TEXT NOT NULL,
       source SMALLINT NOT NULL,
       device_type SMALLINT NULL,
       device_id TEXT NULL
    );
    SELECT create_hypertable('measures', 'measure_date', chunk_time_interval => INTERVAL '5 DAYS', create_default_indexes => false);
    ALTER TABLE measures SET (
       timescaledb.compress,
       timescaledb.compress_segmentby = 'meter_id'
    );
    SELECT add_compression_policy(hypertable => 'measures', compress_after => INTERVAL '14 DAYS', if_not_exists => true);
    CREATE INDEX ix_measures_meter_id_measure_date_received_date ON public.measures (meter_id, measure_date DESC, received_date DESC);
    CREATE TABLE IF NOT EXISTS public.meter(
       id TEXT,
       date_install TIMESTAMP,
       serial_number TEXT,
       PRIMARY KEY(id)
    );
  3. create proc to generate measures

    CREATE OR REPLACE PROCEDURE public.load_measures (IN gs_start INTEGER, IN gs_end INTEGER) AS
    $BODY$
    BEGIN
       WITH RECURSIVE some_days AS (
           SELECT '2023-01-01'::TIMESTAMP AS rdate 
           UNION ALL 
           SELECT rdate + INTERVAL '1 DAY'
           FROM some_days
           WHERE extract(YEAR FROM rdate + INTERVAL '1 DAY') = 2023 AND extract(MONTH FROM rdate + INTERVAL '1 DAY') <= 2
       )
       INSERT INTO measures(measure_date, received_date, value, meter_id, source, device_type, device_id)
       SELECT
           d.rdate + (floor(random() * 23) || ' HOURS')::INTERVAL AS measure_date, 
           current_timestamp::TIMESTAMP AS received_date, 
           floor(random() * 10000 + 1000) AS value, 
           CAST(gs.id as TEXT) AS meter_id, 
           5 AS source, 42 AS device_type, 'ABCDEF' AS device_id
       FROM (SELECT generate_series(gs_start, gs_end) AS id) gs
       INNER JOIN lateral (SELECT * FROM some_days) d ON true;
    END;
    $BODY$ 
    LANGUAGE PLPGSQL;
  4. load some data (82000000 measures / 2000000 meters ~13min)

    INSERT INTO meter(id, date_install, serial_number)
    SELECT 
       CAST(gs.id AS TEXT) AS id, 
       '2022-01-01'::TIMESTAMP + (floor(random() * 364) || ' DAYS')::INTERVAL AS date_install,
       floor(random() * 10000000 + 1000000) AS serial_number
    FROM (SELECT generate_series(1000000,2999999) AS id) gs;
    
    CALL load_measures(gs_start => 1000000, gs_end => 2999999);
  5. wait for the AUTO VACUUM to end

  6. trigger the compression job (and wait until it ends)

    DO
    $BODY$
       DECLARE
           job_id  INT;
       BEGIN
           SELECT job.id INTO job_id
           FROM _timescaledb_config.bgw_job job
           INNER JOIN _timescaledb_catalog.hypertable hypertable ON hypertable.id = job.hypertable_id
           WHERE hypertable.table_name = 'measures';
    
           PERFORM alter_job(job_id, next_start=> current_timestamp);
       END;
    $BODY$;
  7. wait for the AUTO VACUUM to end

  8. Load some more data into the compressed chunks (wait for the AUTO VACUUM triggered at the end to finish.)

    CALL load_measures(gs_start => 1000000, gs_end => 1199999);
  9. trigger the compression job again to compress the new data and, while it runs (wait a few seconds to be sure it is in the middle of processing data), open a new connection to update timescaledb to 2.10.0

    DO
    $BODY$
       DECLARE
           job_id  INT;
       BEGIN
           SELECT job.id INTO job_id
           FROM _timescaledb_config.bgw_job job
           INNER JOIN _timescaledb_catalog.hypertable hypertable ON hypertable.id = job.hypertable_id
           WHERE hypertable.table_name = 'measures';
    
           PERFORM alter_job(job_id, next_start=> current_timestamp);
       END;
    $BODY$;
    -- from another connection, for example: psql -X -d test-crash-db
    ALTER EXTENSION timescaledb UPDATE;
  10. from there, it depends whether the job crashes and whether it restarts on its own (see the job-status check sketch after the sample queries)...

  11. sample queries that make the database crash (to check the issue)

    SELECT 
       time_bucket('1 MONTH', measures.measure_date) AS measure_day,
       AVG(measures."value") AS average,
       count(*) AS nb
    FROM measures
    INNER JOIN meter ON meter.id = measures.meter_id
    WHERE meter.id = '1134567' AND measures.measure_date > '2023-01-05' AND measures.measure_date < '2023-02-25' 
    GROUP BY measure_day;
    
    SELECT measures.*
    FROM measures
    INNER JOIN meter ON meter.id = measures.meter_id
    WHERE meter.id = '1134567' AND (measures.measure_date <= '2023-01-05' OR measures.measure_date >= '2023-02-25')
    ORDER BY measure_date DESC;
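
As mentioned in step 10, whether the compression job restarts on its own can be checked after the upgrade. A minimal sketch, assuming the standard timescaledb_information.jobs and job_stats views are available:

-- show the status of the compression policy job(s)
SELECT js.job_id, js.job_status, js.last_run_status
FROM timescaledb_information.job_stats js
JOIN timescaledb_information.jobs j USING (job_id)
WHERE j.proc_name = 'policy_compression';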
lkshminarayanan commented 1 year ago

Hi @mickael-choisnet,

It is possible that the indexes of the measures table got corrupted during the compression job crash. Can you try rebuilding the indexes and see if that fixes your issue?

You can use the following command to rebuild the indexes :

REINDEX TABLE measures;
mickael-choisnet commented 1 year ago

Hi @lkshminarayanan,

Unfortunately, the original database where the issue appeared has been rolled back to timescaledb 2.9.3, which solved the issue. I think it is unlikely that the index was involved, because at the time I tried dropping my hypertables and creating new ones, and the issue was still there with these new hypertables.

kgyrtkirk commented 1 year ago

@mickael-choisnet I've tried to reproduce the issue based on your instructions - no luck so far; I've tried pg14 and pg15

the only thing I didn't explicitly do is wait for the vacuum to finish - do you already know that this issue doesn't happen while a vacuum is still running?

kgyrtkirk commented 1 year ago

I was re-reading your instructions, and something caught my eye:

from there it depends if the job crashes and if it restarts on its own...

could you tell me a bit more about this? I haven't seen the job restart after being stopped during the upgrade

konskov commented 1 year ago

Hello @mickael-choisnet,

Thank you for all your help in debugging this issue. Unfortunately we haven't been able to reproduce the segfault even with the provided steps. If you are able to consistently reproduce the issue, we have a couple of suggestions that we would like to ask you to try if possible to help us better understand the situation:

Could you please share the plan output for the failing queries (EXPLAIN query) in version 2.10 and in version 2.9.3 where you do not see the issue?

Also, as mentioned before, we think the state of the index might be the reason you are experiencing the segfault, so we’d like to rule out that possibility if we can. One idea we'd like to test would be to completely disable index scans, forcing sequential scans instead, to see if that makes the issue go away in 2.10 (by setting enable_indexscan = false).
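
A minimal sketch of that test, using one of the sample queries from the reproduction steps (the setting is session-local only):

SET enable_indexscan = off;

EXPLAIN
SELECT measures.*
FROM measures
INNER JOIN meter ON meter.id = measures.meter_id
WHERE meter.id = '1134567'
  AND (measures.measure_date <= '2023-01-05' OR measures.measure_date >= '2023-02-25')
ORDER BY measure_date DESC;

RESET enable_indexscan;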

And lastly, to rule out that possibility, trying the failing queries again after rebuilding the index.

We will continue to investigate, but as we haven’t managed to reproduce the issue yet, if it's possible for you to provide the above that would be a great help. Thank you!

mickael-choisnet commented 1 year ago

Hi guys,

Thank you for trying so hard to find and solve this issue!

@kgyrtkirk

do you already know that in case vaccum is still running this issue doesn't happen?

No, this was just to be in the same conditions my database was in when the issue happened. I cannot tell if it is relevant to the use case or not... Just that there was no VACUUM ongoing, according to the logs and to the fact that the VACUUM commands happen after the jobs complete.

could you tell me a bit more about this? I haven't seen it restarting after its being stopped during the upgrade

It's just that, since I have been unable to reproduce the bug, I suppose this is somehow linked to the fact that in the original scenario the job restarted on its own. But it is a wild guess and may well be a wrong one.

@konskov

I found out that I still had a backup of the database, light enough to be restored on my testing environment. After restoring this backup (it is a backup from Monday the 27th, so the database had already been downgraded to timescaledb 2.9.3), I upgraded timescaledb back to 2.10.0, tried what you suggested, and here are the results.

Results

The query

SELECT ix.*
FROM indexes ix
INNER JOIN patrimoine_appareil p ON p.id = ix.appareil_id
WHERE ix.date_mesure > '2023-01-01' 
LIMIT 100;

The query still makes the database crash with a segmentation fault error (only with 2.10.0; with timescale 2.9.3 everything works fine). The query does not crash the database if I set enable_indexscan to false for my SQL session! So it seems to be the index, as you suggested. But I tried to REINDEX both tables (indexes and patrimoine_appareil) and the query still crashes the database when enable_indexscan is on...

@lkshminarayanan it seems like it was a good hint to suspect the indexes, but rebuilding them does not fix the issue...

Explain (timescaledb 2.10.0 / enable_indexscan on)

'Limit  (cost=0.87..16.02 rows=100 width=53)'
'  ->  Nested Loop  (cost=0.87..14294117.48 rows=94392067 width=53)'
'        ->  Append  (cost=0.44..792549.64 rows=29804155 width=53)'
'              ->  Custom Scan (DecompressChunk) on _hyper_28_28_chunk ix_1  (cost=0.44..0.44 rows=1000 width=54)'
'                    Filter: ((date_mesure > ''2023-01-01 00:00:00''::timestamp without time zone) AND (id = appareil_id))'
'                    ->  Index Scan using compress_hyper_29_62_chunk__compressed_hypertable_29_appareil_i on compress_hyper_29_62_chunk  (cost=0.42..0.44 rows=1 width=287)'
'                          Index Cond: (appareil_id = id)'
'                          Filter: (_ts_meta_max_1 > ''2023-01-01 00:00:00''::timestamp without time zone)'
'              ->  Seq Scan on _hyper_28_28_chunk ix_1  (cost=0.00..1449.60 rows=530 width=54)'
'                    Filter: (date_mesure > ''2023-01-01 00:00:00''::timestamp without time zone)'
'              ->  Custom Scan (DecompressChunk) on _hyper_28_30_chunk ix_2  (cost=0.03..0.70 rows=21000 width=70)'
'                    Filter: ((date_mesure > ''2023-01-01 00:00:00''::timestamp without time zone) AND (id = appareil_id))'
'                    ->  Index Scan using compress_hyper_29_64_chunk__compressed_hypertable_29_appareil_i on compress_hyper_29_64_chunk  (cost=0.28..0.70 rows=21 width=293)'
'                          Index Cond: (appareil_id = id)'
'                          Filter: (_ts_meta_max_1 > ''2023-01-01 00:00:00''::timestamp without time zone)'
'              ->  Seq Scan on _hyper_28_30_chunk ix_2  (cost=0.00..7.40 rows=272 width=70)'
'                    Filter: (date_mesure > ''2023-01-01 00:00:00''::timestamp without time zone)'
'              ->  Custom Scan (DecompressChunk) on _hyper_28_31_chunk ix_3  (cost=0.44..0.44 rows=1000 width=55)'
'                    Filter: ((date_mesure > ''2023-01-01 00:00:00''::timestamp without time zone) AND (id = appareil_id))'
'                    ->  Index Scan using compress_hyper_29_65_chunk__compressed_hypertable_29_appareil_i on compress_hyper_29_65_chunk  (cost=0.41..0.44 rows=1 width=288)'
'                          Index Cond: (appareil_id = id)'
'                          Filter: (_ts_meta_max_1 > ''2023-01-01 00:00:00''::timestamp without time zone)'
'              ->  Seq Scan on _hyper_28_31_chunk ix_3  (cost=0.00..684.58 rows=28286 width=55)'
'                    Filter: (date_mesure > ''2023-01-01 00:00:00''::timestamp without time zone)'
'              ->  Seq Scan on _hyper_28_32_chunk ix_4  (cost=0.00..6163.70 rows=268536 width=54)'
'                    Filter: (date_mesure > ''2023-01-01 00:00:00''::timestamp without time zone)'
'              ->  Custom Scan (DecompressChunk) on _hyper_28_33_chunk ix_5  (cost=0.07..41.02 rows=562000 width=66)'
'                    Filter: (date_mesure > ''2023-01-01 00:00:00''::timestamp without time zone)'
'                    ->  Seq Scan on compress_hyper_29_1086_chunk  (cost=0.00..41.02 rows=562 width=293)'
'                          Filter: (_ts_meta_max_1 > ''2023-01-01 00:00:00''::timestamp without time zone)'
'              ->  Seq Scan on _hyper_28_34_chunk ix_6  (cost=0.00..342956.53 rows=15615748 width=53)'
'                    Filter: (date_mesure > ''2023-01-01 00:00:00''::timestamp without time zone)'
'              ->  Seq Scan on _hyper_28_945_chunk ix_7  (cost=0.00..292224.46 rows=13305783 width=53)'
'                    Filter: (date_mesure > ''2023-01-01 00:00:00''::timestamp without time zone)'
'        ->  Index Only Scan using patrimoine_appareil_pkey on patrimoine_appareil p  (cost=0.43..0.45 rows=1 width=10)'
'              Index Cond: (id = ix_1.appareil_id)'

Explain (timescaledb 2.10.0 / enable_indexscan off)

'Limit  (cost=4.17..160.99 rows=100 width=53)'
'  ->  Nested Loop  (cost=4.17..148023900.52 rows=94392067 width=53)'
'        ->  Append  (cost=3.73..1127295.83 rows=94421155 width=53)'
'              ->  Custom Scan (DecompressChunk) on _hyper_28_28_chunk ix_1  (cost=3.73..7317.27 rows=1961000 width=54)'
'                    Filter: (date_mesure > ''2023-01-01 00:00:00''::timestamp without time zone)'
'                    ->  Seq Scan on compress_hyper_29_62_chunk  (cost=0.00..7317.27 rows=1961 width=287)'
'                          Filter: (_ts_meta_max_1 > ''2023-01-01 00:00:00''::timestamp without time zone)'
'              ->  Seq Scan on _hyper_28_28_chunk ix_1  (cost=0.00..1449.60 rows=530 width=54)'
'                    Filter: (date_mesure > ''2023-01-01 00:00:00''::timestamp without time zone)'
'              ->  Custom Scan (DecompressChunk) on _hyper_28_30_chunk ix_2  (cost=0.06..445.93 rows=7434000 width=70)'
'                    Filter: (date_mesure > ''2023-01-01 00:00:00''::timestamp without time zone)'
'                    ->  Seq Scan on compress_hyper_29_64_chunk  (cost=0.00..445.93 rows=7434 width=293)'
'                          Filter: (_ts_meta_max_1 > ''2023-01-01 00:00:00''::timestamp without time zone)'
'              ->  Seq Scan on _hyper_28_30_chunk ix_2  (cost=0.00..7.40 rows=272 width=70)'
'                    Filter: (date_mesure > ''2023-01-01 00:00:00''::timestamp without time zone)'
'              ->  Custom Scan (DecompressChunk) on _hyper_28_31_chunk ix_3  (cost=0.07..3899.56 rows=55245000 width=55)'
'                    Filter: (date_mesure > ''2023-01-01 00:00:00''::timestamp without time zone)'
'                    ->  Seq Scan on compress_hyper_29_65_chunk  (cost=0.00..3899.56 rows=55245 width=288)'
'                          Filter: (_ts_meta_max_1 > ''2023-01-01 00:00:00''::timestamp without time zone)'
'              ->  Seq Scan on _hyper_28_31_chunk ix_3  (cost=0.00..684.58 rows=28286 width=55)'
'                    Filter: (date_mesure > ''2023-01-01 00:00:00''::timestamp without time zone)'
'              ->  Seq Scan on _hyper_28_32_chunk ix_4  (cost=0.00..6163.70 rows=268536 width=54)'
'                    Filter: (date_mesure > ''2023-01-01 00:00:00''::timestamp without time zone)'
'              ->  Custom Scan (DecompressChunk) on _hyper_28_33_chunk ix_5  (cost=0.07..41.02 rows=562000 width=66)'
'                    Filter: (date_mesure > ''2023-01-01 00:00:00''::timestamp without time zone)'
'                    ->  Seq Scan on compress_hyper_29_1086_chunk  (cost=0.00..41.02 rows=562 width=293)'
'                          Filter: (_ts_meta_max_1 > ''2023-01-01 00:00:00''::timestamp without time zone)'
'              ->  Seq Scan on _hyper_28_34_chunk ix_6  (cost=0.00..342956.53 rows=15615748 width=53)'
'                    Filter: (date_mesure > ''2023-01-01 00:00:00''::timestamp without time zone)'
'              ->  Seq Scan on _hyper_28_945_chunk ix_7  (cost=0.00..292224.46 rows=13305783 width=53)'
'                    Filter: (date_mesure > ''2023-01-01 00:00:00''::timestamp without time zone)'
'        ->  Bitmap Heap Scan on patrimoine_appareil p  (cost=0.44..1.56 rows=1 width=10)'
'              Recheck Cond: (id = ix_1.appareil_id)'
'              ->  Bitmap Index Scan on patrimoine_appareil_pkey  (cost=0.00..0.44 rows=1 width=0)'
'                    Index Cond: (id = ix_1.appareil_id)'

Explain (timescaledb 2.9.3 / enable_indexscan on)

'Limit  (cost=4.17..50.66 rows=100 width=53)'
'  ->  Nested Loop  (cost=4.17..43885519.73 rows=94392067 width=53)'
'        ->  Append  (cost=3.73..1125008.81 rows=94392067 width=53)'
'              ->  Custom Scan (DecompressChunk) on _hyper_28_28_chunk ix_1  (cost=3.73..7317.27 rows=1961000 width=54)'
'                    Filter: (date_mesure > ''2023-01-01 00:00:00''::timestamp without time zone)'
'                    ->  Seq Scan on compress_hyper_29_62_chunk  (cost=0.00..7317.27 rows=1961 width=287)'
'                          Filter: (_ts_meta_max_1 > ''2023-01-01 00:00:00''::timestamp without time zone)'
'              ->  Custom Scan (DecompressChunk) on _hyper_28_30_chunk ix_2  (cost=0.06..445.93 rows=7434000 width=70)'
'                    Filter: (date_mesure > ''2023-01-01 00:00:00''::timestamp without time zone)'
'                    ->  Seq Scan on compress_hyper_29_64_chunk  (cost=0.00..445.93 rows=7434 width=293)'
'                          Filter: (_ts_meta_max_1 > ''2023-01-01 00:00:00''::timestamp without time zone)'
'              ->  Custom Scan (DecompressChunk) on _hyper_28_31_chunk ix_3  (cost=0.07..3899.56 rows=55245000 width=55)'
'                    Filter: (date_mesure > ''2023-01-01 00:00:00''::timestamp without time zone)'
'                    ->  Seq Scan on compress_hyper_29_65_chunk  (cost=0.00..3899.56 rows=55245 width=288)'
'                          Filter: (_ts_meta_max_1 > ''2023-01-01 00:00:00''::timestamp without time zone)'
'              ->  Seq Scan on _hyper_28_32_chunk ix_4  (cost=0.00..6163.70 rows=268536 width=54)'
'                    Filter: (date_mesure > ''2023-01-01 00:00:00''::timestamp without time zone)'
'              ->  Custom Scan (DecompressChunk) on _hyper_28_33_chunk ix_5  (cost=0.07..41.02 rows=562000 width=66)'
'                    Filter: (date_mesure > ''2023-01-01 00:00:00''::timestamp without time zone)'
'                    ->  Seq Scan on compress_hyper_29_1086_chunk  (cost=0.00..41.02 rows=562 width=293)'
'                          Filter: (_ts_meta_max_1 > ''2023-01-01 00:00:00''::timestamp without time zone)'
'              ->  Seq Scan on _hyper_28_34_chunk ix_6  (cost=0.00..342956.53 rows=15615748 width=53)'
'                    Filter: (date_mesure > ''2023-01-01 00:00:00''::timestamp without time zone)'
'              ->  Seq Scan on _hyper_28_945_chunk ix_7  (cost=0.00..292224.46 rows=13305783 width=53)'
'                    Filter: (date_mesure > ''2023-01-01 00:00:00''::timestamp without time zone)'
'        ->  Index Only Scan using patrimoine_appareil_pkey on patrimoine_appareil p  (cost=0.43..0.45 rows=1 width=10)'
'              Index Cond: (id = ix_1.appareil_id)'

Explain (timescaledb 2.9.3 / enable_indexscan off)

'Limit  (cost=4.17..160.94 rows=100 width=53)'
'  ->  Nested Loop  (cost=4.17..147976359.57 rows=94392067 width=53)'
'        ->  Append  (cost=3.73..1125008.81 rows=94392067 width=53)'
'              ->  Custom Scan (DecompressChunk) on _hyper_28_28_chunk ix_1  (cost=3.73..7317.27 rows=1961000 width=54)'
'                    Filter: (date_mesure > ''2023-01-01 00:00:00''::timestamp without time zone)'
'                    ->  Seq Scan on compress_hyper_29_62_chunk  (cost=0.00..7317.27 rows=1961 width=287)'
'                          Filter: (_ts_meta_max_1 > ''2023-01-01 00:00:00''::timestamp without time zone)'
'              ->  Custom Scan (DecompressChunk) on _hyper_28_30_chunk ix_2  (cost=0.06..445.93 rows=7434000 width=70)'
'                    Filter: (date_mesure > ''2023-01-01 00:00:00''::timestamp without time zone)'
'                    ->  Seq Scan on compress_hyper_29_64_chunk  (cost=0.00..445.93 rows=7434 width=293)'
'                          Filter: (_ts_meta_max_1 > ''2023-01-01 00:00:00''::timestamp without time zone)'
'              ->  Custom Scan (DecompressChunk) on _hyper_28_31_chunk ix_3  (cost=0.07..3899.56 rows=55245000 width=55)'
'                    Filter: (date_mesure > ''2023-01-01 00:00:00''::timestamp without time zone)'
'                    ->  Seq Scan on compress_hyper_29_65_chunk  (cost=0.00..3899.56 rows=55245 width=288)'
'                          Filter: (_ts_meta_max_1 > ''2023-01-01 00:00:00''::timestamp without time zone)'
'              ->  Seq Scan on _hyper_28_32_chunk ix_4  (cost=0.00..6163.70 rows=268536 width=54)'
'                    Filter: (date_mesure > ''2023-01-01 00:00:00''::timestamp without time zone)'
'              ->  Custom Scan (DecompressChunk) on _hyper_28_33_chunk ix_5  (cost=0.07..41.02 rows=562000 width=66)'
'                    Filter: (date_mesure > ''2023-01-01 00:00:00''::timestamp without time zone)'
'                    ->  Seq Scan on compress_hyper_29_1086_chunk  (cost=0.00..41.02 rows=562 width=293)'
'                          Filter: (_ts_meta_max_1 > ''2023-01-01 00:00:00''::timestamp without time zone)'
'              ->  Seq Scan on _hyper_28_34_chunk ix_6  (cost=0.00..342956.53 rows=15615748 width=53)'
'                    Filter: (date_mesure > ''2023-01-01 00:00:00''::timestamp without time zone)'
'              ->  Seq Scan on _hyper_28_945_chunk ix_7  (cost=0.00..292224.46 rows=13305783 width=53)'
'                    Filter: (date_mesure > ''2023-01-01 00:00:00''::timestamp without time zone)'
'        ->  Bitmap Heap Scan on patrimoine_appareil p  (cost=0.44..1.56 rows=1 width=10)'
'              Recheck Cond: (id = ix_1.appareil_id)'
'              ->  Bitmap Index Scan on patrimoine_appareil_pkey  (cost=0.00..0.44 rows=1 width=0)'
'                    Index Cond: (id = ix_1.appareil_id)'
mickael-choisnet commented 1 year ago

Mind you, I am not sure whether we can rely on these results. This morning, after starting my testing environment with the database I restored yesterday (producing the results I gave you), I tried my query again, and it works fine (with timescale 2.10.0). I don't know if some background tasks fixed this or if the backup is too recent, given that the initial issue occurred 2 weeks ago. Or maybe downgrading to 2.9.3 and back to 2.10.0 does not put my database in exactly the same state as when I first encountered this issue. But the first time, my database kept this problem for several days, hence the downgrade to try to fix it (with success) so I could use it again...

konskov commented 1 year ago

Thank you very much for your continued assistance in debugging this @mickael-choisnet.

I would say the evidence points towards the state of the index being responsible for the segfault (when index scans are disabled on v2.10 the query returns the expected results, and the explain output for v2.9.3 shows that no index scan is performed; a sequential scan is used for the compressed chunks instead…)

I don’t know if you are still able to try out a few things for this issue... but if you are:

If the reindex command was applied to the hypertable (measures in the example you shared), it would not affect the compressed chunks and their indexes, because these are "children" of the compressed hypertable, which is associated with the original hypertable but is a different table itself. So indeed, if reindex was performed on the original hypertable, it would not make a difference.

In this case the index we would like to rebuild is on a compressed chunk, so the way to rebuild it is either to call REINDEX TABLE on the compressed chunk itself, which would look something like REINDEX TABLE _timescaledb_internal.compress_hyper_29_1086_chunk for a particular chunk, or to call REINDEX on the parent hypertable of the compressed chunk, which reindexes all compressed chunks and would look like REINDEX TABLE _timescaledb_internal._compressed_hypertable_29.

Would it be possible to try the above (after re-enabling index scans) and see if the issue is fixed? Thank you!
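
If it helps, here is a sketch for locating the compressed hypertable behind a given hypertable, assuming the internal catalog used elsewhere in this thread exposes a compressed_hypertable_id column (catalog names and columns may differ between versions):

-- find the compressed hypertable associated with 'measures'
SELECT format('%I.%I', ct.schema_name, ct.table_name) AS compressed_hypertable
FROM _timescaledb_catalog.hypertable ht
JOIN _timescaledb_catalog.hypertable ct ON ct.id = ht.compressed_hypertable_id
WHERE ht.table_name = 'measures';

-- the result can then be passed to REINDEX, e.g.
-- REINDEX TABLE _timescaledb_internal._compressed_hypertable_29;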

mickael-choisnet commented 1 year ago

Hello @konskov,

Sorry for the late response. I didn't have time before today to check what you wanted.

I restored my backup to test what you suggested, but unfortunately it didn't work.

To be precise, I followed these steps:

  1. Restore the backup
  2. Check that my query works (timescaledb 2.9.3)
  3. Update timescaledb to 2.10.0
  4. Check that my query still causes a seg fault crash
  5. Reindex all the involved tables

    -- First time only these 3
    REINDEX TABLE public.patrimoine_appareil;
    REINDEX TABLE public."indexes";
    REINDEX TABLE _timescaledb_internal._compressed_hypertable_29;
    
    -- but to be sure I've done it again targeting every chunk (compressed or not) > 2023-01-01 (on top of the first 3 instructions)
    REINDEX TABLE _timescaledb_internal._hyper_28_945_chunk;
    REINDEX TABLE _timescaledb_internal._hyper_28_34_chunk;
    REINDEX TABLE _timescaledb_internal._hyper_28_32_chunk;
    REINDEX TABLE _timescaledb_internal._hyper_28_33_chunk;
    REINDEX TABLE _timescaledb_internal._hyper_28_31_chunk;
    REINDEX TABLE _timescaledb_internal._hyper_28_30_chunk;
    REINDEX TABLE _timescaledb_internal._hyper_28_28_chunk;
    
    REINDEX TABLE _timescaledb_internal.compress_hyper_29_1265_chunk;
    REINDEX TABLE _timescaledb_internal.compress_hyper_29_1086_chunk;
    REINDEX TABLE _timescaledb_internal.compress_hyper_29_65_chunk;
    REINDEX TABLE _timescaledb_internal.compress_hyper_29_64_chunk;
    REINDEX TABLE _timescaledb_internal.compress_hyper_29_62_chunk;
  6. Checked my query (still makes the database crash)

I even tried to VACUUM ANALYZE all the tables/chunks involved after the REINDEX TABLE to be sure.

mickael-choisnet commented 1 year ago

Hello,

It appears that decompressing/recompressing all the chunks fixes the issue for a while, but eventually it comes back.
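
For reference, a sketch of that decompress/recompress workaround, using the example hypertable name from the reproduction steps and the public compress_chunk/decompress_chunk API (decompressing everything can be expensive on large hypertables):

-- decompress all chunks of the hypertable, then compress them again
SELECT decompress_chunk(c, if_compressed => true)
FROM show_chunks('measures') AS c;

SELECT compress_chunk(c, if_not_compressed => true)
FROM show_chunks('measures') AS c;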

It came to my attention last week that our development database had the same issue. We have less data in this database, so I tried the decompression/compression. The issue came back a few days later (it can still be fixed with decompression/compression).

I tried to upgrade timescaledb to 2.10.1, but it does not fix it.

On Thursday, I had to push a migration that dropped all my hypertables and recreated them (same names but new tables). I loaded them and everything was fine, but on Friday the issue came back...

Is it possible I damaged my cluster somehow when upgrading the timescaledb extension? I am pretty sure that at some point I tried to drop and recreate the database and still got the issue afterward...

antekresic commented 1 year ago

Hi there.

Great news! I was able to reproduce your issue, and a fix has already landed on our main branch. We should be releasing a new version (2.11) soon that includes that fix.

Thank you for reporting this!

mickael-choisnet commented 1 year ago

Thank you so much @antekresic

I look forward to testing that on my environment.

antekresic commented 1 year ago

Just a heads up, to correct myself: the fix has landed in 2.10.3, which has already been released.

Happy testing and have a nice weekend! :)
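
For anyone following along, upgrading to the patched release would look like the sketch below; ALTER EXTENSION should be the first command in a fresh session, run in every database that has the extension installed:

ALTER EXTENSION timescaledb UPDATE TO '2.10.3';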

erimatnor commented 1 year ago

Closing since fix is out.