Hi @mickael-choisnet,
Thank you for reporting the issue. I tried emulating a similar scenario, where I upgraded while a compression job was running, but was never able to reproduce the job crash. Were you able to reproduce it again?
Can you also provide the exact query that is causing the crash and the related table definitions if possible?
Hi @lkshminarayanan,
I did not successfully reproduce my initial bug but I managed to crash the job when updating the timescaledb extension.
2023-03-15 08:38:43 UTC [7882-1] [unknown]@[unknown],client=[local],app=[unknown] LOG: connection received: host=[local]
2023-03-15 08:38:43 UTC [7882-2] postgres@test-crash-db,client=[local],app=[unknown] LOG: connection authenticated: identity="postgres" method=peer (/etc/postgresql/14/main/pg_hba.conf:90)
2023-03-15 08:38:43 UTC [7882-3] postgres@test-crash-db,client=[local],app=[unknown] LOG: connection authorized: user=postgres database=test-crash-db application_name=psql
2023-03-15 08:38:45 UTC [7890-1] [unknown]@[unknown],client=127.0.0.1,app=[unknown] LOG: connection received: host=127.0.0.1 port=56835
2023-03-15 08:38:46 UTC [7892-1] [unknown]@[unknown],client=127.0.0.1,app=[unknown] LOG: connection received: host=127.0.0.1 port=62508
2023-03-15 08:38:51 UTC [7895-1] [unknown]@[unknown],client=127.0.0.1,app=[unknown] LOG: connection received: host=127.0.0.1 port=57019
2023-03-15 08:38:52 UTC [7897-1] [unknown]@[unknown],client=127.0.0.1,app=[unknown] LOG: connection received: host=127.0.0.1 port=62623
2023-03-15 08:38:53 UTC [7219-1] FATAL: terminating background worker "TimescaleDB Background Worker Scheduler" due to administrator command
2023-03-15 08:38:53 UTC [7208-6] LOG: background worker "TimescaleDB Background Worker Scheduler" (PID 7219) exited with exit code 1
2023-03-15 08:38:53 UTC [7866-1] FATAL: terminating background worker "Compression Policy [1000]" due to administrator command
2023-03-15 08:38:53 UTC [7866-2] CONTEXT: SQL statement "SELECT public.compress_chunk(chunk_rec.oid)"
PL/pgSQL function _timescaledb_internal.policy_compression_execute(integer,integer,anyelement,integer,boolean,boolean) line 60 at PERFORM
SQL statement "CALL _timescaledb_internal.policy_compression_execute(
job_id, htid, lag_value::INTERVAL,
maxchunks, verbose_log, recompress_enabled
)"
PL/pgSQL function _timescaledb_internal.policy_compression(integer,jsonb) line 51 at CALL
2023-03-15 08:38:53 UTC [7208-7] LOG: background worker "Compression Policy [1000]" (PID 7866) exited with exit code 1
2023-03-15 08:38:53 UTC [7898-1] FATAL: terminating background worker "TimescaleDB Background Worker Scheduler" due to administrator command
2023-03-15 08:38:53 UTC [7208-8] LOG: background worker "TimescaleDB Background Worker Scheduler" (PID 7898) exited with exit code 1
2023-03-15 08:38:53 UTC [7882-4] postgres@test-crash-db,client=[local],app=psql LOG: duration: 936.297 ms statement: ALTER EXTENSION timescaledb UPDATE;
2023-03-15 08:38:54 UTC [7232-5] postgres@test-crash-db,client=172.16.160.135,app=pgAdmin 4 - DB:test-crash-db FATAL: "timescaledb" already loaded with a different version
2023-03-15 08:38:54 UTC [7232-6] postgres@test-crash-db,client=172.16.160.135,app=pgAdmin 4 - DB:test-crash-db DETAIL: The new version is "2.10.0", this session is using version "2.9.3". The session will be restarted.
But afterwards it did not resume on its own and it appears to be broken. When I had done the update on my original database, it had restarted on its own. I couldn't get it to restart and eventually dropped the compression policy and recreated it... The new job ran and did not replicate the state of this problem, but maybe if you try it like I did, you'll end up with my initial issue!
The queries causing the crash are the same as the ones I give at the end of the steps below, and the table structures of my database are almost the same as the ones in my example (I just lightened them up and used shorter names).
Here is the way to reproduce this behavior:
create a new database with timescaledb 2.9.3
CREATE DATABASE "test-crash-db";
-- connect to the new database first (e.g. \c "test-crash-db"), then:
CREATE EXTENSION timescaledb VERSION '2.9.3';
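(Not part of the original steps: a quick way to double-check which extension version the new database is actually running, using the standard PostgreSQL catalog.)
-- verify the installed extension version in the current database
SELECT extname, extversion FROM pg_extension WHERE extname = 'timescaledb';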
create tables
CREATE TABLE public.measures (
measure_date TIMESTAMP NOT NULL,
received_date TIMESTAMP NOT NULL,
value DOUBLE PRECISION,
meter_id TEXT NOT NULL,
source SMALLINT NOT NULL,
device_type SMALLINT NULL,
device_id TEXT NULL
);
SELECT create_hypertable('measures', 'measure_date', chunk_time_interval => INTERVAL '5 DAYS', create_default_indexes => false);
ALTER TABLE measures SET (
timescaledb.compress,
timescaledb.compress_segmentby = 'meter_id'
);
SELECT add_compression_policy(hypertable => 'measures', compress_after => INTERVAL '14 DAYS', if_not_exists => true);
CREATE INDEX ix_measures_meter_id_measure_date_received_date ON public.measures (meter_id, measure_date DESC, received_date DESC);
CREATE TABLE IF NOT EXISTS public.meter(
id TEXT,
date_install TIMESTAMP,
serial_number TEXT,
PRIMARY KEY(id)
);
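(A side note, not from the original steps: the resulting setup can be sanity-checked with the usual TimescaleDB informational views, assuming they are available in your installation.)
-- confirm the hypertable, its compression settings and the compression policy job
SELECT * FROM timescaledb_information.hypertables WHERE hypertable_name = 'measures';
SELECT * FROM timescaledb_information.compression_settings WHERE hypertable_name = 'measures';
SELECT * FROM timescaledb_information.jobs WHERE proc_name = 'policy_compression';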
create proc to generate measures
CREATE OR REPLACE PROCEDURE public.load_measures (IN gs_start INTEGER, IN gs_end INTEGER) AS
$BODY$
BEGIN
WITH RECURSIVE some_days AS (
SELECT '2023-01-01'::TIMESTAMP AS rdate
UNION ALL
SELECT rdate + INTERVAL '1 DAY'
FROM some_days
WHERE extract(YEAR FROM rdate + INTERVAL '1 DAY') = 2023 AND extract(MONTH FROM rdate + INTERVAL '1 DAY') <= 2
)
INSERT INTO measures(measure_date, received_date, value, meter_id, source, device_type, device_id)
SELECT
d.rdate + (floor(random() * 23) || ' HOURS')::INTERVAL AS measure_date,
current_timestamp::TIMESTAMP AS received_date,
floor(random() * 10000 + 1000) AS value,
CAST(gs.id as TEXT) AS meter_id,
5 AS source, 42 AS device_type, 'ABCDEF' AS device_id
FROM (SELECT generate_series(gs_start, gs_end) AS id) gs
INNER JOIN lateral (SELECT * FROM some_days) d ON true;
END;
$BODY$
LANGUAGE PLPGSQL;
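(Optional, not in the original steps: the procedure can be smoke-tested on a tiny range before the full ~13 minute load; the values below are arbitrary.)
-- quick test: two months of data for only 10 meters, then clean up
CALL load_measures(gs_start => 1, gs_end => 10);
SELECT count(*) FROM measures;
TRUNCATE measures; -- start clean before the real load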
load some data (82000000 measures / 2000000 meters ~13min)
INSERT INTO meter(id, date_install, serial_number)
SELECT
CAST(gs.id AS TEXT) AS id,
'2022-01-01'::TIMESTAMP + (floor(random() * 364) || ' DAYS')::INTERVAL AS date_install,
floor(random() * 10000000 + 1000000) AS serial_number
FROM (SELECT generate_series(1000000,2999999) AS id) gs;
CALL load_measures(gs_start => 1000000, gs_end => 2999999);
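(A quick sanity check after the load, using standard TimescaleDB helpers; not part of the original steps.)
-- total rows inserted and the 5-day chunks created by the load
SELECT count(*) FROM measures;
SELECT show_chunks('measures');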
wait for the AUTO VACUUM to end
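(One way to tell whether autovacuum is still running, using plain PostgreSQL statistics views; not part of the original steps.)
-- any running autovacuum workers and their progress
SELECT p.pid, a.query, p.relid::regclass AS table_being_vacuumed, p.phase
FROM pg_stat_progress_vacuum p
JOIN pg_stat_activity a USING (pid);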
trigger the compression job (and wait until it ends)
DO
$BODY$
DECLARE
job_id INT;
BEGIN
SELECT job.id INTO job_id
FROM _timescaledb_config.bgw_job job
INNER JOIN _timescaledb_catalog.hypertable hypertable ON hypertable.id = job.hypertable_id
WHERE hypertable.table_name = 'measures';
PERFORM alter_job(job_id, next_start=> current_timestamp);
END;
$BODY$;
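(To see when the policy run has finished and which chunks ended up compressed, something like the following should work, assuming the standard informational views and functions.)
-- last/next run of the policy and per-chunk compression status
SELECT job_id, last_run_status, last_run_started_at, next_start
FROM timescaledb_information.job_stats;
SELECT chunk_name, compression_status FROM chunk_compression_stats('measures');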
wait for the AUTO VACUUM to end
Load some more data into the compressed chunks (wait for the AUTO VACUUM triggered at the end to finish.)
CALL load_measures(gs_start => 1000000, gs_end => 1199999);
trigger the compression job again to compress the new data, and while it runs (wait a few seconds to be sure it is in the middle of processing data), open a new connection and update timescaledb to 2.10.0
DO
$BODY$
DECLARE
job_id INT;
BEGIN
SELECT job.id INTO job_id
FROM _timescaledb_config.bgw_job job
INNER JOIN _timescaledb_catalog.hypertable hypertable ON hypertable.id = job.hypertable_id
WHERE hypertable.table_name = 'measures';
PERFORM alter_job(job_id, next_start=> current_timestamp);
END;
$BODY$;
-- from another connection, for example: psql -X -d test-crash-db
ALTER EXTENSION timescaledb UPDATE;
From there, it depends on whether the job crashes and whether it restarts on its own...
sample queries which make the database crash (to check the issue)
SELECT
time_bucket('1 MONTH', measures.measure_date) AS measure_day,
AVG(measures."value") AS average,
count(*) AS nb
FROM measures
INNER JOIN meter ON meter.id = measures.meter_id
WHERE meter.id = '1134567' AND measures.measure_date > '2023-01-05' AND measures.measure_date < '2023-02-25'
GROUP BY measure_day;
SELECT measures.*
FROM measures
INNER JOIN meter ON meter.id = measures.meter_id
WHERE meter.id = '1134567' AND (measures.measure_date <= '2023-01-05' OR measures.measure_date >= '2023-02-25')
ORDER BY measure_date DESC;
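(A side note, not from the original report: plain EXPLAIN, without ANALYZE, only plans the query and does not execute it, so the plan of these statements can usually still be captured even when running them crashes the backend.)
-- capture the plan without executing the query
EXPLAIN
SELECT measures.*
FROM measures
INNER JOIN meter ON meter.id = measures.meter_id
WHERE meter.id = '1134567' AND (measures.measure_date <= '2023-01-05' OR measures.measure_date >= '2023-02-25')
ORDER BY measure_date DESC;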
Hi @mickael-choisnet,
It is possible that the indexes of the measures table got corrupted during the compression job crash. Can you try rebuilding the indexes and see if that fixes your issue?
You can use the following command to rebuild the indexes :
REINDEX TABLE measures;
Hi @lkshminarayanan,
Unfortunately, the original database where the issue appeared has been rolled back to timescaledb 2.9.3, and that solved the issue. I think it is unlikely that the index was involved, because at the time I tried to drop my hypertables and create new ones, and the issue was still there with these new hypertables.
@mickael-choisnet I've tried to reproduce the issue based on your instructions - no luck so far; I've tried pg14 and pg15.
The only thing I didn't explicitly do is wait for the vacuum to finish - do you already know that this issue doesn't happen if a vacuum is still running?
I was re-reading your instructions, and something caught my eye:
From there, it depends on whether the job crashes and whether it restarts on its own...
Could you tell me a bit more about this? I haven't seen it restart after being stopped during the upgrade.
Hello @mickael-choisnet,
Thank you for all your help in debugging this issue. Unfortunately we haven't been able to reproduce the segfault even with the provided steps. If you are able to consistently reproduce the issue, we have a couple of suggestions that we would like to ask you to try if possible to help us better understand the situation:
Could you please share the plan output for the failing queries (EXPLAIN query) in version 2.10 and in version 2.9.3 where you do not see the issue?
Also, as mentioned before, we think the state of the index might be the reason you are experiencing the segfault, so we'd like to rule out that possibility if we can. One idea we'd like to test would be to completely disable index scans, forcing sequential scans instead, to see if that would make the issue go away in 2.10 (setting enable_indexscan = false).
And lastly, to rule out that possibility, trying the failing queries again after rebuilding the index.
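For reference, a minimal sketch of what the last two suggestions could look like in a psql session (session-local setting, table name taken from the repro steps above):
-- force sequential scans for this session only, then re-run the failing query
SET enable_indexscan = off;
-- ... run the failing SELECT here ...
RESET enable_indexscan;
-- rebuild the indexes on the hypertable (as discussed further below, this
-- does not touch the indexes of the compressed chunks themselves)
REINDEX TABLE measures;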
We will continue to investigate, but as we haven’t managed to reproduce the issue yet, if it's possible for you to provide the above that would be a great help. Thank you!
Hi guys,
Thank you for trying so hard to find and solve this issue!
@kgyrtkirk
do you already know that this issue doesn't happen if a vacuum is still running?
No, this was just to be in the same conditions my database was in when the issue happened.
I cannot tell whether it is relevant to the use case or not... Just that there was no VACUUM running according to the logs, and the VACUUM commands happen after the job completes.
could you tell me a bit more about this? I haven't seen it restarting after its being stopped during the upgrade
It's just that, since I have been unable to reproduce the bug, I suppose it is somehow linked to the fact that in the original scenario the job restarted on its own. But that is a wild guess and may well be wrong.
@konskov
I found out that I still had a backup of the database, light enough to be restored on my testing environment. After restoring this backup (it is a backup from Monday the 27th, so the database was already downgraded to timescaledb 2.9.3), I upgraded timescaledb back to 2.10.0, tried what you suggested, and here are the results.
SELECT ix.*
FROM indexes ix
INNER JOIN patrimoine_appareil p ON p.id = ix.appareil_id
WHERE ix.date_mesure > '2023-01-01'
LIMIT 100;
The query still makes the database crash with a segmentation fault error (only with 2.10.0; with timescaledb 2.9.3 everything works fine).
The query does not make the database crash if I set enable_indexscan to false for my SQL session!
So it seems to be the index, as you suggested.
But I tried to REINDEX both tables (indexes and patrimoine_appareil) and the query still crashes the database when enable_indexscan is on...
@lkshminarayanan seems like it was a good hint to suspect indexes but rebuilding them does not fix the issue...
'Limit (cost=0.87..16.02 rows=100 width=53)'
' -> Nested Loop (cost=0.87..14294117.48 rows=94392067 width=53)'
' -> Append (cost=0.44..792549.64 rows=29804155 width=53)'
' -> Custom Scan (DecompressChunk) on _hyper_28_28_chunk ix_1 (cost=0.44..0.44 rows=1000 width=54)'
' Filter: ((date_mesure > ''2023-01-01 00:00:00''::timestamp without time zone) AND (id = appareil_id))'
' -> Index Scan using compress_hyper_29_62_chunk__compressed_hypertable_29_appareil_i on compress_hyper_29_62_chunk (cost=0.42..0.44 rows=1 width=287)'
' Index Cond: (appareil_id = id)'
' Filter: (_ts_meta_max_1 > ''2023-01-01 00:00:00''::timestamp without time zone)'
' -> Seq Scan on _hyper_28_28_chunk ix_1 (cost=0.00..1449.60 rows=530 width=54)'
' Filter: (date_mesure > ''2023-01-01 00:00:00''::timestamp without time zone)'
' -> Custom Scan (DecompressChunk) on _hyper_28_30_chunk ix_2 (cost=0.03..0.70 rows=21000 width=70)'
' Filter: ((date_mesure > ''2023-01-01 00:00:00''::timestamp without time zone) AND (id = appareil_id))'
' -> Index Scan using compress_hyper_29_64_chunk__compressed_hypertable_29_appareil_i on compress_hyper_29_64_chunk (cost=0.28..0.70 rows=21 width=293)'
' Index Cond: (appareil_id = id)'
' Filter: (_ts_meta_max_1 > ''2023-01-01 00:00:00''::timestamp without time zone)'
' -> Seq Scan on _hyper_28_30_chunk ix_2 (cost=0.00..7.40 rows=272 width=70)'
' Filter: (date_mesure > ''2023-01-01 00:00:00''::timestamp without time zone)'
' -> Custom Scan (DecompressChunk) on _hyper_28_31_chunk ix_3 (cost=0.44..0.44 rows=1000 width=55)'
' Filter: ((date_mesure > ''2023-01-01 00:00:00''::timestamp without time zone) AND (id = appareil_id))'
' -> Index Scan using compress_hyper_29_65_chunk__compressed_hypertable_29_appareil_i on compress_hyper_29_65_chunk (cost=0.41..0.44 rows=1 width=288)'
' Index Cond: (appareil_id = id)'
' Filter: (_ts_meta_max_1 > ''2023-01-01 00:00:00''::timestamp without time zone)'
' -> Seq Scan on _hyper_28_31_chunk ix_3 (cost=0.00..684.58 rows=28286 width=55)'
' Filter: (date_mesure > ''2023-01-01 00:00:00''::timestamp without time zone)'
' -> Seq Scan on _hyper_28_32_chunk ix_4 (cost=0.00..6163.70 rows=268536 width=54)'
' Filter: (date_mesure > ''2023-01-01 00:00:00''::timestamp without time zone)'
' -> Custom Scan (DecompressChunk) on _hyper_28_33_chunk ix_5 (cost=0.07..41.02 rows=562000 width=66)'
' Filter: (date_mesure > ''2023-01-01 00:00:00''::timestamp without time zone)'
' -> Seq Scan on compress_hyper_29_1086_chunk (cost=0.00..41.02 rows=562 width=293)'
' Filter: (_ts_meta_max_1 > ''2023-01-01 00:00:00''::timestamp without time zone)'
' -> Seq Scan on _hyper_28_34_chunk ix_6 (cost=0.00..342956.53 rows=15615748 width=53)'
' Filter: (date_mesure > ''2023-01-01 00:00:00''::timestamp without time zone)'
' -> Seq Scan on _hyper_28_945_chunk ix_7 (cost=0.00..292224.46 rows=13305783 width=53)'
' Filter: (date_mesure > ''2023-01-01 00:00:00''::timestamp without time zone)'
' -> Index Only Scan using patrimoine_appareil_pkey on patrimoine_appareil p (cost=0.43..0.45 rows=1 width=10)'
' Index Cond: (id = ix_1.appareil_id)'
'Limit (cost=4.17..160.99 rows=100 width=53)'
' -> Nested Loop (cost=4.17..148023900.52 rows=94392067 width=53)'
' -> Append (cost=3.73..1127295.83 rows=94421155 width=53)'
' -> Custom Scan (DecompressChunk) on _hyper_28_28_chunk ix_1 (cost=3.73..7317.27 rows=1961000 width=54)'
' Filter: (date_mesure > ''2023-01-01 00:00:00''::timestamp without time zone)'
' -> Seq Scan on compress_hyper_29_62_chunk (cost=0.00..7317.27 rows=1961 width=287)'
' Filter: (_ts_meta_max_1 > ''2023-01-01 00:00:00''::timestamp without time zone)'
' -> Seq Scan on _hyper_28_28_chunk ix_1 (cost=0.00..1449.60 rows=530 width=54)'
' Filter: (date_mesure > ''2023-01-01 00:00:00''::timestamp without time zone)'
' -> Custom Scan (DecompressChunk) on _hyper_28_30_chunk ix_2 (cost=0.06..445.93 rows=7434000 width=70)'
' Filter: (date_mesure > ''2023-01-01 00:00:00''::timestamp without time zone)'
' -> Seq Scan on compress_hyper_29_64_chunk (cost=0.00..445.93 rows=7434 width=293)'
' Filter: (_ts_meta_max_1 > ''2023-01-01 00:00:00''::timestamp without time zone)'
' -> Seq Scan on _hyper_28_30_chunk ix_2 (cost=0.00..7.40 rows=272 width=70)'
' Filter: (date_mesure > ''2023-01-01 00:00:00''::timestamp without time zone)'
' -> Custom Scan (DecompressChunk) on _hyper_28_31_chunk ix_3 (cost=0.07..3899.56 rows=55245000 width=55)'
' Filter: (date_mesure > ''2023-01-01 00:00:00''::timestamp without time zone)'
' -> Seq Scan on compress_hyper_29_65_chunk (cost=0.00..3899.56 rows=55245 width=288)'
' Filter: (_ts_meta_max_1 > ''2023-01-01 00:00:00''::timestamp without time zone)'
' -> Seq Scan on _hyper_28_31_chunk ix_3 (cost=0.00..684.58 rows=28286 width=55)'
' Filter: (date_mesure > ''2023-01-01 00:00:00''::timestamp without time zone)'
' -> Seq Scan on _hyper_28_32_chunk ix_4 (cost=0.00..6163.70 rows=268536 width=54)'
' Filter: (date_mesure > ''2023-01-01 00:00:00''::timestamp without time zone)'
' -> Custom Scan (DecompressChunk) on _hyper_28_33_chunk ix_5 (cost=0.07..41.02 rows=562000 width=66)'
' Filter: (date_mesure > ''2023-01-01 00:00:00''::timestamp without time zone)'
' -> Seq Scan on compress_hyper_29_1086_chunk (cost=0.00..41.02 rows=562 width=293)'
' Filter: (_ts_meta_max_1 > ''2023-01-01 00:00:00''::timestamp without time zone)'
' -> Seq Scan on _hyper_28_34_chunk ix_6 (cost=0.00..342956.53 rows=15615748 width=53)'
' Filter: (date_mesure > ''2023-01-01 00:00:00''::timestamp without time zone)'
' -> Seq Scan on _hyper_28_945_chunk ix_7 (cost=0.00..292224.46 rows=13305783 width=53)'
' Filter: (date_mesure > ''2023-01-01 00:00:00''::timestamp without time zone)'
' -> Bitmap Heap Scan on patrimoine_appareil p (cost=0.44..1.56 rows=1 width=10)'
' Recheck Cond: (id = ix_1.appareil_id)'
' -> Bitmap Index Scan on patrimoine_appareil_pkey (cost=0.00..0.44 rows=1 width=0)'
' Index Cond: (id = ix_1.appareil_id)'
'Limit (cost=4.17..50.66 rows=100 width=53)'
' -> Nested Loop (cost=4.17..43885519.73 rows=94392067 width=53)'
' -> Append (cost=3.73..1125008.81 rows=94392067 width=53)'
' -> Custom Scan (DecompressChunk) on _hyper_28_28_chunk ix_1 (cost=3.73..7317.27 rows=1961000 width=54)'
' Filter: (date_mesure > ''2023-01-01 00:00:00''::timestamp without time zone)'
' -> Seq Scan on compress_hyper_29_62_chunk (cost=0.00..7317.27 rows=1961 width=287)'
' Filter: (_ts_meta_max_1 > ''2023-01-01 00:00:00''::timestamp without time zone)'
' -> Custom Scan (DecompressChunk) on _hyper_28_30_chunk ix_2 (cost=0.06..445.93 rows=7434000 width=70)'
' Filter: (date_mesure > ''2023-01-01 00:00:00''::timestamp without time zone)'
' -> Seq Scan on compress_hyper_29_64_chunk (cost=0.00..445.93 rows=7434 width=293)'
' Filter: (_ts_meta_max_1 > ''2023-01-01 00:00:00''::timestamp without time zone)'
' -> Custom Scan (DecompressChunk) on _hyper_28_31_chunk ix_3 (cost=0.07..3899.56 rows=55245000 width=55)'
' Filter: (date_mesure > ''2023-01-01 00:00:00''::timestamp without time zone)'
' -> Seq Scan on compress_hyper_29_65_chunk (cost=0.00..3899.56 rows=55245 width=288)'
' Filter: (_ts_meta_max_1 > ''2023-01-01 00:00:00''::timestamp without time zone)'
' -> Seq Scan on _hyper_28_32_chunk ix_4 (cost=0.00..6163.70 rows=268536 width=54)'
' Filter: (date_mesure > ''2023-01-01 00:00:00''::timestamp without time zone)'
' -> Custom Scan (DecompressChunk) on _hyper_28_33_chunk ix_5 (cost=0.07..41.02 rows=562000 width=66)'
' Filter: (date_mesure > ''2023-01-01 00:00:00''::timestamp without time zone)'
' -> Seq Scan on compress_hyper_29_1086_chunk (cost=0.00..41.02 rows=562 width=293)'
' Filter: (_ts_meta_max_1 > ''2023-01-01 00:00:00''::timestamp without time zone)'
' -> Seq Scan on _hyper_28_34_chunk ix_6 (cost=0.00..342956.53 rows=15615748 width=53)'
' Filter: (date_mesure > ''2023-01-01 00:00:00''::timestamp without time zone)'
' -> Seq Scan on _hyper_28_945_chunk ix_7 (cost=0.00..292224.46 rows=13305783 width=53)'
' Filter: (date_mesure > ''2023-01-01 00:00:00''::timestamp without time zone)'
' -> Index Only Scan using patrimoine_appareil_pkey on patrimoine_appareil p (cost=0.43..0.45 rows=1 width=10)'
' Index Cond: (id = ix_1.appareil_id)'
'Limit (cost=4.17..160.94 rows=100 width=53)'
' -> Nested Loop (cost=4.17..147976359.57 rows=94392067 width=53)'
' -> Append (cost=3.73..1125008.81 rows=94392067 width=53)'
' -> Custom Scan (DecompressChunk) on _hyper_28_28_chunk ix_1 (cost=3.73..7317.27 rows=1961000 width=54)'
' Filter: (date_mesure > ''2023-01-01 00:00:00''::timestamp without time zone)'
' -> Seq Scan on compress_hyper_29_62_chunk (cost=0.00..7317.27 rows=1961 width=287)'
' Filter: (_ts_meta_max_1 > ''2023-01-01 00:00:00''::timestamp without time zone)'
' -> Custom Scan (DecompressChunk) on _hyper_28_30_chunk ix_2 (cost=0.06..445.93 rows=7434000 width=70)'
' Filter: (date_mesure > ''2023-01-01 00:00:00''::timestamp without time zone)'
' -> Seq Scan on compress_hyper_29_64_chunk (cost=0.00..445.93 rows=7434 width=293)'
' Filter: (_ts_meta_max_1 > ''2023-01-01 00:00:00''::timestamp without time zone)'
' -> Custom Scan (DecompressChunk) on _hyper_28_31_chunk ix_3 (cost=0.07..3899.56 rows=55245000 width=55)'
' Filter: (date_mesure > ''2023-01-01 00:00:00''::timestamp without time zone)'
' -> Seq Scan on compress_hyper_29_65_chunk (cost=0.00..3899.56 rows=55245 width=288)'
' Filter: (_ts_meta_max_1 > ''2023-01-01 00:00:00''::timestamp without time zone)'
' -> Seq Scan on _hyper_28_32_chunk ix_4 (cost=0.00..6163.70 rows=268536 width=54)'
' Filter: (date_mesure > ''2023-01-01 00:00:00''::timestamp without time zone)'
' -> Custom Scan (DecompressChunk) on _hyper_28_33_chunk ix_5 (cost=0.07..41.02 rows=562000 width=66)'
' Filter: (date_mesure > ''2023-01-01 00:00:00''::timestamp without time zone)'
' -> Seq Scan on compress_hyper_29_1086_chunk (cost=0.00..41.02 rows=562 width=293)'
' Filter: (_ts_meta_max_1 > ''2023-01-01 00:00:00''::timestamp without time zone)'
' -> Seq Scan on _hyper_28_34_chunk ix_6 (cost=0.00..342956.53 rows=15615748 width=53)'
' Filter: (date_mesure > ''2023-01-01 00:00:00''::timestamp without time zone)'
' -> Seq Scan on _hyper_28_945_chunk ix_7 (cost=0.00..292224.46 rows=13305783 width=53)'
' Filter: (date_mesure > ''2023-01-01 00:00:00''::timestamp without time zone)'
' -> Bitmap Heap Scan on patrimoine_appareil p (cost=0.44..1.56 rows=1 width=10)'
' Recheck Cond: (id = ix_1.appareil_id)'
' -> Bitmap Index Scan on patrimoine_appareil_pkey (cost=0.00..0.44 rows=1 width=0)'
' Index Cond: (id = ix_1.appareil_id)'
Mind you, I am not sure whether we can rely on these results. This morning, after starting my testing environment with the database I restored yesterday (producing the results I gave you), I tried my query again, and it works fine (with timescaledb 2.10.0). I don't know if there are some background tasks that fixed this, or if the backup is too recent considering the initial issue occurred 2 weeks ago. Or maybe downgrading to 2.9.3 and back to 2.10.0 does not put my database in the exact same state it was in when I first encountered this issue. But the first time, my database stayed in this broken state for several days, hence the downgrade to try to fix it (with success) so I could use it again...
Thank you very much for your continued assistance in debugging this @mickael-choisnet.
I would say the evidence points towards the state of the index being responsible for the segfault (when index scans are disabled on v2.10 the query returns the expected results, and the explain output for v2.9.3 shows that no index scan is performed on the compressed chunks; a seq scan is used instead…)
I don’t know if you are still able to try out a few things for this issue... but if you are:
If the reindex command was applied on the hypertable (measures in the example you shared) it would not affect the compressed chunks and their indexes, because these are "children" of the compressed hypertable that is associated with the original hypertable, but is a different table itself. So indeed if reindex was performed on the original hypertable, it would not make a difference.
In this case the index we would like to rebuild is on a compressed chunk, so the way to rebuild it would be either to run REINDEX TABLE on a particular compressed chunk, which would look something like REINDEX TABLE _timescaledb_internal.compress_hyper_29_1086_chunk, or to call REINDEX on the parent hypertable of the compressed chunks, which would reindex all of them. That command would look like REINDEX TABLE _timescaledb_internal._compressed_hypertable_29.
Alternatively, decompressing and recompressing the chunks (e.g. with recompress_chunk) should rebuild their indexes as well, since that creates a new compressed chunk. Would it be possible to try the above and see if the issue is fixed (after enabling index scans again)? Thank you!
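If it helps, the compressed chunks behind a given hypertable can be listed from the internal catalog, along these lines (this is the internal catalog, not a stable API, so treat it as a sketch); decompressing/recompressing could then be done per chunk on the corresponding uncompressed chunks:
-- list the compressed chunks behind the 'indexes' hypertable, e.g. to REINDEX them one by one
SELECT format('%I.%I', c.schema_name, c.table_name) AS compressed_chunk
FROM _timescaledb_catalog.hypertable h
JOIN _timescaledb_catalog.hypertable ch ON ch.id = h.compressed_hypertable_id
JOIN _timescaledb_catalog.chunk c ON c.hypertable_id = ch.id
WHERE h.table_name = 'indexes';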
Hello @konskov,
Sorry for the late response. I didn't have time before today to check what you wanted.
I restored my backup to test what you suggested, but unfortunately it didn't work.
To be precise, I followed these steps:
Reindex all the involved tables
-- First time only these 3
REINDEX TABLE public.patrimoine_appareil;
REINDEX TABLE public."indexes";
REINDEX TABLE _timescaledb_internal._compressed_hypertable_29;
-- but to be sure I've done it again, targeting every chunk (compressed or not) > 2023-01-01 (on top of the 3 first instructions)
REINDEX TABLE _timescaledb_internal._hyper_28_945_chunk;
REINDEX TABLE _timescaledb_internal._hyper_28_34_chunk;
REINDEX TABLE _timescaledb_internal._hyper_28_32_chunk;
REINDEX TABLE _timescaledb_internal._hyper_28_33_chunk;
REINDEX TABLE _timescaledb_internal._hyper_28_31_chunk;
REINDEX TABLE _timescaledb_internal._hyper_28_30_chunk;
REINDEX TABLE _timescaledb_internal._hyper_28_28_chunk;
REINDEX TABLE _timescaledb_internal.compress_hyper_29_1265_chunk;
REINDEX TABLE _timescaledb_internal.compress_hyper_29_1086_chunk;
REINDEX TABLE _timescaledb_internal.compress_hyper_29_65_chunk;
REINDEX TABLE _timescaledb_internal.compress_hyper_29_64_chunk;
REINDEX TABLE _timescaledb_internal.compress_hyper_29_62_chunk;
I even tried to VACUUM ANALYZE all the tables/chunks involved after the REINDEX TABLE, to be sure.
Hello,
It appears that decompressing/recompressing all the chunks fixes the issue for a while, but eventually it comes back.
It came to my attention last week that our development database had the same issue. We have less data in this database, so I tried the decompression/compression. The issue came back a few days later (it can still be fixed with decompression/compression).
I tried to upgrade timescaledb to 2.10.1, but it does not fix it.
On Thursday, I had to push a migration that dropped all my hypertables and recreated them (same names but new tables). I loaded them and everything was fine, but on Friday the issue came back...
Is it possible I damaged my cluster somehow when upgrading the timescaledb extension? I am pretty sure that at some point I tried to drop and recreate the database and got the issue again afterward...
Hi there.
Great news! I was able to reproduce your issue and indeed, a fix for the issue has already landed on our main branch. We should be releasing a new version (2.11) soon that should include that fix.
Thank you for reporting this!
Thank you so much @antekresic
I look forward to testing that on my environment.
Just a heads up, to correct myself: the fix has landed in 2.10.3, which has already been released.
Happy testing and have a nice weekend! :)
Closing since fix is out.
What type of bug is this?
Crash
What subsystems and features are affected?
Compression
What happened?
I updated the timescaledb extension from 2.9.3 to 2.10.0 on my VM with apt install, and then updated the extension on my database with the ALTER EXTENSION timescaledb UPDATE command while the compression jobs were in progress (I didn't check before and didn't know they were running). The jobs crashed with a FATAL trace due to the upgrade. The jobs resumed their runs after creating a new session to use the new version of the timescaledb extension. I have traces of successful queries targeting hypertables with joins during the process. After the jobs finished, almost all queries targeting a hypertable with compressed chunks and with a join on another table crash the database with a 'segmentation fault' error and put it into recovery mode.
I couldn't reproduce this behavior with a new database and some fake data, but it wasn't until this morning that I found out that the update was done while the jobs were in progress, so I'll try again.
Here is a backtrace of the segfault error when I try a simple select with a join on a hypertable:
I'll leave the postgres logs showing the jobs crashing below
TimescaleDB version affected
2.10.0
PostgreSQL version used
PostgreSQL 14.5 (Debian 14.5-1.pgdg100+1)
What operating system did you use?
Debian GNU/Linux 10 (buster) (10.13)
What installation method did you use?
Deb/Apt
What platform did you run on?
Microsoft Azure Cloud
Relevant log output and stack trace
How can we reproduce the bug?