akamensky closed this issue 4 years ago
Can you please add a stack trace to the issue and show the call to drop_chunks
that you use?
@mkindahl I would provide a stacktrace in the top message if the timescale rpms came with debug symbols (i.e. in a “debug” package). However, there are none. The top message already contains all of what you’d see without debug symbols.
FWIW we “fixed” the issue by dropping tables and re-creating them one by one (checking whether drop_chunks works after every table). Though if that starts happening in our production environment, that would be 100% unacceptable.
It shouldn't be necessary to fix issues in this way, so we should fix it.
Looking into whether we can add debug symbols to our packages, but do you have the call to drop_chunks
that you used and the definition of the hypertable/continuous aggregate that you applied drop_chunks
to? That would at least reduce the amount of code I have to go through.
Drop chunks is done via:
00 */1 * * 1-5 postgres out=$(psql -U postgres -d db_name -c "SELECT drop_chunks(interval '1 hours');" 2>&1); if [[ "$?" -ne "0" ]]; then echo "$out"; fi;
Note that all the out=...
etc. is just so we only get emails from cron when a failure happens.
We don't call drop_chunks on specific hypertables. But this worked fine pre-upgrade.
As for table definitions -- they are considered sensitive information that I cannot publish here; an example of one table with the necessary redaction can be found at #1841.
@akamensky Thank you. I'll see if I can find anything given the information we have.
@mkindahl we possibly found another pre-requisite to triggering this issue. In our setup we write data to TimescaleDB from KafkaConnect cluster (multiple sinks writing data to different tables across multiple databases on the same instance).
I've noticed that the initial PID where the segfault happens is that of the connection from kafkaconnect to the busiest database (unfortunately we don't know which table yet). That DB holds 90% of all data in TimescaleDB.
Following this, we attempted to shut down all kafkaconnect processes and this appears to have resolved the segfaults (at least none have happened since we started shutting down the kafkaconnect processes before calling drop_chunks).
This obviously is not a good solution as kafkaconnect is expected to maintain near real-time data writes to database.
@akamensky Thank you, that is useful information.
Summary of findings thus far, working backwards from the crash.
The crash occurs inside cmp_slices_by_dimension_id, likely because a null pointer is dereferenced. Running drop_chunks uses only the call path below to reach cmp_slices_by_dimension_id:
cmp_slices_by_dimension_id
pg_qsort
ts_hypercube_slice_sort
ts_hypercube_from_constraints
chunk_build_from_tuple_and_stub
chunk_tuple_found
ts_scanner_scan
chunk_create_from_stub
chunk_scan_context_add_chunk
chunk_scan_ctx_foreach_chunk_stub
ts_chunk_get_chunks_in_time_range
ts_chunk_do_drop_chunks
ts_chunk_drop_chunks
. . .
Inside ts_hypercube_from_constraints, each dimension slice is looked up in the metadata cache using ts_dimension_slice_scan_by_id and the slices are added to a hypercube. A null pointer can be stored in the hypercube if cc->fd.dimension_slice_id is not present in the metadata cache when calling ts_dimension_slice_scan_by_id.
for (i = 0; i < constraints->num_constraints; i++)
{
    ChunkConstraint *cc = chunk_constraints_get(constraints, i);

    if (is_dimension_constraint(cc))
    {
        DimensionSlice *slice;

        Assert(hc->num_slices < constraints->num_dimension_constraints);
        slice = ts_dimension_slice_scan_by_id(cc->fd.dimension_slice_id, mctx);
        Assert(slice != NULL && hc->num_slices < hc->capacity);
        hc->slices[hc->num_slices++] = slice;
    }
}
The return value is asserted in debug builds, but not in release builds. Setting slice to NULL just after the assert in the code above (using a debugger) and continuing the run will indeed generate a segmentation fault.
A race condition between the insert path and the drop chunks path could generate such a situation if a chunk constraint is added with either a NULL dimension_slice_id or a tentative (or invalid) dimension_slice_id before the slice is added to the dimension_slice table, but the locking order looks correct for both the insert path and the drop_chunks path.
After some discussions with @erimatnor we discovered the following, which indicates why the dimension slice for a chunk constraint cannot be found in the dimension slice table.
When a new chunk is created as part of an insert, new metadata for the chunk is created using chunk_create_metadata_after_lock. The function uses ts_dimension_slice_insert_multi to look for existing dimension slices, but does not keep the lock on the slices it read. It then adds the new constraint to the chunk_constraint table. If the dimension slice existed, it will assume that the slice is still there after the chunk constraint is added.
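To make the interleaving concrete, here is a minimal Python model of the race. The function names echo the C functions above, but the code is purely illustrative (a toy in-memory "catalog"), not TimescaleDB's implementation:

```python
# Illustrative model of the metadata race described above: the insert path
# checks that a dimension slice exists, does not keep its lock, and only
# later writes the chunk constraint. A concurrent drop can slip in between,
# leaving a constraint that points at a deleted slice.

dimension_slice = {1}        # slice ids present in the catalog
chunk_constraint = set()     # (chunk_id, slice_id) rows

def insert_path_step1(slice_id):
    # chunk_create_metadata_after_lock: sees the slice, lock not kept
    return slice_id in dimension_slice

def drop_path(slice_id):
    # chunk_tuple_delete: no constraint references the slice, so drop it
    if not any(sid == slice_id for _, sid in chunk_constraint):
        dimension_slice.discard(slice_id)

def insert_path_step2(chunk_id, slice_id):
    # the constraint is added assuming the slice is still there
    chunk_constraint.add((chunk_id, slice_id))

assert insert_path_step1(1)   # insert sees slice 1
drop_path(1)                  # concurrent drop removes the "unused" slice 1
insert_path_step2(42, 1)      # insert writes a now-dangling constraint

# Orphaned constraints, i.e. the condition the LEFT JOIN queries detect:
orphans = [c for c in chunk_constraint if c[1] not in dimension_slice]
print(orphans)  # → [(42, 1)]
```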
To see dimension slices that are available for a chunk constraint:
postgres=# SELECT chunk_id, dimension_slice_id, constraint_name, range_start, range_end
postgres-# FROM _timescaledb_catalog.chunk_constraint LEFT JOIN
postgres-# _timescaledb_catalog.dimension_slice sl
postgres-# ON dimension_slice_id = sl.id;
-[ RECORD 1 ]------+---------------------
chunk_id | 1
dimension_slice_id | 1
constraint_name | constraint_1
range_start | 1577923200000000
range_end | 1578528000000000
-[ RECORD 2 ]------+---------------------
chunk_id | 1
dimension_slice_id | 2
constraint_name | constraint_2
range_start | -9223372036854775808
range_end | 1073741823
For drop_chunks, the function chunk_tuple_delete is used to remove a chunk tuple from the chunk table after the actual chunk is removed. Prior to actually removing the tuple, chunk_tuple_delete scans the chunk_constraint table for each dimension slice (using the function ts_chunk_constraint_scan_by_dimension_slice_id) to see if there are any chunk constraints that use it. If not, the dimension slice is removed. Note that it does not read the dimension_slice table at this stage and just scans the chunk_constraint table using an AccessShareLock.
As a result, if chunk_tuple_delete runs just after ts_dimension_slice_insert_multi (in chunk_create_metadata_after_lock), it will conclude that there are no chunk constraints that reference the dimension slice and remove it. When chunk_create_metadata_after_lock continues, it will add the new constraint to chunk_constraint and thereby create invalid metadata, since there will be chunk constraints referencing dimension slices that do not exist.
By starting two connections to a server, attaching a debugger to one, and setting a breakpoint just after the chunk has been created (on return of hypertable_get_chunk, for example), you can simulate the race condition. Then call drop_chunks to remove the newly inserted chunk. If you have hit the race condition, the output will then look like this:
postgres=# SELECT chunk_id, dimension_slice_id, constraint_name, range_start, range_end
postgres-# FROM _timescaledb_catalog.chunk_constraint LEFT JOIN
postgres-# _timescaledb_catalog.dimension_slice sl
postgres-# ON dimension_slice_id = sl.id;
-[ RECORD 1 ]------+---------------------
chunk_id | 1
dimension_slice_id | 1
constraint_name | constraint_1
range_start | [NULL]
range_end | [NULL]
-[ RECORD 2 ]------+---------------------
chunk_id | 1
dimension_slice_id | 2
constraint_name | constraint_2
range_start | -9223372036854775808
range_end | 1073741823
In other words, there is a chunk constraint referring to a dimension slice that does not exist.
After this, any statement that reads the chunk constraints and tries to read the associated dimension slice will not find it.
@akamensky We have a patch that we think fixes the issue but it would be very good to verify that it indeed fixes your specific situation. Would it be possible for you to test this?
@mkindahl could you perhaps add compiled binaries for CentOS 7 amd64? I don’t know the details of your build environment that produces the rpm binaries you distribute. Or at least concise build instructions, though I'm not sure when I’d find time to set up a build environment.
To check if there are any chunk constraints that are missing their dimension slices, run the following query.
SELECT chunk_id, dimension_slice_id, constraint_name
FROM _timescaledb_catalog.chunk_constraint
LEFT JOIN _timescaledb_catalog.dimension_slice sl
ON dimension_slice_id = sl.id
WHERE sl.id is NULL;
If there are any chunk constraints that refer to dimension slice ids that do not exist, you will get a list of the chunk id, the missing dimension slice id, and the constraint name of the chunk constraint that refers to it.
chunk_id | dimension_slice_id | constraint_name
----------+--------------------+-----------------
2 | 1 | constraint_1
(1 row)
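The anti-join above can be exercised outside TimescaleDB as well; this sqlite3 sketch uses a stripped-down, hypothetical version of the two catalog tables to show how LEFT JOIN ... WHERE sl.id IS NULL flags constraints whose slice row is gone:

```python
import sqlite3

# Minimal mock of the two catalog tables: slice 1 is missing, so the
# constraint that references it should be reported as orphaned.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dimension_slice (id INTEGER PRIMARY KEY);
CREATE TABLE chunk_constraint (
    chunk_id INTEGER, dimension_slice_id INTEGER, constraint_name TEXT);
INSERT INTO dimension_slice VALUES (2);
INSERT INTO chunk_constraint VALUES (2, 1, 'constraint_1');  -- orphaned
INSERT INTO chunk_constraint VALUES (2, 2, 'constraint_2');  -- intact
""")

rows = con.execute("""
SELECT chunk_id, dimension_slice_id, constraint_name
FROM chunk_constraint
LEFT JOIN dimension_slice sl ON dimension_slice_id = sl.id
WHERE sl.id IS NULL
""").fetchall()
print(rows)  # → [(2, 1, 'constraint_1')]
```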
@akamensky Investigating if I can get an RPM for you.
Given the missing dimension slices above, it is possible to extract and parse the constraint expressions from pg_constraint; it should then be possible to re-construct the missing dimension slices using the following query:
WITH missing AS (SELECT chunk_id
, dimension_slice_id
, constraint_name
, pg_get_expr(conbin,conrelid) AS constraint_expr
FROM _timescaledb_catalog.chunk_constraint
LEFT JOIN _timescaledb_catalog.dimension_slice sl
ON dimension_slice_id = sl.id
JOIN pg_constraint ON conname = constraint_name
WHERE dimension_slice_id IS NOT NULL
AND sl.id IS NULL),
unparsed AS (SELECT chunk_id
, dimension_slice_id
, constraint_name
, COALESCE(SUBSTRING(constraint_expr, '(\w+)\s*(?:>=|<)'), SUBSTRING(constraint_expr, '"([^"]+)"\s*(?:>=|<)')) AS column_name
, (SELECT SUBSTRING(constraint_expr, $$>=\s*'([\d\s:+-]+)'$$)) AS lower_range
, (SELECT SUBSTRING(constraint_expr, $$<\s*'([\d\s:+-]+)'$$)) AS upper_range
FROM missing)
SELECT dimension_slice_id
, di.id AS dimension_id
, CASE di.column_type
WHEN 'bigint'::regtype THEN lower_range::bigint
WHEN 'timestamptz'::regtype THEN _timescaledb_internal.to_unix_microseconds(lower_range::timestamptz)
WHEN 'timestamp'::regtype THEN _timescaledb_internal.to_unix_microseconds(lower_range::timestamp::timestamptz)
WHEN 'date'::regtype THEN _timescaledb_internal.to_unix_microseconds(lower_range::date::timestamptz)
ELSE NULL
END AS range_start
, CASE di.column_type
WHEN 'bigint'::regtype THEN upper_range::bigint
WHEN 'timestamptz'::regtype THEN _timescaledb_internal.to_unix_microseconds(upper_range::timestamptz)
WHEN 'timestamp'::regtype THEN _timescaledb_internal.to_unix_microseconds(upper_range::timestamp::timestamptz)
WHEN 'date'::regtype THEN _timescaledb_internal.to_unix_microseconds(upper_range::date::timestamptz)
ELSE NULL
END AS range_end
FROM unparsed JOIN _timescaledb_catalog.dimension di USING (column_name)
WHERE column_name IS NOT NULL;
If the result set contains any rows, these are the rows missing from the _timescaledb_catalog.dimension_slice
table.
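The SUBSTRING patterns are the fragile part of this reconstruction. A Python sketch of the same extraction, run on a hypothetical time-dimension constraint expression, shows what they capture (the regexes mirror the SQL patterns, slightly simplified):

```python
import re

# Hypothetical constraint expression, as pg_get_expr might render it for a
# time dimension of a hypertable.
expr = "((\"time\" >= '2020-01-02 00:00:00+00') AND (\"time\" < '2020-01-09 00:00:00+00'))"

# Column name: identifier quoted with double quotes before >= or <.
column = re.search(r'"([^"]+)"\s*(?:>=|<)', expr)
# Lower and upper bounds: quoted literal after >= and <, respectively.
lower = re.search(r">=\s*'([\d\s:+-]+)'", expr)
upper = re.search(r"<\s*'([\d\s:+-]+)'", expr)

print(column.group(1))  # → time
print(lower.group(1))   # → 2020-01-02 00:00:00+00
print(upper.group(1))   # → 2020-01-09 00:00:00+00
```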
@akamensky If it is possible for you to run the query below before drop_chunks
and see if any rows are returned, it would also serve as verification.
If there are any rows output and you have a crash, then the problem described above is indeed what you have. The query also outputs the constraints for the missing dimension slices, which would be useful for us to see what kind of dimensions are missing.
SELECT chunk_id
, dimension_slice_id
, constraint_name
, pg_get_expr(conbin,conrelid)
FROM _timescaledb_catalog.chunk_constraint LEFT JOIN _timescaledb_catalog.dimension_slice sl ON dimension_slice_id = sl.id
JOIN pg_constraint ON constraint_name = conname
WHERE sl.id IS NULL;
@mkindahl I am not sure we have seen this after the last re-creation of tables (the workaround described above). I am taking a short break at the moment and will only be able to confirm (and run the query if the issue has come back) some time next week.
@akamensky We think that the issue you have is solved by the PR referenced above and will be included in 1.7.2 (unless something unexpected happens) so I will close this issue as fixed for now. Please re-open the issue if you discover that it is still present in 1.7.2.
Thank you @mkindahl
@mkindahl Although it is closed already, I will post this here as well. We did not upgrade our production instance yet, and yesterday we got a crash there at the same time drop_chunks is scheduled to run. I ran the query you suggested above on a connection to the DB which got the segfault and got this:
su - postgres
Last login: Tue Jun 23 09:36:46 HKT 2020 on pts/1
-bash-4.2$ psql
psql (12.3)
Type "help" for help.
postgres=# \c dbname
You are now connected to database "dbname" as user "postgres".
dbname=# SELECT chunk_id
dbname-# , dimension_slice_id
dbname-# , constraint_name
dbname-# , pg_get_expr(conbin,conrelid)
dbname-# FROM _timescaledb_catalog.chunk_constraint LEFT JOIN _timescaledb_catalog.dimension_slice sl ON dimension_slice_id = sl.id
dbname-# JOIN pg_constraint ON constraint_name = conname
dbname-# WHERE sl.id IS NULL;
chunk_id | dimension_slice_id | constraint_name | pg_get_expr
----------+--------------------+-----------------+-------------
(0 rows)
dbname=#
Not sure whether that is the expected result or not.
@akamensky No, it was not the expected result. Do you know where the crash happened? Just to verify that it is not a different bug.
Everything looks the same as in all previous crashes, and it happens late at night when no one is around to investigate.
The crash is still inside cmp_slices_by_dimension_id
?
Looks like it (didn't get the core dump for this crash, but dmesg points to the same). FYI, we are on the way to upgrading 1.7.1 to 1.7.2; is this still considered a possible fix for this issue (even with those queries returning 0 rows)?
@akamensky It is, unfortunately, hard to tell with the limited information we have. The crash occurs because the dimension slice does not exist; there are not many ways this can happen, and the patch fixes one of those cases. Even if your query does not return any rows, it can still be this issue if dimension slices are removed when not expected (they might then be re-added by the insert thread, hiding the problem).
That said, the bug fixed above is a real issue even if the crash you're experiencing turns out to be a different one.
If you upgrade and discover that it still crashes, we will re-open this bug and try to figure out if there are any more locks missing.
Agreed. In case it doesn't fix the issue, I think it would be better to get debug symbols for the .so, so that we could get a full stack trace from the core dump (which we can get on our side).
Closing the issue for now. Please reopen if you discover that the issue is not resolved in 1.7.2.
@mkindahl we are still getting a crash on 1.7.2. We pushed the upgrade through all environments last week and for most of the week they ran just fine, but we got a crash last night with:
[ +38.508593] postmaster[1792]: segfault at 4 ip 00007f52bbb6a659 sp 00007ffe554825a8 error 4 in timescaledb-1.7.2.so[7f52bbb3c000+6a000]
Running the query above:
dbname=# SELECT chunk_id
dbname-# , dimension_slice_id
dbname-# , constraint_name
dbname-# , pg_get_expr(conbin,conrelid)
dbname-# FROM _timescaledb_catalog.chunk_constraint LEFT JOIN _timescaledb_catalog.dimension_slice sl ON dimension_slice_id = sl.id
dbname-# JOIN pg_constraint ON constraint_name = conname
dbname-# WHERE sl.id IS NULL;
chunk_id | dimension_slice_id | constraint_name | pg_get_expr
----------+--------------------+-------------------+-----------------------------------------------------------------------
332541 | 128285 | constraint_128285 | (_timescaledb_internal.get_partition_hash("surfaceName") < 536870911)
(1 row)
One other instance crashed with a different segfault message, which also happened exactly when we called drop_chunks:
[Jul21 23:00] postmaster[12746]: segfault at 7ffcb756fe70 ip 00000000004df08f sp 00007ffcb756fe60 error 6 in postgres[400000+735000]
This one returns empty results with the above query.
Edit: I enabled core dumps on that host, so I need to wait and see the stack trace for this one.
Core dump from the first crash stack trace is:
Core was generated by `postgres: postgres dbname [local] SELECT '.
Program terminated with signal 11, Segmentation fault.
#0 0x00007fba0b875659 in cmp_slices_by_dimension_id () from /usr/pgsql-12/lib/timescaledb-1.7.2.so
Missing separate debuginfos, use: debuginfo-install postgresql12-server-12.3-1PGDG.rhel7.x86_64
(gdb) where
#0 0x00007fba0b875659 in cmp_slices_by_dimension_id () from /usr/pgsql-12/lib/timescaledb-1.7.2.so
#1 0x00000000008bd8cd in pg_qsort ()
#2 0x00007fba0b87591a in ts_hypercube_from_constraints () from /usr/pgsql-12/lib/timescaledb-1.7.2.so
#3 0x00007fba0b861f31 in chunk_build_from_tuple_and_stub () from /usr/pgsql-12/lib/timescaledb-1.7.2.so
#4 0x00007fba0b861ff3 in chunk_tuple_found () from /usr/pgsql-12/lib/timescaledb-1.7.2.so
#5 0x00007fba0b885f39 in ts_scanner_scan () from /usr/pgsql-12/lib/timescaledb-1.7.2.so
#6 0x00007fba0b861c73 in chunk_create_from_stub () from /usr/pgsql-12/lib/timescaledb-1.7.2.so
#7 0x00007fba0b861d2f in chunk_scan_context_add_chunk () from /usr/pgsql-12/lib/timescaledb-1.7.2.so
#8 0x00007fba0b861b55 in chunk_scan_ctx_foreach_chunk_stub () from /usr/pgsql-12/lib/timescaledb-1.7.2.so
#9 0x00007fba0b864599 in ts_chunk_get_chunks_in_time_range () from /usr/pgsql-12/lib/timescaledb-1.7.2.so
#10 0x00007fba0b865b4e in ts_chunk_do_drop_chunks () from /usr/pgsql-12/lib/timescaledb-1.7.2.so
#11 0x00007fba0b866069 in ts_chunk_drop_chunks () from /usr/pgsql-12/lib/timescaledb-1.7.2.so
#12 0x000000000061d948 in ExecMakeFunctionResultSet ()
#13 0x000000000063b0c3 in ExecProjectSRF ()
#14 0x000000000063b1c5 in ExecProjectSet ()
#15 0x0000000000614b62 in standard_ExecutorRun ()
#16 0x000000000076366b in PortalRunSelect ()
#17 0x0000000000764a0f in PortalRun ()
#18 0x0000000000760af5 in exec_simple_query ()
#19 0x0000000000761d92 in PostgresMain ()
#20 0x0000000000484022 in ServerLoop ()
#21 0x00000000006f14c3 in PostmasterMain ()
#22 0x0000000000484f23 in main ()
@mkindahl I am unable to reopen this issue (there is no button for me to reopen; I guess the Github repo configuration does not allow non-owners to reopen).
@akamensky I'm reopening since we have a stack trace and it crashes in 1.7.2. Strange that you cannot re-open. We should check that.
@mkindahl thanks. The segfault we see in our prod (mentioned above) was raised as a separate issue #2143 since the stack trace is very different. But the one in staging still looks very similar to what we had before, and it is on 1.7.2.
@akamensky You can get this effect because a previous concurrent execution of INSERT
and drop_chunks
can orphan some dimension slices.
You should be able to repair that database using the function below, but keep in mind that it has only been tested on some basic cases, so be careful about what you run it against.
-- Recreate missing dimension slices that might be missing due to a
-- bug that is fixed in this release. If the dimension slice table is
-- broken and there are dimension slices missing from the table, we
-- will repair it by:
--
-- 1. Finding all chunk constraints that have missing dimension
-- slices and extract the constraint expression from the
-- associated constraint.
--
-- 2. Parse the constraint expression and extract the column name,
-- and upper and lower range values as text.
--
-- 3. Use the column type to construct the range values (UNIX
-- microseconds) from these values.
CREATE PROCEDURE repair_dimension_slice()
LANGUAGE SQL
AS $BODY$
INSERT INTO _timescaledb_catalog.dimension_slice
WITH
-- All dimension slices that are mentioned in the chunk_constraint
-- table but are missing from the dimension_slice table.
missing_slices AS (
SELECT hypertable_id,
chunk_id,
dimension_slice_id,
constraint_name,
attname AS column_name,
pg_get_expr(conbin, conrelid) AS constraint_expr
FROM _timescaledb_catalog.chunk_constraint cc
JOIN _timescaledb_catalog.chunk ch ON cc.chunk_id = ch.id
JOIN pg_constraint ON conname = constraint_name
JOIN pg_namespace ns ON connamespace = ns.oid AND ns.nspname = ch.schema_name
JOIN pg_attribute ON attnum = conkey[1] AND attrelid = conrelid
WHERE
dimension_slice_id NOT IN (SELECT id FROM _timescaledb_catalog.dimension_slice)
),
-- Unparsed range start and end for each dimension slice id that
-- is missing.
unparsed_missing_slices AS (
SELECT di.id AS dimension_id,
dimension_slice_id,
constraint_name,
column_type,
column_name,
(SELECT SUBSTRING(constraint_expr, $$>=\s*'?([\w\d\s:+-]+)'?$$)) AS range_start,
(SELECT SUBSTRING(constraint_expr, $$<\s*'?([\w\d\s:+-]+)'?$$)) AS range_end
FROM missing_slices JOIN _timescaledb_catalog.dimension di USING (hypertable_id, column_name)
)
SELECT DISTINCT
dimension_slice_id,
dimension_id,
CASE
WHEN column_type = 'timestamptz'::regtype THEN
_timescaledb_internal.time_to_internal(range_start::timestamptz)
WHEN column_type = 'timestamp'::regtype THEN
_timescaledb_internal.time_to_internal(range_start::timestamp)
WHEN column_type = 'date'::regtype THEN
_timescaledb_internal.time_to_internal(range_start::date)
ELSE
CASE
WHEN range_start IS NULL
THEN -9223372036854775808
ELSE _timescaledb_internal.time_to_internal(range_start::bigint)
END
END AS range_start,
CASE
WHEN column_type = 'timestamptz'::regtype THEN
_timescaledb_internal.time_to_internal(range_end::timestamptz)
WHEN column_type = 'timestamp'::regtype THEN
_timescaledb_internal.time_to_internal(range_end::timestamp)
WHEN column_type = 'date'::regtype THEN
_timescaledb_internal.time_to_internal(range_end::date)
ELSE
CASE WHEN range_end IS NULL
THEN 9223372036854775807
ELSE _timescaledb_internal.time_to_internal(range_end::bigint)
END
END AS range_end
FROM unparsed_missing_slices;
$BODY$;
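A rough Python model of the CASE conversion in the procedure above may help follow the logic. It assumes the internal representation is Unix microseconds (which matches the range values shown earlier, e.g. 1577923200000000 for 2020-01-02), and uses the int64 min/max sentinels for open-ended bounds; `to_internal` is a hypothetical helper, not part of TimescaleDB:

```python
from datetime import datetime, timezone

INT64_MIN, INT64_MAX = -2**63, 2**63 - 1

def to_internal(bound, column_type, *, is_start):
    # A missing bound becomes the sentinel used for open-ended slices.
    if bound is None:
        return INT64_MIN if is_start else INT64_MAX
    # Timestamp-like columns: convert to Unix microseconds (assumed here
    # to match what time_to_internal produces; bounds taken as UTC).
    if column_type in ("timestamp", "timestamptz", "date"):
        dt = datetime.fromisoformat(bound).replace(tzinfo=timezone.utc)
        return int(dt.timestamp() * 1_000_000)
    # Default case (mirroring the script's fallback): treat as bigint.
    return int(bound)

print(to_internal("2020-01-02 00:00:00", "timestamptz", is_start=True))
# → 1577923200000000 (the range_start shown in the records above)
print(to_internal(None, "bigint", is_start=False))
# → 9223372036854775807
```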
Update: I am making some changes to the repair script above so that it by default tries to convert to bigint and only handles timestamps differently.
ahh, thanks, let me try that out tomorrow.
@mkindahl Does not seem to work:
dbname=# call repair_dimension_slice();
ERROR: null value in column "range_start" violates not-null constraint
DETAIL: Failing row contains (21303, 26, null, null).
CONTEXT: SQL function "repair_dimension_slice" statement 1
I edited the script above so that it by default tries to convert to bigint. That should avoid the bad NULL.
@mkindahl we've rebuilt the database in our staging environment to make sure any leftover missing relations are gone. Within 24 hours we again got missing relations:
dbname=# SELECT chunk_id
, dimension_slice_id
, constraint_name
, pg_get_expr(conbin,conrelid)
FROM _timescaledb_catalog.chunk_constraint LEFT JOIN _timescaledb_catalog.dimension_slice sl ON dimension_slice_id = sl.id
JOIN pg_constraint ON constraint_name = conname
WHERE sl.id IS NULL;
chunk_id | dimension_slice_id | constraint_name | pg_get_expr
----------+--------------------+-----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------
7647 | 3001 | constraint_3001 | (_timescaledb_internal.get_partition_hash("col_name") < 536870911)
7651 | 2999 | constraint_2999 | ((_timescaledb_internal.get_partition_hash("col_name") >= 1073741822) AND (_timescaledb_internal.get_partition_hash("col_name") < 1610612733))
7653 | 3008 | constraint_3008 | (_timescaledb_internal.get_partition_hash("col_name") >= 1610612733)
7657 | 3004 | constraint_3004 | ((_timescaledb_internal.get_partition_hash("col_name") >= 536870911) AND (_timescaledb_internal.get_partition_hash("col_name") < 1073741822))
7649 | 3002 | constraint_3002 | (_timescaledb_internal.get_partition_hash("col_name") < 536870911)
7650 | 2997 | constraint_2997 | ((_timescaledb_internal.get_partition_hash("col_name") >= 1073741822) AND (_timescaledb_internal.get_partition_hash("col_name") < 1610612733))
7654 | 3007 | constraint_3007 | (_timescaledb_internal.get_partition_hash("col_name") >= 1610612733)
7656 | 3005 | constraint_3005 | ((_timescaledb_internal.get_partition_hash("col_name") >= 536870911) AND (_timescaledb_internal.get_partition_hash("col_name") < 1073741822))
(8 rows)
The above is with 1.7.2. Which means the issue still exists there.
@mkindahl we also got another crash when calling TRUNCATE or DROP TABLE on a table that seems to have those missing relations:
gdb /usr/pgsql-12/bin/postmaster /data/dump/core.postmaster.1597213605.1616
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-110.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/pgsql-12/bin/postgres...Reading symbols from /usr/pgsql-12/bin/postgres...(no debugging symbols found)...done.
(no debugging symbols found)...done.
warning: core file may not match specified executable file.
[New LWP 1616]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `postgres: postgres dbname [local] TRUNCATE TABLE '.
Program terminated with signal 11, Segmentation fault.
#0 0x00007fa4c07506b3 in chunk_delete () from /usr/pgsql-12/lib/timescaledb-1.7.2.so
Missing separate debuginfos, use: debuginfo-install postgresql12-server-12.3-1PGDG.rhel7.x86_64
(gdb) where
#0 0x00007fa4c07506b3 in chunk_delete () from /usr/pgsql-12/lib/timescaledb-1.7.2.so
#1 0x00007fa4c0750a2c in ts_chunk_delete_by_hypertable_id () from /usr/pgsql-12/lib/timescaledb-1.7.2.so
#2 0x00007fa4c07702b4 in process_truncate () from /usr/pgsql-12/lib/timescaledb-1.7.2.so
#3 0x00007fa4c076f862 in timescaledb_ddl_command_start () from /usr/pgsql-12/lib/timescaledb-1.7.2.so
#4 0x00000000007632d6 in PortalRunUtility ()
#5 0x0000000000763d27 in PortalRunMulti ()
#6 0x0000000000764905 in PortalRun ()
#7 0x0000000000760af5 in exec_simple_query ()
#8 0x0000000000761d92 in PostgresMain ()
#9 0x0000000000484022 in ServerLoop ()
#10 0x00000000006f14c3 in PostmasterMain ()
#11 0x0000000000484f23 in main ()
(gdb)
Not sure if this is related, please advise.
@akamensky It seems related. I suspect that there is a race between drop_chunks
calls as well. Do I understand it correctly in that you are running drop_chunks
in parallel with TRUNCATE TABLE
and/or DROP TABLE
on the hypertable that you're running drop_chunks
on?
Yeah, range_start should not be NULL, so it is not surprising that it fails. Could you add the result of running the missing_slices query above?
Looking closer at this one, it seems to be a consequence of the missing dimension slice, not a separate bug. I can easily reproduce it by manually dropping the dimension slices, and it is obviously triggered because the dimension slice cannot be found.
(gdb) l
2655 .lockmode = LockTupleExclusive,
2656 .waitpolicy = LockWaitBlock,
2657 };
2658 DimensionSlice *slice =
2659 ts_dimension_slice_scan_by_id_and_lock(cc->fd.dimension_slice_id,
2660 &tuplock,
2661 CurrentMemoryContext);
2662 if (ts_chunk_constraint_scan_by_dimension_slice_id(slice->fd.id,
2663 NULL,
2664 CurrentMemoryContext) == 0)
(gdb) p slice
$2 = (DimensionSlice *) 0x0
@mkindahl On a completely clean rebuild of Timescale using 1.7.2 -- the issue did not happen for a few days at first, but came back with the same stack trace as above. There were no missing slices initially (being a completely new rebuild). Once the error returned, there are:
db_name=# SELECT hypertable_id,
db_name-# chunk_id,
db_name-# dimension_slice_id,
db_name-# constraint_name,
db_name-# attname AS column_name,
db_name-# pg_get_expr(conbin, conrelid) AS constraint_expr
db_name-# FROM _timescaledb_catalog.chunk_constraint cc
db_name-# JOIN _timescaledb_catalog.chunk ch ON cc.chunk_id = ch.id
db_name-# JOIN pg_constraint ON conname = constraint_name
db_name-# JOIN pg_namespace ns ON connamespace = ns.oid AND ns.nspname = ch.schema_name
db_name-# JOIN pg_attribute ON attnum = conkey[1] AND attrelid = conrelid
db_name-# WHERE
db_name-# dimension_slice_id NOT IN (SELECT id FROM _timescaledb_catalog.dimension_slice);
 hypertable_id | chunk_id | dimension_slice_id | constraint_name | column_name | constraint_expr
---------------+----------+--------------------+-----------------+-------------+-----------------
             3 |     9103 |               3186 | constraint_3186 | col_name    | (_timescaledb_internal.get_partition_hash("col_name") < 536870911)
             3 |     9104 |               3183 | constraint_3183 | col_name    | (_timescaledb_internal.get_partition_hash("col_name") >= 1610612733)
             7 |     9105 |               3185 | constraint_3185 | col_name    | (_timescaledb_internal.get_partition_hash("col_name") < 536870911)
             3 |     9106 |               3182 | constraint_3182 | col_name    | ((_timescaledb_internal.get_partition_hash("col_name") >= 1073741822) AND (_timescaledb_internal.get_partition_hash("col_name") < 1610612733))
             7 |     9107 |               3184 | constraint_3184 | col_name    | (_timescaledb_internal.get_partition_hash("col_name") >= 1610612733)
             3 |     9108 |               3188 | constraint_3188 | col_name    | ((_timescaledb_internal.get_partition_hash("col_name") >= 536870911) AND (_timescaledb_internal.get_partition_hash("col_name") < 1073741822))
             7 |     9109 |               3180 | constraint_3180 | col_name    | ((_timescaledb_internal.get_partition_hash("col_name") >= 1073741822) AND (_timescaledb_internal.get_partition_hash("col_name") < 1610612733))
             7 |     9110 |               3187 | constraint_3187 | col_name    | ((_timescaledb_internal.get_partition_hash("col_name") >= 536870911) AND (_timescaledb_internal.get_partition_hash("col_name") < 1073741822))
(8 rows)
db_name=#
It does look that the issue is still there.
Do I understand it correctly in that you are running drop_chunks in parallel with TRUNCATE TABLE and/or DROP TABLE on the hypertable that you're running drop_chunks on?
No. Truncate/drop on tables is run outside of drop_chunks, but only at times when the error is already present (and thus relations are missing), which would of course be the case, because truncate or drop will attempt to delete them.
And this is with concurrent INSERT being run? Not with concurrent drop_chunks calls?
Yes. Not long ago we changed from running drop_chunks on multiple databases (one call per database) in parallel to running it sequentially (waiting until the previous drop_chunks succeeded or failed before starting the next one, with a delay). That did not fix the issue. Concurrent INSERTs are present in both cases, so a single drop_chunks plus a parallel high rate of INSERTs appears to be the common denominator here.
UPD: internally we concluded that slow underlying storage seems to increase the likelihood of this issue. Our staging environment is on relatively slow disks, and the issue reappears there much faster than in environments with much faster storage.
This is then likely to be a missing lock still, and it is true that a slow underlying storage would increase the likelihood of a race condition because of a missing lock. We have fixed the function that reads hypercube information from the dimension slice table and ensured that it takes tuple locks, so hopefully we have covered all cases now and it should be available in 1.7.3 (that we're in the process of releasing).
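The effect of taking a tuple lock across the whole read-and-use sequence can be illustrated with a toy model. This is not TimescaleDB code: the struct, the global "catalog entry", and the mutex standing in for the row-level tuple lock are all hypothetical. The point is that while the reader holds the lock, a concurrent delete must wait, so the reader can never observe the entry vanishing mid-operation.

```c
#include <pthread.h>
#include <stdlib.h>

/* Toy stand-in for a dimension-slice catalog row (hypothetical). */
typedef struct { int id; } Slice;

static Slice *catalog_entry;
static pthread_mutex_t tuple_lock = PTHREAD_MUTEX_INITIALIZER;

/* Simulates a concurrent drop deleting the catalog row. */
static void *
concurrent_drop(void *arg)
{
	(void) arg;
	pthread_mutex_lock(&tuple_lock);	/* blocks until the reader is done */
	free(catalog_entry);
	catalog_entry = NULL;
	pthread_mutex_unlock(&tuple_lock);
	return NULL;
}

/* Returns 0 if the reader saw a valid entry and the drop then completed. */
int
run_lock_demo(void)
{
	pthread_t drop_thread;
	int saw_valid_entry;

	catalog_entry = malloc(sizeof(Slice));
	catalog_entry->id = 42;

	/* Reader takes the "tuple lock" before resolving the entry. */
	pthread_mutex_lock(&tuple_lock);
	pthread_create(&drop_thread, NULL, concurrent_drop, NULL);
	saw_valid_entry = (catalog_entry != NULL && catalog_entry->id == 42);
	pthread_mutex_unlock(&tuple_lock);

	pthread_join(drop_thread, NULL);
	return (saw_valid_entry && catalog_entry == NULL) ? 0 : 1;
}
```

Without the lock in the reader, the delete could run between the lookup and the use of the entry, which is exactly the window that a slow disk widens.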
If you can post a message when you've tested this, regardless of whether it works or not, we would be very grateful. I'm keeping the bug closed for now, but we will re-open it if it turns out that there are still lingering issues.
Thanks @mkindahl I've seen the release. We are going to upgrade staging and uat environments to 1.7.3 today and will observe for next week. Before it took a week or two to get the error back, so please allow some time. Once we observe for enough time I will confirm here. Thanks.
@mkindahl Just upgraded to 1.7.3 without rebuilding the data (assumed that the segfault itself would be fixed in this version). But the segfault is still there:
[544437.027061] postmaster[25376]: segfault at 4 ip 00007f95f577eb69 sp 00007ffe6c4f6c28 error 4 in timescaledb-1.7.3.so[7f95f5750000+6a000]
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-110.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/pgsql-12/bin/postgres...Reading symbols from /usr/pgsql-12/bin/postgres...(no debugging symbols found)...done.
(no debugging symbols found)...done.
warning: core file may not match specified executable file.
[New LWP 25376]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `postgres: postgres db_name [local] SELECT '.
Program terminated with signal 11, Segmentation fault.
#0 0x00007f95f577eb69 in cmp_slices_by_dimension_id () from /usr/pgsql-12/lib/timescaledb-1.7.3.so
Missing separate debuginfos, use: debuginfo-install postgresql12-server-12.3-1PGDG.rhel7.x86_64
(gdb) where
#0 0x00007f95f577eb69 in cmp_slices_by_dimension_id () from /usr/pgsql-12/lib/timescaledb-1.7.3.so
#1 0x00000000008bd8cd in pg_qsort ()
#2 0x00007f95f577ee52 in ts_hypercube_from_constraints () from /usr/pgsql-12/lib/timescaledb-1.7.3.so
#3 0x00007f95f576b041 in chunk_build_from_tuple_and_stub () from /usr/pgsql-12/lib/timescaledb-1.7.3.so
#4 0x00007f95f576b103 in chunk_tuple_found () from /usr/pgsql-12/lib/timescaledb-1.7.3.so
#5 0x00007f95f578f6c9 in ts_scanner_scan () from /usr/pgsql-12/lib/timescaledb-1.7.3.so
#6 0x00007f95f576ad83 in chunk_create_from_stub () from /usr/pgsql-12/lib/timescaledb-1.7.3.so
#7 0x00007f95f576ae3f in chunk_scan_context_add_chunk () from /usr/pgsql-12/lib/timescaledb-1.7.3.so
#8 0x00007f95f576ac65 in chunk_scan_ctx_foreach_chunk_stub () from /usr/pgsql-12/lib/timescaledb-1.7.3.so
#9 0x00007f95f576d4f9 in ts_chunk_get_chunks_in_time_range () from /usr/pgsql-12/lib/timescaledb-1.7.3.so
#10 0x00007f95f576ed1e in ts_chunk_do_drop_chunks () from /usr/pgsql-12/lib/timescaledb-1.7.3.so
#11 0x00007f95f576f559 in ts_chunk_drop_chunks () from /usr/pgsql-12/lib/timescaledb-1.7.3.so
#12 0x000000000061d948 in ExecMakeFunctionResultSet ()
#13 0x000000000063b0c3 in ExecProjectSRF ()
#14 0x000000000063b1c5 in ExecProjectSet ()
#15 0x0000000000614b62 in standard_ExecutorRun ()
#16 0x000000000076366b in PortalRunSelect ()
#17 0x0000000000764a0f in PortalRun ()
#18 0x0000000000760af5 in exec_simple_query ()
#19 0x0000000000761d92 in PostgresMain ()
#20 0x0000000000484022 in ServerLoop ()
#21 0x00000000006f14c3 in PostmasterMain ()
#22 0x0000000000484f23 in main ()
That is with missing slices as caused by 1.7.2. We will rebuild the data, and see whether slices would go missing still.
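The top frame of the trace above, cmp_slices_by_dimension_id() called from pg_qsort(), is consistent with the slice array handed to the sort containing a NULL entry for each missing dimension slice. A minimal sketch of that failure mode, with hypothetical struct layout and a defensive comparator for illustration (not the actual TimescaleDB code or fix):

```c
#include <stddef.h>
#include <stdlib.h>

/* Simplified stand-in for TimescaleDB's catalog struct (hypothetical
 * layout; the real definition lives in the extension source). */
typedef struct DimensionSlice
{
	struct { int dimension_id; } fd;
} DimensionSlice;

/* A comparator in the spirit of cmp_slices_by_dimension_id(): it
 * dereferences both elements unconditionally, so an array slot left
 * NULL by a failed catalog lookup segfaults inside qsort. */
static int
cmp_slices(const void *a, const void *b)
{
	const DimensionSlice *sa = *(DimensionSlice *const *) a;
	const DimensionSlice *sb = *(DimensionSlice *const *) b;

	return sa->fd.dimension_id - sb->fd.dimension_id;
}

/* A defensive variant that sorts NULL entries last instead of crashing. */
static int
cmp_slices_null_safe(const void *a, const void *b)
{
	const DimensionSlice *sa = *(DimensionSlice *const *) a;
	const DimensionSlice *sb = *(DimensionSlice *const *) b;

	if (sa == NULL || sb == NULL)
		return (sa == sb) ? 0 : (sa == NULL ? 1 : -1);
	return sa->fd.dimension_id - sb->fd.dimension_id;
}

/* Sort an array with one "missing" slice; returns 0 on the expected order. */
int
run_slice_demo(void)
{
	DimensionSlice s1 = { .fd = { .dimension_id = 2 } };
	DimensionSlice s2 = { .fd = { .dimension_id = 1 } };
	DimensionSlice *slices[3] = { &s1, NULL, &s2 };

	/* Passing cmp_slices here instead would crash on the NULL slot. */
	(void) cmp_slices;
	qsort(slices, 3, sizeof(DimensionSlice *), cmp_slices_null_safe);

	return (slices[0] == &s2 && slices[1] == &s1 && slices[2] == NULL) ? 0 : 1;
}
```

Skipping or erroring on NULL slots would turn the segfault into a clean error, but the underlying cause is still the catalog rows going missing.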
With the rebuilt data we did not get any missing relations / segfaults yet. But I see the following errors in the logs:
2020-09-01 13:00:01.553 TZ [8705] ERROR: query returned no rows
2020-09-01 13:00:01.553 TZ [8705] CONTEXT: PL/pgSQL function _timescaledb_internal.dimension_slice_get_constraint_sql(integer) line 9 at SQL statement
PL/pgSQL function _timescaledb_internal.chunk_constraint_add_table_constraint(_timescaledb_catalog.chunk_constraint) line 15 at assignment
....
< followed by a few INSERT queries into hypertables >
and
2020-09-01 09:00:01.706 TZ [6160] ERROR: could not open relation with OID 2912420
...
< followed by insert statement into one of hypertables >
also
2020-09-01 16:00:03.501 TZ [45626] ERROR: deadlock detected
2020-09-01 16:00:03.501 TZ [45626] DETAIL: Process 45626 waits for RowExclusiveLock on relation 2918460 of database 2887017; blocked by process 48154.
Process 48154 waits for AccessExclusiveLock on relation 2918478 of database 2887017; blocked by process 45626.
...
< followed by insert statement into one of hypertables >
These all occur at the times when drop_chunks is being called.
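The "deadlock detected" entry in the log above shows the classic pattern: two sessions each hold a lock on one of relations 2918460 and 2918478 and wait for the other. The standard remedy is for every code path to acquire locks in one global order (for example, ascending OID), so the circular wait cannot form. A toy sketch of that ordering rule (illustrative types and helper, not TimescaleDB code):

```c
#include <stddef.h>

/* Hypothetical stand-in for a lockable relation. */
typedef struct { unsigned int oid; } Relation;

/* Decide a canonical acquisition order for a pair of relations:
 * always lock the lower OID first, regardless of argument order. */
static void
order_pair(Relation *a, Relation *b, Relation **first, Relation **second)
{
	if (a->oid <= b->oid)
	{
		*first = a;
		*second = b;
	}
	else
	{
		*first = b;
		*second = a;
	}
}

/* Returns 0 if two "sessions" locking the same pair in opposite
 * argument order still agree on the acquisition order. */
int
run_order_demo(void)
{
	Relation r1 = { .oid = 2918460 };
	Relation r2 = { .oid = 2918478 };
	Relation *f1, *s1, *f2, *s2;

	order_pair(&r1, &r2, &f1, &s1);	/* session 1's argument order */
	order_pair(&r2, &r1, &f2, &s2);	/* session 2's opposite order */

	/* Same order for both sessions means no circular wait is possible. */
	return (f1 == f2 && s1 == s2) ? 0 : 1;
}
```

Whether the actual fix in the extension takes this form is an assumption; the log only tells us the two processes requested the relation locks in opposite orders.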
@mkindahl 1.7.3 did not resolve the issue. After a few days without a segfault we got it again:
[Sep 3 14:59] postmaster[129640]: segfault at 4 ip 00007f95f577eb69 sp 00007ffe6c4f6c28 error 4 in timescaledb-1.7.3.so[7f95f5750000+6a000]
Stack trace from core dump of this segfault:
gdb /usr/pgsql-12/bin/postmaster /data/dump/core.postmaster.1599116401.129640
Reading symbols from /usr/pgsql-12/bin/postgres...Reading symbols from /usr/pgsql-12/bin/postgres...(no debugging symbols found)...done.
(no debugging symbols found)...done.
warning: core file may not match specified executable file.
[New LWP 129640]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `postgres: postgres db_name [local] SELECT '.
Program terminated with signal 11, Segmentation fault.
#0 0x00007f95f577eb69 in cmp_slices_by_dimension_id () from /usr/pgsql-12/lib/timescaledb-1.7.3.so
Missing separate debuginfos, use: debuginfo-install postgresql12-server-12.3-1PGDG.rhel7.x86_64
(gdb) where
#0 0x00007f95f577eb69 in cmp_slices_by_dimension_id () from /usr/pgsql-12/lib/timescaledb-1.7.3.so
#1 0x00000000008bd8cd in pg_qsort ()
#2 0x00007f95f577ee52 in ts_hypercube_from_constraints () from /usr/pgsql-12/lib/timescaledb-1.7.3.so
#3 0x00007f95f576b041 in chunk_build_from_tuple_and_stub () from /usr/pgsql-12/lib/timescaledb-1.7.3.so
#4 0x00007f95f576b103 in chunk_tuple_found () from /usr/pgsql-12/lib/timescaledb-1.7.3.so
#5 0x00007f95f578f6c9 in ts_scanner_scan () from /usr/pgsql-12/lib/timescaledb-1.7.3.so
#6 0x00007f95f576ad83 in chunk_create_from_stub () from /usr/pgsql-12/lib/timescaledb-1.7.3.so
#7 0x00007f95f576ae3f in chunk_scan_context_add_chunk () from /usr/pgsql-12/lib/timescaledb-1.7.3.so
#8 0x00007f95f576ac65 in chunk_scan_ctx_foreach_chunk_stub () from /usr/pgsql-12/lib/timescaledb-1.7.3.so
#9 0x00007f95f576d4f9 in ts_chunk_get_chunks_in_time_range () from /usr/pgsql-12/lib/timescaledb-1.7.3.so
#10 0x00007f95f576ed1e in ts_chunk_do_drop_chunks () from /usr/pgsql-12/lib/timescaledb-1.7.3.so
#11 0x00007f95f576f559 in ts_chunk_drop_chunks () from /usr/pgsql-12/lib/timescaledb-1.7.3.so
#12 0x000000000061d948 in ExecMakeFunctionResultSet ()
#13 0x000000000063b0c3 in ExecProjectSRF ()
#14 0x000000000063b1c5 in ExecProjectSet ()
#15 0x0000000000614b62 in standard_ExecutorRun ()
#16 0x000000000076366b in PortalRunSelect ()
#17 0x0000000000764a0f in PortalRun ()
#18 0x0000000000760af5 in exec_simple_query ()
#19 0x0000000000761d92 in PostgresMain ()
#20 0x0000000000484022 in ServerLoop ()
#21 0x00000000006f14c3 in PostmasterMain ()
#22 0x0000000000484f23 in main ()
(gdb)
From the log:
2020-09-03 15:00:01.931 HKT [129640] LOG: connection authorized: user=postgres database=db_name application_name=psql
2020-09-03 15:00:01.966 HKT [22475] LOG: server process (PID 129640) was terminated by signal 11: Segmentation fault
2020-09-03 15:00:01.966 HKT [22475] DETAIL: Failed process was running: SELECT drop_chunks(interval '1 hours');
2020-09-03 15:00:01.966 HKT [22475] LOG: terminating any other active server processes
From psql:
postgres=# \c db_name
You are now connected to database "db_name" as user "postgres".
db_name=# \dx
List of installed extensions
Name | Version | Schema | Description
-------------+---------+------------+-------------------------------------------------------------------
plpgsql | 1.0 | pg_catalog | PL/pgSQL procedural language
timescaledb | 1.7.3 | public | Enables scalable inserts and complex queries for time-series data
(2 rows)
db_name=# SELECT chunk_id
db_name-# , dimension_slice_id
db_name-# , constraint_name
db_name-# , pg_get_expr(conbin,conrelid)
db_name-# FROM _timescaledb_catalog.chunk_constraint LEFT JOIN _timescaledb_catalog.dimension_slice sl ON dimension_slice_id = sl.id
db_name-# JOIN pg_constraint ON constraint_name = conname
db_name-# WHERE sl.id IS NULL;
chunk_id | dimension_slice_id | constraint_name | pg_get_expr
----------+--------------------+-----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------
7857 | 3695 | constraint_3695 | (_timescaledb_internal.get_partition_hash("col_name") < 536870911)
7861 | 3699 | constraint_3699 | (_timescaledb_internal.get_partition_hash("col_name") >= 1610612733)
7867 | 3690 | constraint_3690 | ((_timescaledb_internal.get_partition_hash("col_name") >= 1073741822) AND (_timescaledb_internal.get_partition_hash("col_name") < 1610612733))
7870 | 3686 | constraint_3686 | ((_timescaledb_internal.get_partition_hash("col_name") >= 536870911) AND (_timescaledb_internal.get_partition_hash("col_name") < 1073741822))
(4 rows)
db_name=#
@akamensky Thanks for the information. Reopening.
Relevant system information:
PostgreSQL version (postgres --version): 12.3
TimescaleDB version (\dx in psql): 1.7.1
Describe the bug
After the upgrade of PostgreSQL (10.9 -> 12.3) and the TimescaleDB extension (1.6.1 -> 1.7.1) we see repeated segfaults when executing drop_chunks.
Under GDB the segfault yields:
To Reproduce
Not certain how to reproduce it on any other setup. This happens currently in our staging environment.
Expected behavior
Chunks dropped and disk space freed.
Actual behavior
Segfault