timescale / timescaledb

An open-source time-series SQL database optimized for fast ingest and complex queries. Packaged as a PostgreSQL extension.
https://www.timescale.com/

Segfault when drop_chunks with 1.7.1 #1986

akamensky closed this issue 4 years ago

akamensky commented 4 years ago

Relevant system information:

Describe the bug: After the upgrade of PostgreSQL (10.9 -> 12.3) and the TimescaleDB extension (1.6.1 -> 1.7.1), we see repeated segfaults when executing drop_chunks:

Under GDB the segfault yields:

Program received signal SIGSEGV, Segmentation fault.
0x00007fa710efc259 in cmp_slices_by_dimension_id () from /usr/pgsql-12/lib/timescaledb-1.7.1.so

To Reproduce: Not certain how to reproduce it on any other setup. This currently happens in our staging environment.

Expected behavior: Chunks dropped and disk space freed.

Actual behavior: Segfault.

mkindahl commented 4 years ago

Can you please add a stack trace to the issue and show the call to drop_chunks that you use?

akamensky commented 4 years ago

@mkindahl I would have provided a stack trace in the top message if the TimescaleDB RPMs came with debug symbols (i.e., in a "debug" package). However, there are none. The top message already contains everything you'd see without debug symbols.

akamensky commented 4 years ago

FWIW, we "fixed" the issue by dropping the tables and re-creating them one by one (checking that drop_chunks works after every table). If that starts happening in our production environment, though, it would be 100% unacceptable.

mkindahl commented 4 years ago

It shouldn't be necessary to fix issues in this way, so we should fix it.

I'm looking into whether we can add debug symbols to our packages, but do you have the call to drop_chunks that you used and the definition of the hypertable/continuous aggregate that you applied drop_chunks to? That would at least reduce the amount of code I have to go through.

akamensky commented 4 years ago

drop_chunks is done via the following cron entry:

00 */1 * * 1-5 postgres out=$(psql -U postgres -d db_name -c "SELECT drop_chunks(interval '1 hours');" 2>&1); if [[ "$?" -ne "0" ]]; then echo "$out"; fi;

Note that the out=... wrapper is just so we only get emails from cron when a failure happens.

We don't call drop_chunks on specific hypertables. But this worked fine pre-upgrade.
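
For reference, the 1.x drop_chunks API also accepts a table name, which could help narrow down which hypertable triggers the crash. A sketch (metrics is a hypothetical hypertable name used only for illustration):

-- Drop chunks older than 1 hour for a single hypertable (TimescaleDB 1.x API).
SELECT drop_chunks(interval '1 hour', 'metrics');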

As for table definitions -- they are considered sensitive information that I cannot publish here; an example of one table with the necessary redactions can be found in #1841.

mkindahl commented 4 years ago

@akamensky Thank you. I'll see if I can find anything given the information we have.

akamensky commented 4 years ago

@mkindahl we possibly found another prerequisite for triggering this issue. In our setup we write data to TimescaleDB from a KafkaConnect cluster (multiple sinks writing data to different tables across multiple databases on the same instance).

I've noticed that the initial PID where the segfault happens is the one for the connection from KafkaConnect to the busiest database (unfortunately I don't know which table yet). That DB holds 90% of all data in TimescaleDB.

Following this, we attempted to shut down all KafkaConnect processes, and this appears to have resolved the segfaults (at least none have happened since we started shutting down the KafkaConnect processes before calling drop_chunks).

This obviously is not a good solution, as KafkaConnect is expected to maintain near real-time data writes to the database.

mkindahl commented 4 years ago

@akamensky Thank you, that is useful information.

mkindahl commented 4 years ago

Summary of findings so far, working backwards from the crash.

The crash occurs inside cmp_slices_by_dimension_id, likely because a null pointer is dereferenced.

Running drop_chunks uses only the call path below to reach cmp_slices_by_dimension_id:

cmp_slices_by_dimension_id
pg_qsort
ts_hypercube_slice_sort
ts_hypercube_from_constraints
chunk_build_from_tuple_and_stub
chunk_tuple_found
ts_scanner_scan
chunk_create_from_stub
chunk_scan_context_add_chunk
chunk_scan_ctx_foreach_chunk_stub
ts_chunk_get_chunks_in_time_range
ts_chunk_do_drop_chunks
ts_chunk_drop_chunks
   .
   .
   .

Inside ts_hypercube_from_constraints, each dimension slice is looked up in the metadata using ts_dimension_slice_scan_by_id and added to a hypercube. A null pointer can be stored in the hypercube if cc->fd.dimension_slice_id is not present in the metadata when ts_dimension_slice_scan_by_id is called.

for (i = 0; i < constraints->num_constraints; i++)
{
    ChunkConstraint *cc = chunk_constraints_get(constraints, i);

    if (is_dimension_constraint(cc))
    {
        DimensionSlice *slice;

        Assert(hc->num_slices < constraints->num_dimension_constraints);
        /* Look up the slice in the metadata; returns NULL if the slice id
         * is missing from the dimension_slice table. */
        slice = ts_dimension_slice_scan_by_id(cc->fd.dimension_slice_id, mctx);
        /* Only checked in debug builds; in release builds a NULL slice
         * slips through and is stored in the hypercube. */
        Assert(slice != NULL && hc->num_slices < hc->capacity);
        hc->slices[hc->num_slices++] = slice;
    }
}

The return value is asserted in debug builds, but not in release builds. Setting slice to NULL after the assert in the code above (using a debugger) and continuing the run will indeed generate a segmentation fault.

A race condition between the insert path and the drop_chunks path could generate such a situation if a chunk constraint is added with either a NULL dimension_slice_id or a tentative (or invalid) dimension_slice_id before the slice is added to the dimension_slice table, but the locking order looks correct for both the insert path and the drop_chunks path.

mkindahl commented 4 years ago

After some discussions with @erimatnor, we discovered the following, which indicates why the dimension slice for a chunk constraint cannot be found in the dimension slice table.

When a new chunk is created as part of an insert, new metadata for the chunk is created using chunk_create_metadata_after_lock. The function uses ts_dimension_slice_insert_multi to look for existing dimension slices, but does not keep a lock on the slices it reads. It then adds the new constraint to the chunk_constraint table. If the dimension slice existed, it assumes that the slice is still there after the chunk constraint is added.

To see the dimension slices that are available for a chunk constraint:

postgres=# SELECT chunk_id, dimension_slice_id, constraint_name, range_start, range_end
postgres-#   FROM _timescaledb_catalog.chunk_constraint LEFT JOIN 
postgres-#        _timescaledb_catalog.dimension_slice sl
postgres-#     ON dimension_slice_id = sl.id;
-[ RECORD 1 ]------+---------------------
chunk_id           | 1
dimension_slice_id | 1
constraint_name    | constraint_1
range_start        | 1577923200000000
range_end          | 1578528000000000
-[ RECORD 2 ]------+---------------------
chunk_id           | 1
dimension_slice_id | 2
constraint_name    | constraint_2
range_start        | -9223372036854775808
range_end          | 1073741823

For drop_chunks, the function chunk_tuple_delete is used to remove a chunk tuple from the chunk table after the actual chunk is removed. Prior to actually removing the tuple, chunk_tuple_delete scans the chunk_constraint table for each dimension slice of the chunk (using the function ts_chunk_constraint_scan_by_dimension_slice_id) to see if there are any chunk constraints that still use it; if not, the dimension slice is removed. Note that it does not read the dimension_slice table at this stage and just scans the chunk_constraint table using an AccessShareLock.
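
In SQL terms, the per-slice check is roughly equivalent to the following (a sketch only; the real check is a catalog scan in C, and 2 is a hypothetical slice id):

-- Count the chunk constraints that still reference a given dimension slice.
-- If the count is 0, chunk_tuple_delete deletes that dimension_slice row.
SELECT count(*)
  FROM _timescaledb_catalog.chunk_constraint
 WHERE dimension_slice_id = 2;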

As a result, if chunk_tuple_delete runs just after ts_dimension_slice_insert_multi (in chunk_create_metadata_after_lock), it will conclude that there are no chunk constraints that reference the slice and remove the row. When chunk_create_metadata_after_lock continues, it will add the new constraint to chunk_constraint, creating invalid metadata: there will be chunk constraints referencing dimension slices that do not exist.

By starting two connections to a server, attaching a debugger to one, and setting a breakpoint just after the chunk has been created (on return of hypertable_get_chunk, for example), you can simulate the race condition; the SQL each session might run is sketched after the steps below.

  1. Create a hypertable and fill it with some data.
  2. Set the breakpoint above.
  3. Run an insert that creates a new chunk.
  4. Debugger will stop at the breakpoint.
  5. In a separate session, run a drop_chunks to remove the chunk inserted at step 3 above.
  6. Resume execution of the insert using the debugger.
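
A minimal sketch of the SQL for the two sessions, assuming a hypertable named conditions with an hour-wide chunk interval (all names and values are illustrative only):

-- Session A (debugger attached): steps 1 and 3 above.
CREATE TABLE conditions(time timestamptz NOT NULL, device int, temp float);
SELECT create_hypertable('conditions', 'time', chunk_time_interval => interval '1 hour');
-- Inserting an old row creates a new chunk; the breakpoint hits here.
INSERT INTO conditions VALUES (now() - interval '3 hours', 1, 20.0);

-- Session B: step 5 above, while session A is stopped at the breakpoint.
SELECT drop_chunks(interval '1 hour', 'conditions');

-- Session A: resume in the debugger (step 6); the catalog is now inconsistent.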

If you hit the race condition, the output will then look like this:

postgres=# SELECT chunk_id, dimension_slice_id, constraint_name, range_start, range_end
postgres-#   FROM _timescaledb_catalog.chunk_constraint LEFT JOIN 
postgres-#        _timescaledb_catalog.dimension_slice sl
postgres-#     ON dimension_slice_id = sl.id;
-[ RECORD 1 ]------+---------------------
chunk_id           | 1
dimension_slice_id | 1
constraint_name    | constraint_1
range_start        | [NULL]
range_end          | [NULL]
-[ RECORD 2 ]------+---------------------
chunk_id           | 1
dimension_slice_id | 2
constraint_name    | constraint_2
range_start        | -9223372036854775808
range_end          | 1073741823

In other words, there is a chunk constraint referring to a dimension slice that does not exist.

After this, any statement that reads the chunk constraints and tries to read the associated dimension slices will not find them.

mkindahl commented 4 years ago

@akamensky We have a patch that we think fixes the issue but it would be very good to verify that it indeed fixes your specific situation. Would it be possible for you to test this?

akamensky commented 4 years ago

@mkindahl could you perhaps provide compiled binaries for CentOS 7 amd64? I don't know the details of the build environment that produces the RPM binaries you distribute. Or at least concise build instructions, though I'm not sure when I'd find time to set up a build environment.

mkindahl commented 4 years ago

To check if there are any chunk constraints that are missing their dimension slices, run the following query.

SELECT chunk_id, dimension_slice_id, constraint_name
  FROM _timescaledb_catalog.chunk_constraint
       LEFT JOIN _timescaledb_catalog.dimension_slice sl
       ON dimension_slice_id = sl.id
 WHERE sl.id is NULL;

If there are any chunk constraints that refer to dimension slice ids that do not exist, you will get a list of the chunk id, the missing dimension slice id, and the constraint name of the chunk constraint that refers to it.

 chunk_id | dimension_slice_id | constraint_name 
----------+--------------------+-----------------
        2 |                  1 | constraint_1
(1 row)

mkindahl commented 4 years ago

@akamensky Investigating if I can get an RPM for you.

mkindahl commented 4 years ago

Given the missing dimension slices above, it should be possible to extract and parse the constraint expressions from pg_constraint and re-construct the missing dimension slices using the following query:

WITH missing AS (SELECT chunk_id
              , dimension_slice_id
              , constraint_name
              , pg_get_expr(conbin,conrelid) AS constraint_expr
               FROM _timescaledb_catalog.chunk_constraint
          LEFT JOIN _timescaledb_catalog.dimension_slice sl
             ON dimension_slice_id = sl.id
               JOIN pg_constraint ON conname = constraint_name
              WHERE dimension_slice_id IS NOT NULL
                AND sl.id IS NULL),
     unparsed AS (SELECT chunk_id
               , dimension_slice_id
               , constraint_name
               , COALESCE(SUBSTRING(constraint_expr, '(\w+)\s*(?:>=|<)'), SUBSTRING(constraint_expr, '"([^"]+)"\s*(?:>=|<)')) AS column_name
               , (SELECT SUBSTRING(constraint_expr, $$>=\s*'([\d\s:+-]+)'$$)) AS lower_range
               , (SELECT SUBSTRING(constraint_expr, $$<\s*'([\d\s:+-]+)'$$)) AS upper_range
            FROM missing)
SELECT dimension_slice_id
     , di.id AS dimension_id
     , CASE di.column_type
       WHEN 'bigint'::regtype THEN lower_range::bigint
       WHEN 'timestamptz'::regtype THEN _timescaledb_internal.to_unix_microseconds(lower_range::timestamptz)
       WHEN 'timestamp'::regtype THEN _timescaledb_internal.to_unix_microseconds(lower_range::timestamp::timestamptz)
       WHEN 'date'::regtype THEN _timescaledb_internal.to_unix_microseconds(lower_range::date::timestamptz)
       ELSE NULL
       END AS range_start
     , CASE di.column_type
       WHEN 'bigint'::regtype THEN upper_range::bigint
       WHEN 'timestamptz'::regtype THEN _timescaledb_internal.to_unix_microseconds(upper_range::timestamptz)
       WHEN 'timestamp'::regtype THEN _timescaledb_internal.to_unix_microseconds(upper_range::timestamp)
       WHEN 'date'::regtype THEN _timescaledb_internal.to_unix_microseconds(upper_range::date::timestamptz)
       ELSE NULL
       END AS range_end
  FROM unparsed JOIN _timescaledb_catalog.dimension di USING (column_name)
 WHERE column_name IS NOT NULL;

If the result set contains any rows, those are the rows missing from the _timescaledb_catalog.dimension_slice table.
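
The reconstructed rows could then be written back to the catalog along these lines (a sketch; reconstructed stands for the query above wrapped in a CTE, and you should test against a backup first):

-- Re-insert the reconstructed slices. Column order matches
-- _timescaledb_catalog.dimension_slice (id, dimension_id, range_start, range_end).
INSERT INTO _timescaledb_catalog.dimension_slice (id, dimension_id, range_start, range_end)
SELECT dimension_slice_id, dimension_id, range_start, range_end
  FROM reconstructed;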

mkindahl commented 4 years ago

@akamensky If it is possible for you to run the query below before drop_chunks and see whether any rows come back, that would also serve as verification.

If any rows are output and you have a crash, then the problem described above is indeed what you have. The query also outputs the constraints for the missing dimension slices, which would be useful for us to see what kind of dimensions are missing.

SELECT chunk_id
     , dimension_slice_id
     , constraint_name
     , pg_get_expr(conbin,conrelid)
  FROM _timescaledb_catalog.chunk_constraint LEFT JOIN _timescaledb_catalog.dimension_slice sl ON dimension_slice_id = sl.id
  JOIN pg_constraint ON constraint_name = conname
 WHERE sl.id IS NULL;

akamensky commented 4 years ago

@mkindahl I am not sure we have seen this since the last re-creation of the tables (the workaround described above). I am taking a short break at the moment and will only be able to confirm (and run the query if the issue has come back) some time next week.

mkindahl commented 4 years ago

@akamensky We think that the issue you have is solved by the PR referenced above and will be included in 1.7.2 (unless something unexpected happens) so I will close this issue as fixed for now. Please re-open the issue if you discover that it is still present in 1.7.2.

akamensky commented 4 years ago

Thank you @mkindahl

akamensky commented 4 years ago

@mkindahl Although it is closed already, I will post this here as well. We have not upgraded our production instance yet, and yesterday we got a crash there at the time drop_chunks is scheduled to run. I ran the query you suggested above on a connection to the DB which got the segfault, and got this:

su - postgres
Last login: Tue Jun 23 09:36:46 HKT 2020 on pts/1
-bash-4.2$ psql 
psql (12.3)
Type "help" for help.

postgres=# \c dbname
You are now connected to database "dbname" as user "postgres".
dbname=# SELECT chunk_id
dbname-#      , dimension_slice_id
dbname-#      , constraint_name
dbname-#      , pg_get_expr(conbin,conrelid)
dbname-#   FROM _timescaledb_catalog.chunk_constraint LEFT JOIN _timescaledb_catalog.dimension_slice sl ON dimension_slice_id = sl.id
dbname-#   JOIN pg_constraint ON constraint_name = conname
dbname-#  WHERE sl.id IS NULL;
 chunk_id | dimension_slice_id | constraint_name | pg_get_expr 
----------+--------------------+-----------------+-------------
(0 rows)

dbname=#

Not sure whether that is the expected result or not.

mkindahl commented 4 years ago

@akamensky No, it was not the expected result. Do you know where the crash happened? Just to verify that it is not a different bug.

akamensky commented 4 years ago

Everything looks the same as in all previous crashes, and it happens late at night when no one is around to investigate.

mkindahl commented 4 years ago

The crash is still inside cmp_slices_by_dimension_id?

akamensky commented 4 years ago

Looks like it (we didn't get a core dump for this crash, but dmesg points to the same place). FYI, we are about to upgrade from 1.7.1 to 1.7.2; is this still considered a possible fix for this issue (even with those queries returning 0 rows)?

mkindahl commented 4 years ago

@akamensky It is, unfortunately, hard to tell with the limited information we have. The crash occurs because the dimension slice does not exist; there are not many ways this can happen, and the patch fixes one of those cases. Even if your query does not return any rows, it can still be this issue if dimension slices are removed when not expected (they might be re-added by the insert thread, hiding the problem).

That said, the problem the fix above addresses is real even if the crash you're experiencing is not the same one.

If you upgrade and discover that it still crashes, we will re-open this bug and try to figure out if there are any more locks missing.

akamensky commented 4 years ago

Agreed. In case it doesn't fix the issue, I think it would be better to get debug symbols for the .so, so that we could get a full stack trace from the core dump (which we can obtain on our side).

mkindahl commented 4 years ago

Closing the issue for now. Please reopen if you discover that the issue is not resolved in 1.7.2.

akamensky commented 4 years ago

@mkindahl we are still getting the crash on 1.7.2. We pushed the upgrade through all environments last week; they ran fine for most of the week, but crashed last night with:

[ +38.508593] postmaster[1792]: segfault at 4 ip 00007f52bbb6a659 sp 00007ffe554825a8 error 4 in timescaledb-1.7.2.so[7f52bbb3c000+6a000]

Running the query above:

dbname=# SELECT chunk_id
dbname-#      , dimension_slice_id
dbname-#      , constraint_name
dbname-#      , pg_get_expr(conbin,conrelid)
dbname-#   FROM _timescaledb_catalog.chunk_constraint LEFT JOIN _timescaledb_catalog.dimension_slice sl ON dimension_slice_id = sl.id
dbname-#   JOIN pg_constraint ON constraint_name = conname
dbname-#  WHERE sl.id IS NULL;
 chunk_id | dimension_slice_id |  constraint_name  |                              pg_get_expr                              
----------+--------------------+-------------------+-----------------------------------------------------------------------
   332541 |             128285 | constraint_128285 | (_timescaledb_internal.get_partition_hash("surfaceName") < 536870911)
(1 row)

akamensky commented 4 years ago

One other instance crashed with a different segfault message, which also happens exactly when we call drop_chunks:

[Jul21 23:00] postmaster[12746]: segfault at 7ffcb756fe70 ip 00000000004df08f sp 00007ffcb756fe60 error 6 in postgres[400000+735000]

For this one, the query above returns empty results.

Edit: I enabled core dumps on that host, so we need to wait to see the stack trace for this one.

akamensky commented 4 years ago

The stack trace from the core dump of the first crash is:

Core was generated by `postgres: postgres dbname [local] SELECT          '.
Program terminated with signal 11, Segmentation fault.
#0  0x00007fba0b875659 in cmp_slices_by_dimension_id () from /usr/pgsql-12/lib/timescaledb-1.7.2.so
Missing separate debuginfos, use: debuginfo-install postgresql12-server-12.3-1PGDG.rhel7.x86_64
(gdb) where
#0  0x00007fba0b875659 in cmp_slices_by_dimension_id () from /usr/pgsql-12/lib/timescaledb-1.7.2.so
#1  0x00000000008bd8cd in pg_qsort ()
#2  0x00007fba0b87591a in ts_hypercube_from_constraints () from /usr/pgsql-12/lib/timescaledb-1.7.2.so
#3  0x00007fba0b861f31 in chunk_build_from_tuple_and_stub () from /usr/pgsql-12/lib/timescaledb-1.7.2.so
#4  0x00007fba0b861ff3 in chunk_tuple_found () from /usr/pgsql-12/lib/timescaledb-1.7.2.so
#5  0x00007fba0b885f39 in ts_scanner_scan () from /usr/pgsql-12/lib/timescaledb-1.7.2.so
#6  0x00007fba0b861c73 in chunk_create_from_stub () from /usr/pgsql-12/lib/timescaledb-1.7.2.so
#7  0x00007fba0b861d2f in chunk_scan_context_add_chunk () from /usr/pgsql-12/lib/timescaledb-1.7.2.so
#8  0x00007fba0b861b55 in chunk_scan_ctx_foreach_chunk_stub () from /usr/pgsql-12/lib/timescaledb-1.7.2.so
#9  0x00007fba0b864599 in ts_chunk_get_chunks_in_time_range () from /usr/pgsql-12/lib/timescaledb-1.7.2.so
#10 0x00007fba0b865b4e in ts_chunk_do_drop_chunks () from /usr/pgsql-12/lib/timescaledb-1.7.2.so
#11 0x00007fba0b866069 in ts_chunk_drop_chunks () from /usr/pgsql-12/lib/timescaledb-1.7.2.so
#12 0x000000000061d948 in ExecMakeFunctionResultSet ()
#13 0x000000000063b0c3 in ExecProjectSRF ()
#14 0x000000000063b1c5 in ExecProjectSet ()
#15 0x0000000000614b62 in standard_ExecutorRun ()
#16 0x000000000076366b in PortalRunSelect ()
#17 0x0000000000764a0f in PortalRun ()
#18 0x0000000000760af5 in exec_simple_query ()
#19 0x0000000000761d92 in PostgresMain ()
#20 0x0000000000484022 in ServerLoop ()
#21 0x00000000006f14c3 in PostmasterMain ()
#22 0x0000000000484f23 in main ()
akamensky commented 4 years ago

@mkindahl I am unable to reopen this issue (there is no button for me to reopen; I guess the GitHub repo configuration does not allow non-owners to reopen).

mkindahl commented 4 years ago

@akamensky I'm reopening since we have a stack trace and it crashes in 1.7.2. Strange that you cannot re-open. We should check that.

akamensky commented 4 years ago

@mkindahl thanks. The segfault we see in our prod (mentioned above) was raised as a separate issue, #2143, since the stack trace is very different. But the one in staging still looks very similar to what we had before, and it is on 1.7.2.

mkindahl commented 4 years ago

@mkindahl we are still getting the crash on 1.7.2. We pushed the upgrade through all environments last week; they ran fine for most of the week, but crashed last night with:

[ +38.508593] postmaster[1792]: segfault at 4 ip 00007f52bbb6a659 sp 00007ffe554825a8 error 4 in timescaledb-1.7.2.so[7f52bbb3c000+6a000]

Running the query above:

dbname=# SELECT chunk_id
dbname-#      , dimension_slice_id
dbname-#      , constraint_name
dbname-#      , pg_get_expr(conbin,conrelid)
dbname-#   FROM _timescaledb_catalog.chunk_constraint LEFT JOIN _timescaledb_catalog.dimension_slice sl ON dimension_slice_id = sl.id
dbname-#   JOIN pg_constraint ON constraint_name = conname
dbname-#  WHERE sl.id IS NULL;
 chunk_id | dimension_slice_id |  constraint_name  |                              pg_get_expr                              
----------+--------------------+-------------------+-----------------------------------------------------------------------
   332541 |             128285 | constraint_128285 | (_timescaledb_internal.get_partition_hash("surfaceName") < 536870911)
(1 row)

@akamensky You can get this effect because a previous concurrent execution of INSERT and drop_chunks can leave chunk constraints whose dimension slices have been removed.

You should be able to repair that database using the procedure below, but keep in mind that it has only been tested on some basic cases, so be careful about what you run it against.

-- Recreate missing dimension slices that might be missing due to a
-- bug that is fixed in this release. If the dimension slice table is
-- broken and there are dimension slices missing from the table, we
-- will repair it by:
--
--    1. Finding all chunk constraints that have missing dimension
--       slices and extract the constraint expression from the
--       associated constraint.
--       
--    2. Parse the constraint expression and extract the column name,
--       and upper and lower range values as text.
--       
--    3. Use the column type to construct the range values (UNIX
--       microseconds) from these values.
CREATE PROCEDURE repair_dimension_slice()
LANGUAGE SQL
AS $BODY$
INSERT INTO _timescaledb_catalog.dimension_slice
WITH
   -- All dimension slices that are mentioned in the chunk_constraint
   -- table but are missing from the dimension_slice table.
   missing_slices AS (
      SELECT hypertable_id,
             chunk_id,
         dimension_slice_id,
         constraint_name,
         attname AS column_name,
         pg_get_expr(conbin, conrelid) AS constraint_expr
      FROM _timescaledb_catalog.chunk_constraint cc
      JOIN _timescaledb_catalog.chunk ch ON cc.chunk_id = ch.id
      JOIN pg_constraint ON conname = constraint_name
      JOIN pg_namespace ns ON connamespace = ns.oid AND ns.nspname = ch.schema_name
      JOIN pg_attribute ON attnum = conkey[1] AND attrelid = conrelid
      WHERE
     dimension_slice_id NOT IN (SELECT id FROM _timescaledb_catalog.dimension_slice)
   ),

  -- Unparsed range start and end for each dimension slice id that
  -- is missing.
   unparsed_missing_slices AS (
      SELECT di.id AS dimension_id,
             dimension_slice_id,
             constraint_name,
         column_type,
         column_name,
         (SELECT SUBSTRING(constraint_expr, $$>=\s*'?([\w\d\s:+-]+)'?$$)) AS range_start,
         (SELECT SUBSTRING(constraint_expr, $$<\s*'?([\w\d\s:+-]+)'?$$)) AS range_end
    FROM missing_slices JOIN _timescaledb_catalog.dimension di USING (hypertable_id, column_name)
   )
SELECT DISTINCT
       dimension_slice_id,
       dimension_id,
       CASE
       WHEN column_type = 'timestamptz'::regtype THEN
            _timescaledb_internal.time_to_internal(range_start::timestamptz)
       WHEN column_type = 'timestamp'::regtype THEN
            _timescaledb_internal.time_to_internal(range_start::timestamp)
       WHEN column_type = 'date'::regtype THEN
            _timescaledb_internal.time_to_internal(range_start::date)
       ELSE
            CASE
        WHEN range_start IS NULL
        THEN -9223372036854775808
        ELSE _timescaledb_internal.time_to_internal(range_start::bigint)
        END
       END AS range_start,
       CASE 
       WHEN column_type = 'timestamptz'::regtype THEN
            _timescaledb_internal.time_to_internal(range_end::timestamptz)
       WHEN column_type = 'timestamp'::regtype THEN
            _timescaledb_internal.time_to_internal(range_end::timestamp)
       WHEN column_type = 'date'::regtype THEN
            _timescaledb_internal.time_to_internal(range_end::date)
       ELSE
            CASE WHEN range_end IS NULL
        THEN 9223372036854775807
        ELSE _timescaledb_internal.time_to_internal(range_end::bigint)
        END
       END AS range_end
  FROM unparsed_missing_slices;
$BODY$;

Update: I have made some changes to the repair script above so that it tries to convert to bigint by default and only handles timestamps differently.
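
To apply and verify the repair, one approach (a sketch; the procedure performs no transaction control, so it can run inside an explicit transaction on PostgreSQL 12) is:

BEGIN;
CALL repair_dimension_slice();
-- Re-run the detection query; it should now return no rows.
SELECT chunk_id, dimension_slice_id, constraint_name
  FROM _timescaledb_catalog.chunk_constraint
       LEFT JOIN _timescaledb_catalog.dimension_slice sl
       ON dimension_slice_id = sl.id
 WHERE sl.id IS NULL;
COMMIT;  -- or ROLLBACK; if the inserted ranges look wrong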

akamensky commented 4 years ago

ahh, thanks, let me try that out tomorrow.

akamensky commented 4 years ago

@mkindahl Does not seem to work:

dbname=# call repair_dimension_slice();
ERROR:  null value in column "range_start" violates not-null constraint
DETAIL:  Failing row contains (21303, 26, null, null).
CONTEXT:  SQL function "repair_dimension_slice" statement 1

Edited the script above so that it tries to convert to bigint by default. That is likely to avoid the bad NULL.

akamensky commented 4 years ago

@mkindahl we've rebuilt the database in our staging environment to make sure any leftover missing relations are gone. Within 24 hours we again got missing relations:

dbname=# SELECT chunk_id
     , dimension_slice_id
     , constraint_name
     , pg_get_expr(conbin,conrelid)
  FROM _timescaledb_catalog.chunk_constraint LEFT JOIN _timescaledb_catalog.dimension_slice sl ON dimension_slice_id = sl.id
  JOIN pg_constraint ON constraint_name = conname
 WHERE sl.id IS NULL;
 chunk_id | dimension_slice_id | constraint_name |                                                                      pg_get_expr                                                                       
----------+--------------------+-----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------
     7647 |               3001 | constraint_3001 | (_timescaledb_internal.get_partition_hash("col_name") < 536870911)
     7651 |               2999 | constraint_2999 | ((_timescaledb_internal.get_partition_hash("col_name") >= 1073741822) AND (_timescaledb_internal.get_partition_hash("col_name") < 1610612733))
     7653 |               3008 | constraint_3008 | (_timescaledb_internal.get_partition_hash("col_name") >= 1610612733)
     7657 |               3004 | constraint_3004 | ((_timescaledb_internal.get_partition_hash("col_name") >= 536870911) AND (_timescaledb_internal.get_partition_hash("col_name") < 1073741822))
     7649 |               3002 | constraint_3002 | (_timescaledb_internal.get_partition_hash("col_name") < 536870911)
     7650 |               2997 | constraint_2997 | ((_timescaledb_internal.get_partition_hash("col_name") >= 1073741822) AND (_timescaledb_internal.get_partition_hash("col_name") < 1610612733))
     7654 |               3007 | constraint_3007 | (_timescaledb_internal.get_partition_hash("col_name") >= 1610612733)
     7656 |               3005 | constraint_3005 | ((_timescaledb_internal.get_partition_hash("col_name") >= 536870911) AND (_timescaledb_internal.get_partition_hash("col_name") < 1073741822))
(8 rows)

The above is with 1.7.2, which means the issue still exists there.

akamensky commented 4 years ago

@mkindahl we also get another crash when calling TRUNCATE or DROP TABLE on a table that seems to have those missing relations:

gdb /usr/pgsql-12/bin/postmaster /data/dump/core.postmaster.1597213605.1616
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-110.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/pgsql-12/bin/postgres...Reading symbols from /usr/pgsql-12/bin/postgres...(no debugging symbols found)...done.
(no debugging symbols found)...done.

warning: core file may not match specified executable file.
[New LWP 1616]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `postgres: postgres dbname [local] TRUNCATE TABLE  '.
Program terminated with signal 11, Segmentation fault.
#0  0x00007fa4c07506b3 in chunk_delete () from /usr/pgsql-12/lib/timescaledb-1.7.2.so
Missing separate debuginfos, use: debuginfo-install postgresql12-server-12.3-1PGDG.rhel7.x86_64
(gdb) where
#0  0x00007fa4c07506b3 in chunk_delete () from /usr/pgsql-12/lib/timescaledb-1.7.2.so
#1  0x00007fa4c0750a2c in ts_chunk_delete_by_hypertable_id () from /usr/pgsql-12/lib/timescaledb-1.7.2.so
#2  0x00007fa4c07702b4 in process_truncate () from /usr/pgsql-12/lib/timescaledb-1.7.2.so
#3  0x00007fa4c076f862 in timescaledb_ddl_command_start () from /usr/pgsql-12/lib/timescaledb-1.7.2.so
#4  0x00000000007632d6 in PortalRunUtility ()
#5  0x0000000000763d27 in PortalRunMulti ()
#6  0x0000000000764905 in PortalRun ()
#7  0x0000000000760af5 in exec_simple_query ()
#8  0x0000000000761d92 in PostgresMain ()
#9  0x0000000000484022 in ServerLoop ()
#10 0x00000000006f14c3 in PostmasterMain ()
#11 0x0000000000484f23 in main ()
(gdb)

Not sure if this is related, please advise.

mkindahl commented 4 years ago

@akamensky It seems related. I suspect that there is a race between drop_chunks calls as well. Do I understand correctly that you are running drop_chunks in parallel with TRUNCATE TABLE and/or DROP TABLE on the hypertable that you're running drop_chunks on?

mkindahl commented 4 years ago

@mkindahl Does not seem to work:

dbname=# call repair_dimension_slice();
ERROR:  null value in column "range_start" violates not-null constraint
DETAIL:  Failing row contains (21303, 26, null, null).
CONTEXT:  SQL function "repair_dimension_slice" statement 1

Yeah, range_start should not be NULL, so it is not surprising that it fails. Could you add the result of running the missing_slices query above?

mkindahl commented 4 years ago

@mkindahl we also get another crash when calling TRUNCATE or DROP TABLE on a table that seems to have those missing relations:

gdb /usr/pgsql-12/bin/postmaster /data/dump/core.postmaster.1597213605.1616
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-110.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/pgsql-12/bin/postgres...Reading symbols from /usr/pgsql-12/bin/postgres...(no debugging symbols found)...done.
(no debugging symbols found)...done.

warning: core file may not match specified executable file.
[New LWP 1616]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `postgres: postgres dbname [local] TRUNCATE TABLE  '.
Program terminated with signal 11, Segmentation fault.
#0  0x00007fa4c07506b3 in chunk_delete () from /usr/pgsql-12/lib/timescaledb-1.7.2.so
Missing separate debuginfos, use: debuginfo-install postgresql12-server-12.3-1PGDG.rhel7.x86_64
(gdb) where
#0  0x00007fa4c07506b3 in chunk_delete () from /usr/pgsql-12/lib/timescaledb-1.7.2.so
#1  0x00007fa4c0750a2c in ts_chunk_delete_by_hypertable_id () from /usr/pgsql-12/lib/timescaledb-1.7.2.so
#2  0x00007fa4c07702b4 in process_truncate () from /usr/pgsql-12/lib/timescaledb-1.7.2.so
#3  0x00007fa4c076f862 in timescaledb_ddl_command_start () from /usr/pgsql-12/lib/timescaledb-1.7.2.so
#4  0x00000000007632d6 in PortalRunUtility ()
#5  0x0000000000763d27 in PortalRunMulti ()
#6  0x0000000000764905 in PortalRun ()
#7  0x0000000000760af5 in exec_simple_query ()
#8  0x0000000000761d92 in PostgresMain ()
#9  0x0000000000484022 in ServerLoop ()
#10 0x00000000006f14c3 in PostmasterMain ()
#11 0x0000000000484f23 in main ()
(gdb)

Not sure if this is related, please advise.

Looking closer at this one, it seems to be a consequence of the missing dimension slice rather than a separate bug. I can easily reproduce it by manually dropping the dimension slices, and it is obviously triggered because the dimension slice cannot be found.

(gdb) l
2655                                            .lockmode = LockTupleExclusive,
2656                                            .waitpolicy = LockWaitBlock,
2657                                    };
2658                                    DimensionSlice *slice =
2659                                            ts_dimension_slice_scan_by_id_and_lock(cc->fd.dimension_slice_id,
2660                                                                                                                       &tuplock,
2661                                                                                                                       CurrentMemoryContext);
2662                                    if (ts_chunk_constraint_scan_by_dimension_slice_id(slice->fd.id,
2663                                                                                                                                       NULL,
2664                                                                                                                                       CurrentMemoryContext) == 0)
(gdb) p slice
$2 = (DimensionSlice *) 0x0

akamensky commented 4 years ago

@mkindahl On a completely clean rebuild of Timescale using 1.7.2, the issue did not happen for a few days at first, but then came back with the same stack trace as above. There were no missing slices initially (it was a completely new rebuild). Once the error returned, there are:

db_name=# SELECT hypertable_id,
db_name-#            chunk_id,
db_name-#      dimension_slice_id,
db_name-#      constraint_name,
db_name-#      attname AS column_name,
db_name-#      pg_get_expr(conbin, conrelid) AS constraint_expr
db_name-#       FROM _timescaledb_catalog.chunk_constraint cc
db_name-#       JOIN _timescaledb_catalog.chunk ch ON cc.chunk_id = ch.id
db_name-#       JOIN pg_constraint ON conname = constraint_name
db_name-#       JOIN pg_namespace ns ON connamespace = ns.oid AND ns.nspname = ch.schema_name
db_name-#       JOIN pg_attribute ON attnum = conkey[1] AND attrelid = conrelid
db_name-#       WHERE
db_name-#  dimension_slice_id NOT IN (SELECT id FROM _timescaledb_catalog.dimension_slice);
 hypertable_id | chunk_id | dimension_slice_id | constraint_name | column_name |                                                            constraint_expr
---------------+----------+--------------------+-----------------+-------------+-------------------------------------------------------------------------------------------------------------------------------------------------
             3 |     9103 |               3186 | constraint_3186 | col_name    | (_timescaledb_internal.get_partition_hash("col_name") < 536870911)
             3 |     9104 |               3183 | constraint_3183 | col_name    | (_timescaledb_internal.get_partition_hash("col_name") >= 1610612733)
             7 |     9105 |               3185 | constraint_3185 | col_name    | (_timescaledb_internal.get_partition_hash("col_name") < 536870911)
             3 |     9106 |               3182 | constraint_3182 | col_name    | ((_timescaledb_internal.get_partition_hash("col_name") >= 1073741822) AND (_timescaledb_internal.get_partition_hash("col_name") < 1610612733))
             7 |     9107 |               3184 | constraint_3184 | col_name    | (_timescaledb_internal.get_partition_hash("col_name") >= 1610612733)
             3 |     9108 |               3188 | constraint_3188 | col_name    | ((_timescaledb_internal.get_partition_hash("col_name") >= 536870911) AND (_timescaledb_internal.get_partition_hash("col_name") < 1073741822))
             7 |     9109 |               3180 | constraint_3180 | col_name    | ((_timescaledb_internal.get_partition_hash("col_name") >= 1073741822) AND (_timescaledb_internal.get_partition_hash("col_name") < 1610612733))
             7 |     9110 |               3187 | constraint_3187 | col_name    | ((_timescaledb_internal.get_partition_hash("col_name") >= 536870911) AND (_timescaledb_internal.get_partition_hash("col_name") < 1073741822))
(8 rows)

db_name=#

It does look like the issue is still there.

Do I understand correctly that you are running drop_chunks in parallel with TRUNCATE TABLE and/or DROP TABLE on the hypertable that you're running drop_chunks on?

No, we run truncate/drop on the table outside of running drop_chunks, but at a time when the error is present (and thus relations are missing); the crash would of course happen then, because truncate or drop will attempt to delete those relations.

mkindahl commented 4 years ago

@mkindahl On a completely clean rebuild of Timescale using 1.7.2, the issue did not happen for a few days at first, but then came back with the same stack trace as above. There were no missing slices initially (it was a completely new rebuild). Once the error returned, there are:

db_name=# SELECT hypertable_id,
db_name-#            chunk_id,
db_name-#      dimension_slice_id,
db_name-#      constraint_name,
db_name-#      attname AS column_name,
db_name-#      pg_get_expr(conbin, conrelid) AS constraint_expr
db_name-#       FROM _timescaledb_catalog.chunk_constraint cc
db_name-#       JOIN _timescaledb_catalog.chunk ch ON cc.chunk_id = ch.id
db_name-#       JOIN pg_constraint ON conname = constraint_name
db_name-#       JOIN pg_namespace ns ON connamespace = ns.oid AND ns.nspname = ch.schema_name
db_name-#       JOIN pg_attribute ON attnum = conkey[1] AND attrelid = conrelid
db_name-#       WHERE
db_name-#  dimension_slice_id NOT IN (SELECT id FROM _timescaledb_catalog.dimension_slice);
 hypertable_id | chunk_id | dimension_slice_id | constraint_name | column_name |                                                            constraint_expr
---------------+----------+--------------------+-----------------+-------------+-------------------------------------------------------------------------------------------------------------------------------------------------
             3 |     9103 |               3186 | constraint_3186 | col_name    | (_timescaledb_internal.get_partition_hash("col_name") < 536870911)
             3 |     9104 |               3183 | constraint_3183 | col_name    | (_timescaledb_internal.get_partition_hash("col_name") >= 1610612733)
             7 |     9105 |               3185 | constraint_3185 | col_name    | (_timescaledb_internal.get_partition_hash("col_name") < 536870911)
             3 |     9106 |               3182 | constraint_3182 | col_name    | ((_timescaledb_internal.get_partition_hash("col_name") >= 1073741822) AND (_timescaledb_internal.get_partition_hash("col_name") < 1610612733))
             7 |     9107 |               3184 | constraint_3184 | col_name    | (_timescaledb_internal.get_partition_hash("col_name") >= 1610612733)
             3 |     9108 |               3188 | constraint_3188 | col_name    | ((_timescaledb_internal.get_partition_hash("col_name") >= 536870911) AND (_timescaledb_internal.get_partition_hash("col_name") < 1073741822))
             7 |     9109 |               3180 | constraint_3180 | col_name    | ((_timescaledb_internal.get_partition_hash("col_name") >= 1073741822) AND (_timescaledb_internal.get_partition_hash("col_name") < 1610612733))
             7 |     9110 |               3187 | constraint_3187 | col_name    | ((_timescaledb_internal.get_partition_hash("col_name") >= 536870911) AND (_timescaledb_internal.get_partition_hash("col_name") < 1073741822))
(8 rows)

db_name=#

It does look like the issue is still there.

And this is with concurrent INSERT being run? Not with concurrent drop_chunks calls?

akamensky commented 4 years ago

And this is with concurrent INSERT being run? Not with concurrent drop_chunks calls?

Yes. Not long ago we changed from running drop_chunks on multiple databases (one call per database) in parallel to running them sequentially (each call waits until the previous drop_chunks succeeds or fails before the next one starts, with a delay). That did not fix the issue. Concurrent INSERTs are present in both cases, so a single drop_chunks plus a high rate of parallel INSERTs appears to be the common denominator here.

UPD: internally we concluded that slow underlying storage seems to increase the likelihood of this issue happening. Our staging environment is on relatively slow disks, and the issue reappears there much faster than in other environments that use much faster storage.

mkindahl commented 4 years ago

And this is with concurrent INSERT being run? Not with concurrent drop_chunks calls?

Yes. Not long ago we changed from running drop_chunks on multiple databases (one call per database) in parallel to running them sequentially (each call waits until the previous drop_chunks succeeds or fails before the next one starts, with a delay). That did not fix the issue. Concurrent INSERTs are present in both cases, so a single drop_chunks plus a high rate of parallel INSERTs appears to be the common denominator here.

UPD: internally we concluded that slow underlying storage seems to increase the likelihood of this issue happening. Our staging environment is on relatively slow disks, and the issue reappears there much faster than in other environments that use much faster storage.

This is then likely still a missing lock, and it is true that slow underlying storage would increase the likelihood of a race condition caused by a missing lock. We have fixed the function that reads hypercube information from the dimension slice table and ensured that it takes tuple locks, so hopefully we have covered all the cases now. The fix should be available in 1.7.3 (which we're in the process of releasing).
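
In SQL terms, the effect of the added tuple lock is roughly the following (an analogy only, not the actual C implementation; the slice id is hypothetical, and the C code takes a stronger tuple lock than FOR SHARE):

BEGIN;
-- Holding a row-level lock on the slice blocks a concurrent delete of that
-- row until this transaction ends, so the slice cannot vanish mid-lookup.
SELECT * FROM _timescaledb_catalog.dimension_slice WHERE id = 3186 FOR SHARE;
-- ... build the hypercube from the slice ...
COMMIT;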

If you can post a message when you've tested this, regardless of whether it works or not, we would be very grateful. I'm keeping the bug closed for now, but we will re-open it if it turns out that there are still lingering issues.

akamensky commented 4 years ago

Thanks @mkindahl, I've seen the release. We are going to upgrade the staging and UAT environments to 1.7.3 today and will observe them for the next week. Previously it took a week or two for the error to come back, so please allow some time. Once we have observed for long enough, I will confirm here. Thanks.

akamensky commented 4 years ago

@mkindahl Just upgraded to 1.7.3; we did not rebuild the data (we assumed the segfault itself would be fixed in this version). But still no luck:

[544437.027061] postmaster[25376]: segfault at 4 ip 00007f95f577eb69 sp 00007ffe6c4f6c28 error 4 in timescaledb-1.7.3.so[7f95f5750000+6a000]
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-110.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/pgsql-12/bin/postgres...Reading symbols from /usr/pgsql-12/bin/postgres...(no debugging symbols found)...done.
(no debugging symbols found)...done.

warning: core file may not match specified executable file.
[New LWP 25376]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `postgres: postgres db_name [local] SELECT          '.
Program terminated with signal 11, Segmentation fault.
#0  0x00007f95f577eb69 in cmp_slices_by_dimension_id () from /usr/pgsql-12/lib/timescaledb-1.7.3.so
Missing separate debuginfos, use: debuginfo-install postgresql12-server-12.3-1PGDG.rhel7.x86_64
(gdb) where
#0  0x00007f95f577eb69 in cmp_slices_by_dimension_id () from /usr/pgsql-12/lib/timescaledb-1.7.3.so
#1  0x00000000008bd8cd in pg_qsort ()
#2  0x00007f95f577ee52 in ts_hypercube_from_constraints () from /usr/pgsql-12/lib/timescaledb-1.7.3.so
#3  0x00007f95f576b041 in chunk_build_from_tuple_and_stub () from /usr/pgsql-12/lib/timescaledb-1.7.3.so
#4  0x00007f95f576b103 in chunk_tuple_found () from /usr/pgsql-12/lib/timescaledb-1.7.3.so
#5  0x00007f95f578f6c9 in ts_scanner_scan () from /usr/pgsql-12/lib/timescaledb-1.7.3.so
#6  0x00007f95f576ad83 in chunk_create_from_stub () from /usr/pgsql-12/lib/timescaledb-1.7.3.so
#7  0x00007f95f576ae3f in chunk_scan_context_add_chunk () from /usr/pgsql-12/lib/timescaledb-1.7.3.so
#8  0x00007f95f576ac65 in chunk_scan_ctx_foreach_chunk_stub () from /usr/pgsql-12/lib/timescaledb-1.7.3.so
#9  0x00007f95f576d4f9 in ts_chunk_get_chunks_in_time_range () from /usr/pgsql-12/lib/timescaledb-1.7.3.so
#10 0x00007f95f576ed1e in ts_chunk_do_drop_chunks () from /usr/pgsql-12/lib/timescaledb-1.7.3.so
#11 0x00007f95f576f559 in ts_chunk_drop_chunks () from /usr/pgsql-12/lib/timescaledb-1.7.3.so
#12 0x000000000061d948 in ExecMakeFunctionResultSet ()
#13 0x000000000063b0c3 in ExecProjectSRF ()
#14 0x000000000063b1c5 in ExecProjectSet ()
#15 0x0000000000614b62 in standard_ExecutorRun ()
#16 0x000000000076366b in PortalRunSelect ()
#17 0x0000000000764a0f in PortalRun ()
#18 0x0000000000760af5 in exec_simple_query ()
#19 0x0000000000761d92 in PostgresMain ()
#20 0x0000000000484022 in ServerLoop ()
#21 0x00000000006f14c3 in PostmasterMain ()
#22 0x0000000000484f23 in main ()

That is with the missing slices caused by 1.7.2 still present. We will rebuild the data and see whether slices still go missing.

akamensky commented 4 years ago

With rebuilt data we have not gotten any missing relations or segfaults yet, but I see the following errors in the logs:

2020-09-01 13:00:01.553 TZ [8705] ERROR:  query returned no rows
2020-09-01 13:00:01.553 TZ [8705] CONTEXT:  PL/pgSQL function _timescaledb_internal.dimension_slice_get_constraint_sql(integer) line 9 at SQL statement
    PL/pgSQL function _timescaledb_internal.chunk_constraint_add_table_constraint(_timescaledb_catalog.chunk_constraint) line 15 at assignment
....
< followed by a few INSERT queries into hypertables >

and

2020-09-01 09:00:01.706 TZ [6160] ERROR:  could not open relation with OID 2912420
...
< followed by insert statement into one of hypertables >

also

2020-09-01 16:00:03.501 TZ [45626] ERROR:  deadlock detected
2020-09-01 16:00:03.501 TZ [45626] DETAIL:  Process 45626 waits for RowExclusiveLock on relation 2918460 of database 2887017; blocked by process 48154.
    Process 48154 waits for AccessExclusiveLock on relation 2918478 of database 2887017; blocked by process 45626.
...
< followed by insert statement into one of hypertables >

These are all at the times when drop_chunks is being called.

akamensky commented 4 years ago

@mkindahl 1.7.3 did not resolve the issue.

After a few days with no segfault, we got it again:

[Sep 3 14:59] postmaster[129640]: segfault at 4 ip 00007f95f577eb69 sp 00007ffe6c4f6c28 error 4 in timescaledb-1.7.3.so[7f95f5750000+6a000]

Stack trace from core dump of this segfault:

gdb /usr/pgsql-12/bin/postmaster /data/dump/core.postmaster.1599116401.129640 
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-110.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/pgsql-12/bin/postgres...Reading symbols from /usr/pgsql-12/bin/postgres...(no debugging symbols found)...done.
(no debugging symbols found)...done.

warning: core file may not match specified executable file.
[New LWP 129640]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `postgres: postgres db_name [local] SELECT          '.
Program terminated with signal 11, Segmentation fault.
#0  0x00007f95f577eb69 in cmp_slices_by_dimension_id () from /usr/pgsql-12/lib/timescaledb-1.7.3.so
Missing separate debuginfos, use: debuginfo-install postgresql12-server-12.3-1PGDG.rhel7.x86_64
(gdb) where
#0  0x00007f95f577eb69 in cmp_slices_by_dimension_id () from /usr/pgsql-12/lib/timescaledb-1.7.3.so
#1  0x00000000008bd8cd in pg_qsort ()
#2  0x00007f95f577ee52 in ts_hypercube_from_constraints () from /usr/pgsql-12/lib/timescaledb-1.7.3.so
#3  0x00007f95f576b041 in chunk_build_from_tuple_and_stub () from /usr/pgsql-12/lib/timescaledb-1.7.3.so
#4  0x00007f95f576b103 in chunk_tuple_found () from /usr/pgsql-12/lib/timescaledb-1.7.3.so
#5  0x00007f95f578f6c9 in ts_scanner_scan () from /usr/pgsql-12/lib/timescaledb-1.7.3.so
#6  0x00007f95f576ad83 in chunk_create_from_stub () from /usr/pgsql-12/lib/timescaledb-1.7.3.so
#7  0x00007f95f576ae3f in chunk_scan_context_add_chunk () from /usr/pgsql-12/lib/timescaledb-1.7.3.so
#8  0x00007f95f576ac65 in chunk_scan_ctx_foreach_chunk_stub () from /usr/pgsql-12/lib/timescaledb-1.7.3.so
#9  0x00007f95f576d4f9 in ts_chunk_get_chunks_in_time_range () from /usr/pgsql-12/lib/timescaledb-1.7.3.so
#10 0x00007f95f576ed1e in ts_chunk_do_drop_chunks () from /usr/pgsql-12/lib/timescaledb-1.7.3.so
#11 0x00007f95f576f559 in ts_chunk_drop_chunks () from /usr/pgsql-12/lib/timescaledb-1.7.3.so
#12 0x000000000061d948 in ExecMakeFunctionResultSet ()
#13 0x000000000063b0c3 in ExecProjectSRF ()
#14 0x000000000063b1c5 in ExecProjectSet ()
#15 0x0000000000614b62 in standard_ExecutorRun ()
#16 0x000000000076366b in PortalRunSelect ()
#17 0x0000000000764a0f in PortalRun ()
#18 0x0000000000760af5 in exec_simple_query ()
#19 0x0000000000761d92 in PostgresMain ()
#20 0x0000000000484022 in ServerLoop ()
#21 0x00000000006f14c3 in PostmasterMain ()
#22 0x0000000000484f23 in main ()
(gdb)

From the log:

2020-09-03 15:00:01.931 HKT [129640] LOG:  connection authorized: user=postgres database=db_name application_name=psql
2020-09-03 15:00:01.966 HKT [22475] LOG:  server process (PID 129640) was terminated by signal 11: Segmentation fault
2020-09-03 15:00:01.966 HKT [22475] DETAIL:  Failed process was running: SELECT drop_chunks(interval '1 hours');
2020-09-03 15:00:01.966 HKT [22475] LOG:  terminating any other active server processes

From psql:

postgres=# \c db_name
You are now connected to database "db_name" as user "postgres".
db_name=# \dx
                                      List of installed extensions
    Name     | Version |   Schema   |                            Description                            
-------------+---------+------------+-------------------------------------------------------------------
 plpgsql     | 1.0     | pg_catalog | PL/pgSQL procedural language
 timescaledb | 1.7.3   | public     | Enables scalable inserts and complex queries for time-series data
(2 rows)

db_name=# SELECT chunk_id
db_name-#      , dimension_slice_id
db_name-#      , constraint_name
db_name-#      , pg_get_expr(conbin,conrelid)
db_name-#   FROM _timescaledb_catalog.chunk_constraint LEFT JOIN _timescaledb_catalog.dimension_slice sl ON dimension_slice_id = sl.id
db_name-#   JOIN pg_constraint ON constraint_name = conname
db_name-#  WHERE sl.id IS NULL;
 chunk_id | dimension_slice_id | constraint_name |                                                                      pg_get_expr                                                                       
----------+--------------------+-----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------
     7857 |               3695 | constraint_3695 | (_timescaledb_internal.get_partition_hash("col_name") < 536870911)
     7861 |               3699 | constraint_3699 | (_timescaledb_internal.get_partition_hash("col_name") >= 1610612733)
     7867 |               3690 | constraint_3690 | ((_timescaledb_internal.get_partition_hash("col_name") >= 1073741822) AND (_timescaledb_internal.get_partition_hash("col_name") < 1610612733))
     7870 |               3686 | constraint_3686 | ((_timescaledb_internal.get_partition_hash("col_name") >= 536870911) AND (_timescaledb_internal.get_partition_hash("col_name") < 1073741822))
(4 rows)

db_name=#

mkindahl commented 4 years ago

@akamensky Thanks for the information. Reopening.