timescale / timescaledb

An open-source time-series SQL database optimized for fast ingest and complex queries. Packaged as a PostgreSQL extension.
https://www.timescale.com/
Other
17.49k stars 878 forks source link

[Bug]: INSERT INTO table [...] WHERE NOT EXISTS (SELECT 1 FROM table WHERE some_condition IS TRUE) casuses segfault #6594

Open kimec opened 7 months ago

kimec commented 7 months ago

What type of bug is this?

Crash

What subsystems and features are affected?

Query executor

What happened?

INSERT INTO table [...] WHERE NOT EXISTS (SELECT 1 FROM table WHERE some_condition IS TRUE) causes instant segfault for a partition of certain size (likely fully occupied). The affected table is chunked on scalar BIGSRIAL with chunk_time_interval => 25000000

TimescaleDB version affected

2.13.1

PostgreSQL version used

15.5

What operating system did you use?

CentOS 7.9.2009

What installation method did you use?

RPM

What platform did you run on?

On prem/Self-hosted

Relevant log output and stack trace

#0  pg_detoast_datum_packed (datum=datum@entry=0xfffffffffffffffd) at fmgr.c:1742
#1  0x00000000008daef2 in text_to_cstring (t=0xfffffffffffffffd) at varlena.c:224
#2  0x000000000090b890 in FunctionCall1Coll (flinfo=0x7ffcfc56ee30, collation=<optimized out>, arg1=<optimized out>) at fmgr.c:1124
#3  0x000000000090c8b6 in OutputFunctionCall (val=18446744073709551613, flinfo=0x7ffcfc56ee30) at fmgr.c:1561
#4  OidOutputFunctionCall (functionId=<optimized out>, val=18446744073709551613) at fmgr.c:1644
#5  0x000000000065c97c in ExecBuildSlotValueDescription (reloid=26937, slot=slot@entry=0x2f05a58, tupdesc=tupdesc@entry=0x7fbc57f04738, modifiedCols=0x2f05e78, maxfieldlen=64) at execMain.c:2294
#6  0x000000000065e83d in ExecConstraints (resultRelInfo=resultRelInfo@entry=0x2ea5728, slot=slot@entry=0x2f05a58, estate=estate@entry=0x2e46878) at execMain.c:1962
#7  0x00007fbc587221f8 in ExecInsert (context=context@entry=0x7ffcfc56f170, resultRelInfo=0x2ea5728, slot=0x2f05a58, canSetTag=<optimized out>)
    at /home/builder/rpmbuild/BUILD/timescaledb/src/nodes/hypertable_modify.c:1722
#8  0x00007fbc58722da5 in ExecModifyTable (pstate=0x2e47c08, cs_node=0x2e477e8) at /home/builder/rpmbuild/BUILD/timescaledb/src/nodes/hypertable_modify.c:1021
#9  hypertable_modify_exec (node=0x2e477e8) at /home/builder/rpmbuild/BUILD/timescaledb/src/nodes/hypertable_modify.c:182
#10 0x000000000065cb52 in ExecProcNode (node=0x2e477e8) at ../../../src/include/executor/executor.h:262
#11 ExecutePlan (execute_once=<optimized out>, dest=0x7fbccc7f6540, direction=<optimized out>, numberTuples=0, sendTuples=false, operation=CMD_INSERT, use_parallel_mode=<optimized out>,
    planstate=0x2e477e8, estate=0x2e46878) at execMain.c:1636
#12 standard_ExecutorRun (queryDesc=0x2e48888, direction=<optimized out>, count=0, execute_once=<optimized out>) at execMain.c:363
#13 0x00007fbcbd75467d in pgss_ExecutorRun (queryDesc=0x2e48888, direction=ForwardScanDirection, count=0, execute_once=<optimized out>) at pg_stat_statements.c:1010
#14 0x00000000007d1b0a in ProcessQuery (plan=<optimized out>,
    sourceText=0x2cc62e8 "INSERT INTO table (column1, column2, column3, column4, column5, column6, column7, column8, column9, column10, column11, column12, column13, column14, column15, column16"..., params=0x0, queryEnv=0x0, dest=0x7fbccc7f6540, qc=0x7ffcfc56f5e0) at pquery.c:160
#15 0x00000000007d2595 in PortalRunMulti (portal=portal@entry=0x2d4ac08, isTopLevel=isTopLevel@entry=true, setHoldSnapshot=setHoldSnapshot@entry=false, dest=dest@entry=0x7fbccc7f6540,
    altdest=altdest@entry=0x7fbccc7f6540, qc=qc@entry=0x7ffcfc56f5e0) at pquery.c:1277
#16 0x00000000007d291c in PortalRun (portal=<optimized out>, count=9223372036854775807, isTopLevel=<optimized out>, run_once=<optimized out>, dest=0x7fbccc7f6540, altdest=0x7fbccc7f6540,
    qc=0x7ffcfc56f5e0) at pquery.c:791
#17 0x00000000007ce877 in exec_simple_query (
    query_string=0x2cc62e8 "INSERT INTO table (column1, column2, column3, column4, column5, column6, column7, column8, column9, column10, column11, column12, column13, column14, column15, column16"...) at postgres.c:1250
#18 0x00000000007cee78 in PostgresMain (dbname=<optimized out>, username=<optimized out>) at postgres.c:4598
#19 0x0000000000493a07 in BackendRun (port=<optimized out>, port=<optimized out>) at postmaster.c:4514
#20 BackendStartup (port=0x2ced7b0) at postmaster.c:4242
#21 ServerLoop () at postmaster.c:1809
#22 0x000000000074cc0d in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x2cc0d10) at postmaster.c:1481
#23 0x0000000000494876 in main (argc=3, argv=0x2cc0d10) at main.c:202

How can we reproduce the bug?

We can replicate the segfault on two separate servers with identical configuration every time, however, the crash is dependent on partition occupancy. It does not occur if a partition is not fully occupied. We haven't examined the exact occupancy threshold needed to trigger the segfault.

fabriziomello commented 7 months ago

@kimec thanks for reaching out, can u send a complete reproducible SQL script including the complete INSERT statement???

kimec commented 7 months ago

Hello @fabriziomello,

The DDL for the table is

CREATE TABLE public.test_table
(
    id       bigserial NOT NULL,
    id2      int8      NOT NULL,
    id3      text      NOT NULL,
    column1  text      NOT NULL,
    column2  text      NOT NULL,
    column3  numeric   NOT NULL,
    column4  timestamp NOT NULL,
    column5  int8      NOT NULL,
    column6  int8      NOT NULL DEFAULT 0,
    column7  text      NULL,
    column8  text      NULL,
    column9  bool      NOT NULL DEFAULT FALSE,
    column10 text      NOT NULL DEFAULT '-'::text,
    column11 text      NOT NULL DEFAULT '-'::text,
    column12 text      NOT NULL DEFAULT '-'::text,
    column13 bool      NOT NULL DEFAULT FALSE,
    column14 timestamp NOT NULL DEFAULT timezone('UTC'::text, CURRENT_TIMESTAMP),
    column15 numeric   NULL,
    column16 numeric   NULL,
    column17 text      NULL,
    column18 text      NULL,
    column19 text      NULL,
    column20 text      NULL,
    column21 text      NULL,
    column22 numeric   NULL,
    column23 numeric   NULL,
    column24 int4      NULL,
    column25 int4      NULL,
    column26 text      NULL,
    column27 bool      NULL,
    column28 bool      NULL,
    CONSTRAINT test_table_pkey PRIMARY KEY (id)
);
CREATE INDEX idx_column17 ON public.test_table USING btree (column17);
CREATE INDEX test_table_history_idx ON public.test_table USING btree (column5, id DESC) WHERE (NOT ((column3 = (0)::numeric) OR (column1 = 'XXXXX'::text)));
CREATE INDEX test_table_id2_idx ON public.test_table USING btree (id2);
CREATE INDEX test_table_column5 ON public.test_table USING btree (column5);
CREATE INDEX test_table_id3 ON public.test_table USING btree (id3);

SELECT create_hypertable('test_table', 'id', chunk_time_interval => 25000000);

INSERT statement example is

INSERT INTO test_table (id2, id3, column1, column2, column21, column3, column4, column5, column17, column6,
                        column10, column11, column12, column15, column16, column19, column20, column22, column23,
                        column24, column25, column18, column7, column27, column28)
SELECT
    123,
    'c7aa52f76c522c5137b1927348bb3271',
    'SENTINEL1',
    'aa',
    'aa',
    0,
    '2024-01-17 10:24:02',
    15294609,
    'SENTINEL2',
    -3,
    'bb',
    'cc',
    'dd',
    0.0,
    0.0,
    'ee',
    'ee',
    0,
    0,
    0,
    0,
    NULL,
    NULL,
    'FALSE',
    'TRUE'
WHERE NOT EXISTS (SELECT 1 FROM test_table WHERE column17 = 'SENTINEL2' AND column1 = 'SENTINEL1' AND id >= 1);
kimec commented 7 months ago

I will try to come up with some generator script next week. We did our testing by dumping data from the affected system and restoring it on a separate test system. Unfortunately, the dumped dataset is quite large and contains sensitive data.

fabriziomello commented 7 months ago

@kimec did some tests with your schema and I was not able to reproduce the problem. One potential workaround can be create an unique index and use INSERT ON CONFLICT DO NOTHING.

But anyway would be nice you can end up with a reproducible test case so we can provide a fix.

kimec commented 6 months ago

We tried to create a reproducer but were not able to trigger the segfault on artificially generated data. I should note, the dataset that causes the segfault was migrated from Postgres 10 Timescale 1.6. That being said, we migrated the dataset incrementally upgrading both Postgres and Timescale in steps according the compatibility charts in TS documentation. We will be able to drop the migrated dataset fully in a month or two by which time all our live partitions would be created by Pg 15 and TS 2.13 exclusively. I will get back to you then whether the issue still persists on new partitions. If not, I will close the issue. Since this issue may be related to our specific upgrade path, please treat it with a lower priority.