timescale / timescaledb

An open-source time-series SQL database optimized for fast ingest and complex queries. Packaged as a PostgreSQL extension.
https://www.timescale.com/
Other
16.94k stars 858 forks source link

Segfault in get_aggsplit #3672

Closed phemmer closed 2 years ago

phemmer commented 2 years ago

Relevant system information:

Describe the bug timescaledb segfaults during a query

To Reproduce Unsure. Sometimes it works, sometimes (a lot of times) it doesn't. All of a sudden started without obvious explanation. Does it on multiple different queries.

The segfault is on the access node of a multi-node cluster.

Here's one example query, one similar to the gdb output below (just different values):

select
  time_bucket('2s',"time") AS "time",
  host,
  sum("errors") / nullif(sum("http_response"),0)
FROM haproxy_agg_instance_host_1m
WHERE
  "time" BETWEEN '2021-10-08T17:06:48.641Z' AND '2021-10-08T18:06:48.641Z'
  AND instance in ('haproxy-1234')
  AND host in ('1','2','3','4','5','6','7','8','9','10','11','12','13','14','15','16','17','18','19','20','21','22','23','24','25','26','27','28','29','30','31','32')
GROUP BY "time", host

Expected behavior no segfault

Actual behavior segfault

Screenshots

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007fe2e2cb6f87 in get_aggsplit (rel=0x7fe2e2b20188) at /src/tsl/src/fdw/estimate.c:51
51          if (IsA(expr, Aggref))
(gdb) bt full
#0  0x00007fe2e2cb6f87 in get_aggsplit (rel=0x7fe2e2b20188) at /src/tsl/src/fdw/estimate.c:51
        expr = 0x0
        lc__state = {l = <optimized out>, i = 0}
        lc = 0x7fe2e2b1f030
        lc = <optimized out>
        lc__state = <optimized out>
        expr = <optimized out>
#1  get_upper_rel_estimate (ce=<synthetic pointer>, rel=0x7fe2e2b20188, root=0x56410baedfe8) at /src/tsl/src/fdw/estimate.c:101
        aggsplit = <optimized out>
        fpinfo = 0x56410bc3e9d8
        ofpinfo = 0x56410bc3eae8
        ptarget = 0x7fe2e2b1eda8
        input_rows = 70
        num_group_cols = <optimized out>
        aggcosts = {numAggs = 0, numOrderedAggs = 0, hasNonPartial = false, hasNonSerial = false, transCost = {startup = 0, per_tuple = 0}, finalCost = {startup = 0, per_tuple = 0}, transitionSpace = 0}
        num_groups = 1
        fpinfo = <optimized out>
        ofpinfo = <optimized out>
        ptarget = <optimized out>
        aggcosts = <optimized out>
        input_rows = <optimized out>
        num_group_cols = <optimized out>
        num_groups = <optimized out>
        _vstart = <optimized out>
        _val = <optimized out>
        _len = <optimized out>
        _start = <optimized out>
        _stop = <optimized out>
        aggsplit = <optimized out>
#2  fdw_estimate_path_cost_size (root=root@entry=0x56410baedfe8, rel=rel@entry=0x7fe2e2b20188, pathkeys=pathkeys@entry=0x0, p_rows=p_rows@entry=0x7ffcc5706f50, p_width=p_width@entry=0x7ffcc5706f4c, 
    p_startup_cost=p_startup_cost@entry=0x7ffcc5706f58, p_total_cost=0x7ffcc5706f60) at /src/tsl/src/fdw/estimate.c:230
        fpinfo = 0x56410bc3e9d8
        ce = {rows = <optimized out>, retrieved_rows = <optimized out>, width = 104, startup_cost = <optimized out>, total_cost = <optimized out>, cpu_per_tuple = <optimized out>, run_cost = <optimized out>}
        __func__ = "fdw_estimate_path_cost_size"
#3  0x00007fe2e2cb92b4 in add_foreign_grouping_paths (create_path=0x7fe2e2cbee00 <data_node_scan_upper_path_create>, extra=<optimized out>, grouped_rel=0x4, input_rel=<optimized out>, root=0x56410baedfe8)
    at /src/tsl/src/fdw/scan_plan.c:628
        parse = <optimized out>
        ifpinfo = <optimized out>
        fpinfo = 0x56410bc3e9d8
        total_cost = 6.9526706677819959e-310
        grouppath = <optimized out>
        rows = 6.9471777292701308e-310
        width = 32738
        startup_cost = -8544742183141376
        parse = <optimized out>
        ifpinfo = <optimized out>
        fpinfo = <optimized out>
        grouppath = <optimized out>
        rows = <optimized out>
        width = <optimized out>
        startup_cost = <optimized out>
        total_cost = <optimized out>
#4  fdw_create_upper_paths (input_fpinfo=<optimized out>, root=root@entry=0x56410baedfe8, stage=stage@entry=UPPERREL_GROUP_AGG, input_rel=input_rel@entry=0x56410bbcd5b8, output_rel=output_rel@entry=0x7fe2e2b20188, 
    extra=extra@entry=0x7ffcc5707250, create_path=0x7fe2e2cbee00 <data_node_scan_upper_path_create>) at /src/tsl/src/fdw/scan_plan.c:680
No locals.
#5  0x00007fe2e2cbfc93 in data_node_scan_create_upper_paths (root=0x56410baedfe8, stage=UPPERREL_GROUP_AGG, input_rel=0x56410bbcd5b8, output_rel=0x7fe2e2b20188, extra=0x7ffcc5707250)
    at /src/tsl/src/fdw/data_node_scan_plan.c:506
        rel_private = <optimized out>
        fpinfo = <optimized out>
#6  0x000056410991f47b in create_ordinary_grouping_paths (root=<optimized out>, input_rel=0x56410bbcd5b8, grouped_rel=0x7fe2e2b20188, agg_costs=0x7ffcc5707450, gd=0x0, extra=<optimized out>, 
    partially_grouped_rel_p=0x7ffcc5707248) at ./build/../src/backend/optimizer/plan/planner.c:4171
        cheapest_path = <optimized out>
        partially_grouped_rel = <optimized out>
        dNumGroups = 70
        patype = <optimized out>
        __func__ = "create_ordinary_grouping_paths"
#7  0x0000564109920782 in create_partitionwise_grouping_paths (input_rel=0x56410bb61a18, input_rel=0x56410bb61a18, extra=0x7ffcc5707490, patype=<optimized out>, gd=0x0, agg_costs=0x7ffcc5707450, 
    partially_grouped_rel=<optimized out>, grouped_rel=0x7fe2e2b1eaa8, root=0x56410baedfe8) at ./build/../src/backend/optimizer/plan/planner.c:7657
        child_input_rel = 0x56410bbcd5b8
        child_target = <optimized out>
        appinfos = 0x7fe2e2b1db90
        child_extra = {flags = 7, partial_costs_set = false, agg_partial_costs = {numAggs = -762970112, numOrderedAggs = -1019323547, hasNonPartial = 104, hasNonSerial = 122, transCost = {startup = 0, 
              per_tuple = 6.9526706678563033e-310}, finalCost = {startup = 4.6855867561413431e-310, per_tuple = 4.6855885748251472e-310}, transitionSpace = 0}, agg_final_costs = {numAggs = -982485776, 
            numOrderedAggs = 32764, hasNonPartial = 16, hasNonSerial = 171, transCost = {startup = 4.1720347046589122e-309, per_tuple = 2.0158960014170087e-311}, finalCost = {startup = 4.6855869120163864e-310, 
              per_tuple = 1.3907136014837293e-309}, transitionSpace = 94837369346306}, target_parallel_safe = true, havingQual = 0x0, targetList = 0x7fe2e2b1f938, patype = PARTITIONWISE_AGGREGATE_FULL}
        child_grouped_rel = 0x7fe2e2b20188
        nappinfos = 1
        child_partially_grouped_rel = 0x0
        nparts = <optimized out>
        grouped_live_children = 0x0
        partial_grouping_valid = true
        cnt_parts = <optimized out>
        partially_grouped_live_children = 0x0
        target = 0x7fe2e2b1d9e0
        nparts = <optimized out>
        cnt_parts = <optimized out>
        grouped_live_children = <optimized out>
        partially_grouped_live_children = <optimized out>
        target = <optimized out>
        partial_grouping_valid = <optimized out>
        child_input_rel = <optimized out>
        child_target = <optimized out>
        appinfos = <optimized out>
        nappinfos = <optimized out>
        child_extra = <optimized out>
        child_grouped_rel = <optimized out>
        child_partially_grouped_rel = <optimized out>
#8  create_ordinary_grouping_paths (root=<optimized out>, input_rel=0x56410bb61a18, grouped_rel=0x7fe2e2b1eaa8, agg_costs=0x7ffcc5707450, gd=0x0, extra=<optimized out>, partially_grouped_rel_p=0x7ffcc5707438)
    at ./build/../src/backend/optimizer/plan/planner.c:4123
        cheapest_path = 0x7fe2e2b1e7a8
        partially_grouped_rel = <optimized out>
        dNumGroups = <optimized out>
        patype = <optimized out>
        __func__ = "create_ordinary_grouping_paths"
#9  0x000056410992317a in create_grouping_paths (gd=0x0, agg_costs=0x7ffcc5707450, target_parallel_safe=true, target=0x7fe2e2b1d9e0, input_rel=0x56410bb61a18, root=0x56410baedfe8)
    at ./build/../src/backend/optimizer/plan/planner.c:3896
        flags = 7
        extra = {flags = 7, partial_costs_set = false, agg_partial_costs = {numAggs = -762970112, numOrderedAggs = -1019323547, hasNonPartial = 104, hasNonSerial = 122, transCost = {startup = 0, 
              per_tuple = 6.9526706678563033e-310}, finalCost = {startup = 4.6855867561413431e-310, per_tuple = 4.6855885748251472e-310}, transitionSpace = 0}, agg_final_costs = {numAggs = -982485776, 
            numOrderedAggs = 32764, hasNonPartial = 16, hasNonSerial = 171, transCost = {startup = 4.1720347046589122e-309, per_tuple = 2.0158960014170087e-311}, finalCost = {startup = 4.6855869120163864e-310, 
              per_tuple = 1.3907136014837293e-309}, transitionSpace = 94837369346306}, target_parallel_safe = true, havingQual = 0x0, targetList = 0x56410bb5ff60, patype = PARTITIONWISE_AGGREGATE_FULL}
        parse = 0x56410ba823d0
        grouped_rel = 0x7fe2e2b1eaa8
        partially_grouped_rel = 0x0
        parse = <optimized out>
        grouped_rel = <optimized out>
        partially_grouped_rel = <optimized out>
        flags = <optimized out>
        extra = <optimized out>
#10 grouping_planner (root=<optimized out>, inheritance_update=false, tuple_fraction=<optimized out>) at ./build/../src/backend/optimizer/plan/planner.c:2201
        sort_input_targets = 0x0
        sort_input_target_parallel_safe = true
        grouping_target = 0x7fe2e2b1d9e0
        scanjoin_target = <optimized out>
        activeWindows = <optimized out>
        qp_extra = {activeWindows = 0x0, groupClause = 0x56410baec968}
        sort_input_targets_contain_srfs = 0x0
        have_grouping = <optimized out>
        wflists = <optimized out>
        gset_data = 0x0
        sort_input_target = 0x7fe2e2b1d9e0
        grouping_targets = 0x0
        grouping_target_parallel_safe = true
        scanjoin_targets = 0x7fe2e2b1daf0
        scanjoin_target_parallel_safe = <optimized out>
        grouping_targets_contain_srfs = 0x0
        scanjoin_targets_contain_srfs = 0x0
        scanjoin_target_same_exprs = <optimized out>
        agg_costs = {numAggs = 2, numOrderedAggs = 0, hasNonPartial = false, hasNonSerial = false, transCost = {startup = 0, per_tuple = 0.0050000000000000001}, finalCost = {startup = 0, per_tuple = 0.0050000000000000001}, transitionSpace = 256}
        parse = 0x56410ba823d0
        offset_est = 0
        count_est = 0
        limit_tuples = -1
        have_postponed_srfs = false
        final_target = 0x7fe2e2b1d9e0
        final_targets = 0x0
        final_targets_contain_srfs = 0x0
        final_target_parallel_safe = true
        current_rel = 0x56410bb61a18
        final_rel = <optimized out>
        extra = {limit_needed = 7, limit_tuples = -8544742183141376, count_est = 140612445698664, offset_est = 0}
        lc = <optimized out>
        __func__ = "grouping_planner"
#11 0x00005641099246a1 in subquery_planner (glob=glob@entry=0x56410baef4a8, parse=parse@entry=0x56410ba823d0, parent_root=parent_root@entry=0x0, hasRecursion=hasRecursion@entry=false, tuple_fraction=tuple_fraction@entry=0) at ./build/../src/backend/optimizer/plan/planner.c:1015
        root = <optimized out>
        newWithCheckOptions = <optimized out>
        newHaving = <optimized out>
        hasOuterJoins = <optimized out>
        hasResultRTEs = <optimized out>
        final_rel = <optimized out>
        l = <optimized out>
#12 0x0000564109925895 in standard_planner (parse=0x56410ba823d0, query_string=<optimized out>, cursorOptions=256, boundParams=<optimized out>) at ./build/../src/backend/optimizer/plan/planner.c:405
        result = <optimized out>
        glob = 0x56410baef4a8
        tuple_fraction = 0
        root = <optimized out>
        final_rel = <optimized out>
        best_path = <optimized out>
        top_plan = <optimized out>
        lp = <optimized out>
        lr = <optimized out>
#13 0x00007fe362944ac5 in pgss_planner (parse=0x56410ba823d0, query_string=0x56410b96dfd8 "select\n  time_bucket('30s',\"time\") AS \"time\",\n  host,\n  sum(\"errors\") / nullif(sum(\"http_response\"),0)\nFROM haproxy_agg_instance_host_1m\nWHERE\n  \"time\" BETWEEN '2021-10-08T11:38:28.523Z' AND '2021-10-"..., cursorOptions=256, boundParams=0x0) at ./build/../contrib/pg_stat_statements/pg_stat_statements.c:991
        result = <optimized out>
#14 0x00007fe36293535f in ?? ()
No symbol table info available.
#15 0x000100020000003e in ?? ()
No symbol table info available.
#16 0x00007fe2e2dd9102 in ?? ()
No symbol table info available.
#17 0x0000000000000000 in ?? ()
No symbol table info available.

Additional context Add any other context about the problem here.

mkindahl commented 2 years ago

@phemmer I assume that haproxy_agg_instance_host_1m is a continuous aggregate. Could you add the definition of the continuous aggregate and the associated hypertable?

phemmer commented 2 years ago

No, it's not a continuous aggregate, but it is a view. However even if I query the underlying table directly (substitute haproxy_agg_instance_host_1m with the underlying table name _haproxy_agg_instance_host_1m), the issue still occurs.

Below is the underlying table:

# \d+ _haproxy_agg_instance_host_1m
                                     Table "public._haproxy_agg_instance_host_1m"
    Column     |            Type             | Collation | Nullable | Default | Storage  | Stats target | Description 
---------------+-----------------------------+-----------+----------+---------+----------+--------------+-------------
 time          | timestamp without time zone |           | not null |         | plain    |              | 
 instance      | text                        |           |          |         | extended |              | 
 host          | text                        |           |          |         | extended |              | 
 http_response | numeric                     |           |          |         | main     |              | 
 bin           | numeric                     |           |          |         | main     |              | 
 bout          | numeric                     |           |          |         | main     |              | 
 errors        | numeric                     |           |          |         | main     |              | 
Indexes:
    "_haproxy_agg_instance_host_1m_host_time_idx" btree (host, "time" DESC)
    "_haproxy_agg_instance_host_1m_time_idx" btree ("time" DESC)
Triggers:
    ts_insert_blocker BEFORE INSERT ON _haproxy_agg_instance_host_1m FOR EACH ROW EXECUTE FUNCTION _timescaledb_internal.insert_blocker()
Child tables: _timescaledb_internal._dist_hyper_3191_11531_chunk,
              _timescaledb_internal._dist_hyper_3191_11532_chunk,
              _timescaledb_internal._dist_hyper_3191_11554_chunk,
...
phemmer commented 2 years ago

I'm able to reliably reproduce it (at least on my cluster). I don't know how much of the below is actually necessary, but here it is:

create table test (time timestamp, instance text, host text, http_response bigint, bin bigint, bout bigint, errors bigint);
create index on test (host, time desc);
select create_distributed_hypertable('test',
    'time',
    partitioning_column => 'host',
    number_partitions => 2,
    replication_factor => 2,
    chunk_time_interval => interval '1h');
alter table test set (timescaledb.compress, timescaledb.compress_segmentby = 'host');
insert into test values
                     ('2021-10-15T00:00:00','a','a',1,2,3,4),
                     ('2021-10-15T00:00:00','a','b',1,2,3,4),
                     ('2021-10-15T00:00:00','a','c',1,2,3,4),
                     ('2021-10-15T00:00:00','a','d',1,2,3,4),
                     ('2021-10-15T01:00:00','a','a',1,2,3,4),
                     ('2021-10-15T01:00:00','a','b',1,2,3,4),
                     ('2021-10-15T01:00:00','a','c',1,2,3,4),
                     ('2021-10-15T01:00:00','a','d',1,2,3,4),
                     ('2021-10-15T02:00:00','a','a',1,2,3,4),
                     ('2021-10-15T02:00:00','a','b',1,2,3,4),
                     ('2021-10-15T02:00:00','a','c',1,2,3,4),
                     ('2021-10-15T02:00:00','a','d',1,2,3,4),
                     ('2021-10-15T03:00:00','a','a',1,2,3,4),
                     ('2021-10-15T03:00:00','a','b',1,2,3,4),
                     ('2021-10-15T03:00:00','a','c',1,2,3,4),
                     ('2021-10-15T03:00:00','a','d',1,2,3,4)
;
select compress_chunk(show_chunks) from show_chunks('test', older_than => '2021-10-15T02:00:00'::timestamp);

select
  time_bucket('2s',"time") AS "time",
  host,
  sum("errors") / nullif(sum("http_response"),0)
FROM test
WHERE
  "time" BETWEEN '2021-10-15T00:02:00.000Z' AND '2021-10-15T00:03:00.000Z'
  AND instance in ('a')
  AND host in ('a','b')
GROUP BY "time", host;