[Bug]: time_bucket_gapfill with HAVING clause behaving weirdly

fr3aker commented 1 year ago

What type of bug is this?

Incorrect result

What subsystems and features are affected?

Gapfill

What happened?

Using a HAVING clause with time_bucket_gapfill is returning weird (incorrect) results, swallowing expected rows or returning erroneous rows.

TimescaleDB version affected

2.9.1, 2.8.1

PostgreSQL version used

14.5

What operating system did you use?

gentoo x86-64

What installation method did you use?

Other

What platform did you run on?

On prem/Self-hosted

Relevant log output and stack trace

No response

How can we reproduce the bug?

**Test data**

CREATE TABLE metrics(
    time      TIMESTAMP WITH TIME ZONE,
    device_id INT,
    value     DOUBLE PRECISION
);

INSERT INTO metrics
VALUES
('2023-01-03T00:00:00Z', 1, 4),
('2023-01-03T01:00:00Z', 1, 4),
('2023-01-03T02:00:00Z', 1, 4),
('2023-01-05T00:00:00Z', 1, 6);

No HAVING clause working correctly

SELECT
  time_bucket_gapfill('1 day', time) AS day,
  device_id,
  count(*) as count
FROM metrics
WHERE time >= '2023-01-01T00:00:00Z'
  AND time < '2023-01-08T00:00:00Z'
GROUP BY day, device_id;

          day           | device_id | count 
------------------------+-----------+-------
 2023-01-01 01:00:00+01 |         1 |      
 2023-01-02 01:00:00+01 |         1 |      
 2023-01-03 01:00:00+01 |         1 |     3
 2023-01-04 01:00:00+01 |         1 |      
 2023-01-05 01:00:00+01 |         1 |     1
 2023-01-06 01:00:00+01 |         1 |      
 2023-01-07 01:00:00+01 |         1 |
(7 rows)

*HAVING count() IS NULL* or `count() < 1` not returning anything

SELECT
  time_bucket_gapfill('1 day', time) AS day,
  device_id,
  count(*) as count
FROM metrics
WHERE time >= '2023-01-01T00:00:00Z'
  AND time < '2023-01-08T00:00:00Z'
GROUP BY day, device_id
HAVING count(*) < 1;

 day | device_id | count 
-----+-----------+-------
(0 rows)

Expected result

          day           | device_id | count
------------------------+-----------+-------
 2023-01-01 01:00:00+01 |         1 |
 2023-01-02 01:00:00+01 |         1 |
 2023-01-04 01:00:00+01 |         1 |
 2023-01-06 01:00:00+01 |         1 |
 2023-01-07 01:00:00+01 |         1 |
(5 rows)

*HAVING count() < 2** returning erroneous row

SELECT
  time_bucket_gapfill('1 day', time) AS day,
  device_id,
  count(*) as count
FROM metrics
WHERE time >= '2023-01-01T00:00:00Z'
  AND time < '2023-01-08T00:00:00Z'
GROUP BY day, device_id
HAVING count(*) < 2;

          day           | device_id | count
------------------------+-----------+-------
 2023-01-01 01:00:00+01 |         1 |
 2023-01-02 01:00:00+01 |         1 |
 2023-01-03 01:00:00+01 |         1 |
 2023-01-04 01:00:00+01 |         1 |
 2023-01-05 01:00:00+01 |         1 |     1
 2023-01-06 01:00:00+01 |         1 |
 2023-01-07 01:00:00+01 |         1 |
(7 rows)

Expected result

          day           | device_id | count
------------------------+-----------+-------
 2023-01-01 01:00:00+01 |         1 |
 2023-01-02 01:00:00+01 |         1 |
 2023-01-04 01:00:00+01 |         1 |
 2023-01-05 01:00:00+01 |         1 |     1
 2023-01-06 01:00:00+01 |         1 |
 2023-01-07 01:00:00+01 |         1 |
(6 rows)

RafiaSabih commented 1 year ago

Thanks for taking the time to report this. I can easily reproduce the issue with the given steps. We will log this and will see how can we rectify the issue.

RafiaSabih commented 1 year ago

Hello there, on discussing this further within team, we found out that there is a workaround for this, you can reformulate your query like this, WITH gapfill AS ( SELECT time_bucket_gapfill('1 day', time) AS day, device_id, count(*) as count FROM metrics WHERE time >= '2023-01-01T00:00:00Z' AND time < '2023-01-08T00:00:00Z' GROUP BY day, device_id ) SELECT * FROM gapfill WHERE count IS NULL;

Because as per your original query, filtering happens before gapfill runs so if your having filter removes all groups then gapfill will not see any groups explaining the completely empty result and this is only relevant if you have additional grouping columns beside time column. Hopefully this will help you. Please let us know if you need any more inputs here.

fr3aker commented 1 year ago

Thank you very much for the response. The workaround is working well. Is the current behaviour intended or will there be a fix in a future version?

fr3aker commented 1 year ago

Just a note regarding the workaround (which is working!), this seemingly equivalent query is exhibiting the same broken behaviour:

SELECT * FROM ( SELECT time_bucket_gapfill('1 day', time) AS day, device_id, count(*) as count FROM metrics WHERE time >= '2023-01-01T00:00:00Z' AND time < '2023-01-08T00:00:00Z' GROUP BY day, device_id ) AS gapfill WHERE count IS NULL;

Probably has something to do with postgresql's query optimiser.

kgyrtkirk commented 1 year ago

the suggested variant using the WITH clause generates a query plan with CTE (common table expression); meanwhile - your last query uses a subquery.

CTE Scan on gapfill  (cost=35.96..36.12 rows=1 width=20) (actual time=0.048..0.050 rows=5 loops=1)
  Filter: (count IS NULL)
  Rows Removed by Filter: 2
  CTE gapfill
    ->  Custom Scan (GapFill)  (cost=35.94..35.96 rows=8 width=0) (actual time=0.047..0.048 rows=7 loops=1)
          ->  Sort  (cost=35.94..35.96 rows=8 width=0) (actual time=0.045..0.046 rows=2 loops=1)
                Sort Key: metrics.device_id, (time_bucket_gapfill('1 day'::interval, metrics."time", NULL::timestamp with time zone, NULL::timestamp with time zone))
                Sort Method: quicksort  Memory: 25kB
                ->  GroupAggregate  (cost=35.64..35.82 rows=8 width=0) (actual time=0.015..0.016 rows=2 loops=1)
                      Group Key: (time_bucket_gapfill('1 day'::interval, metrics."time", NULL::timestamp with time zone, NULL::timestamp with time zone)), metrics.device_id
                      ->  Sort  (cost=35.64..35.66 rows=8 width=12) (actual time=0.012..0.013 rows=4 loops=1)
                            Sort Key: (time_bucket_gapfill('1 day'::interval, metrics."time", NULL::timestamp with time zone, NULL::timestamp with time zone)), metrics.device_id
                            Sort Method: quicksort  Memory: 25kB
                            ->  Seq Scan on metrics  (cost=0.00..35.52 rows=8 width=12) (actual time=0.003..0.004 rows=4 loops=1)
                                  Filter: (("time" >= '2023-01-01 00:00:00+00'::timestamp with time zone) AND ("time" < '2023-01-08 00:00:00+00'::timestamp with time zone))
Planning Time: 0.223 ms
Execution Time: 0.159 ms

meanwhile the subquery/having plan looks like this:

Custom Scan (GapFill)  (cost=35.81..35.82 rows=1 width=0) (actual time=0.043..0.043 rows=0 loops=1)
  ->  Sort  (cost=35.81..35.82 rows=1 width=0) (actual time=0.042..0.043 rows=0 loops=1)
        Sort Key: metrics.device_id, (time_bucket_gapfill('1 day'::interval, metrics."time", NULL::timestamp with time zone, NULL::timestamp with time zone))
        Sort Method: quicksort  Memory: 25kB
        ->  GroupAggregate  (cost=35.64..35.80 rows=1 width=0) (actual time=0.017..0.017 rows=0 loops=1)
              Group Key: (time_bucket_gapfill('1 day'::interval, metrics."time", NULL::timestamp with time zone, NULL::timestamp with time zone)), metrics.device_id
              Filter: (count(*) IS NULL)
              Rows Removed by Filter: 2
              ->  Sort  (cost=35.64..35.66 rows=8 width=12) (actual time=0.013..0.013 rows=4 loops=1)
                    Sort Key: (time_bucket_gapfill('1 day'::interval, metrics."time", NULL::timestamp with time zone, NULL::timestamp with time zone)), metrics.device_id
                    Sort Method: quicksort  Memory: 25kB
                    ->  Seq Scan on metrics  (cost=0.00..35.52 rows=8 width=12) (actual time=0.003..0.004 rows=4 loops=1)
                          Filter: (("time" >= '2023-01-01 00:00:00+00'::timestamp with time zone) AND ("time" < '2023-01-08 00:00:00+00'::timestamp with time zone))
Planning Time: 0.219 ms
Execution Time: 0.143 ms

as of now - this second one is having (:D) trouble in your case - base you want to formulate a condition for the columns populated by the GapFill scan.

I think one way to permanently fix this could be to make the following changes during planning:

remove the Filter from the GroupAggregate
insert a new Filter on top of the CustomScan

Thank you @fr3aker for reporting this; I'll move this into the backlog!

timescale / timescaledb