timescale / timescaledb

An open-source time-series SQL database optimized for fast ingest and complex queries. Packaged as a PostgreSQL extension.
https://www.timescale.com/

[Bug]: where count clause ignored in timebucket_gapfill #6484

Open · BloodWorkXGaming opened this issue 8 months ago

BloodWorkXGaming commented 8 months ago

What type of bug is this?

Incorrect result

What subsystems and features are affected?

Gapfill

What happened?

A WHERE clause filtering on the count of a gapfill group seems to be ignored. I hope I am not doing something very wrong here:

Minimal example:

select * from (
    select time_bucket_gapfill('01:00:00'::interval, time, 'Europe/Berlin',
                               '2023-10-10 06:00:00+00'::timestamptz,
                               '2023-10-10 10:00:00+00'::timestamptz) AS tb,
           count(value) as count
    from (values
        ('2023-10-10 06:00:00+00'::timestamptz, 1),
        ('2023-10-10 06:01:00+00', 2)
    ) as t(time, value)
    group by tb
) t
where count > 0;

(Screenshot of the query result omitted.) NULL values are still present. The result does not change when using IS NOT NULL or HAVING, and a count = 2 filter does not work either.
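
For reference, here is one of the failing variants in full (a minimal sketch based on the minimal example above; it returns the same result, with the NULL rows still present):

-- IS NOT NULL variant: behaves exactly like the count > 0 filter,
-- the gapfilled rows with NULL counts remain in the output.
select * from (
    select time_bucket_gapfill('01:00:00'::interval, time, 'Europe/Berlin',
                               '2023-10-10 06:00:00+00'::timestamptz,
                               '2023-10-10 10:00:00+00'::timestamptz) AS tb,
           count(value) as count
    from (values
        ('2023-10-10 06:00:00+00'::timestamptz, 1),
        ('2023-10-10 06:01:00+00', 2)
    ) as t(time, value)
    group by tb
) t
where count is not null;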

Any ideas what could cause this behavior?

Thanks!

TimescaleDB version affected

2.10.2

PostgreSQL version used

14.7

What operating system did you use?

WSL

What installation method did you use?

Docker

What platform did you run on?

On prem/Self-hosted

Relevant log output and stack trace

no crash, therefore no logs

How can we reproduce the bug?

select * from (
    select time_bucket_gapfill('01:00:00'::interval, time, 'Europe/Berlin',
                               '2023-10-10 06:00:00+00'::timestamptz,
                               '2023-10-10 10:00:00+00'::timestamptz) AS tb,
           count(value) as count
    from (values
        ('2023-10-10 06:00:00+00'::timestamptz, 1),
        ('2023-10-10 06:01:00+00', 2)
    ) as t(time, value)
    group by tb
) t
where count > 0;

jnidzwetzki commented 8 months ago

Hello @BloodWorkXGaming,

Thanks for reaching out. I was able to reproduce the behavior in my local environment with TimescaleDB 2.13.0. The current version of time_bucket_gapfill has a few limitations. It seems you ran into one of them with the given query. This is the query plan for the statement:

test2=# explain select * from (
select time_bucket_gapfill('01:00:00'::interval, time, 'Europe/Berlin', '2023-10-10 06:00:00+00'::timestamptz, '2023-10-10 10:00:00+00'::timestamptz) AS tb,
count(value) as count
from (values
('2023-10-10 06:00:00+00'::timestamptz, 1), ('2023-10-10 06:01:00+00', 2)
)  as t(time, value)
group by tb) t
where count > 0;
                                  QUERY PLAN
------------------------------------------------------------------------------
 Custom Scan (GapFill)  (cost=0.04..0.08 rows=1 width=0)
   ->  GroupAggregate  (cost=0.04..0.08 rows=1 width=0)
         Group Key: (time_bucket_gapfill('01:00:00'::interval, "*VALUES*".column1, 'Europe/Berlin'::text, '2023-10-10 08:00:00+02'::timestamp with time zone, '2023-10-10 12:00:00+02'::timestamp with time zone))
         Filter: (count("*VALUES*".column2) > 0)
         ->  Sort  (cost=0.04..0.05 rows=2 width=12)
               Sort Key: (time_bucket_gapfill('01:00:00'::interval, "*VALUES*".column1, 'Europe/Berlin'::text, '2023-10-10 08:00:00+02'::timestamp with time zone, '2023-10-10 12:00:00+02'::timestamp with time zone))
               ->  Values Scan on "*VALUES*"  (cost=0.00..0.03 rows=2 width=12)
(7 rows)

Even though you specify a subquery, PostgreSQL flattens it during planning, so the subquery is gone before execution. As you can see in the query plan, the filter for where count > 0 is pushed down by PostgreSQL below the GapFill node. Therefore, it is applied before time_bucket_gapfill is executed.
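
Conceptually, the flattened statement is roughly equivalent to the following (an illustrative rewrite, matching the Filter on the GroupAggregate node in the plan above):

-- Illustrative: roughly what the planner executes after flattening.
-- The count filter sits on the aggregate, i.e. it runs before GapFill.
select time_bucket_gapfill('01:00:00'::interval, time, 'Europe/Berlin',
                           '2023-10-10 06:00:00+00'::timestamptz,
                           '2023-10-10 10:00:00+00'::timestamptz) AS tb,
       count(value) as count
from (values
    ('2023-10-10 06:00:00+00'::timestamptz, 1),
    ('2023-10-10 06:01:00+00', 2)
) as t(time, value)
group by tb
having count(value) > 0;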

As an alternative to the subquery, you can use a Common Table Expression (CTE):

explain WITH cte AS (
select time_bucket_gapfill('01:00:00'::interval, time, 'Europe/Berlin', '2023-10-10 06:00:00+00'::timestamptz, '2023-10-10 10:00:00+00'::timestamptz) AS tb,
count(value) as count
from (values
('2023-10-10 06:00:00+00'::timestamptz, 1), ('2023-10-10 06:01:00+00', 2)
)  as t(time, value)
group by tb)
SELECT * FROM cte where count > 0;
                                  QUERY PLAN
------------------------------------------------------------------------------
 CTE Scan on cte  (cost=0.08..0.12 rows=1 width=16)
   Filter: (count > 0)
   CTE cte
     ->  Custom Scan (GapFill)  (cost=0.04..0.08 rows=2 width=0)
           ->  GroupAggregate  (cost=0.04..0.08 rows=2 width=0)
                 Group Key: (time_bucket_gapfill('01:00:00'::interval, "*VALUES*".column1, 'Europe/Berlin'::text, '2023-10-10 08:00:00+02'::timestamp with time zone, '2023-10-10 12:00:00+02'::timestamp with time zone))
                 ->  Sort  (cost=0.04..0.05 rows=2 width=12)
                       Sort Key: (time_bucket_gapfill('01:00:00'::interval, "*VALUES*".column1, 'Europe/Berlin'::text, '2023-10-10 08:00:00+02'::timestamp with time zone, '2023-10-10 12:00:00+02'::timestamp with time zone))
                       ->  Values Scan on "*VALUES*"  (cost=0.00..0.03 rows=2 width=12)
(9 rows)

test2=#  WITH cte AS (
select time_bucket_gapfill('01:00:00'::interval, time, 'Europe/Berlin', '2023-10-10 06:00:00+00'::timestamptz, '2023-10-10 10:00:00+00'::timestamptz) AS tb,
count(value) as count
from (values
('2023-10-10 06:00:00+00'::timestamptz, 1), ('2023-10-10 06:01:00+00', 2)
)  as t(time, value)
group by tb)
SELECT * FROM cte where count > 0;
           tb           | count 
------------------------+-------
 2023-10-10 08:00:00+02 |     2
(1 row)

As you can see in this query plan, the Filter: (count > 0) operation is now performed after GapFill is executed and the query output is filtered as expected.
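
One hedged caveat: since PostgreSQL 12, the planner may inline a CTE that is referenced only once, which could in principle re-enable the pushdown. If needed, the CTE can be forced to stay an optimization fence with AS MATERIALIZED (a minimal sketch of the same workaround):

-- Defensive variant: AS MATERIALIZED (PostgreSQL 12+) prevents CTE inlining,
-- so the count filter is guaranteed to run after the GapFill node.
WITH cte AS MATERIALIZED (
    select time_bucket_gapfill('01:00:00'::interval, time, 'Europe/Berlin',
                               '2023-10-10 06:00:00+00'::timestamptz,
                               '2023-10-10 10:00:00+00'::timestamptz) AS tb,
           count(value) as count
    from (values
        ('2023-10-10 06:00:00+00'::timestamptz, 1),
        ('2023-10-10 06:01:00+00', 2)
    ) as t(time, value)
    group by tb
)
SELECT * FROM cte where count > 0;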

BloodWorkXGaming commented 8 months ago

Hi @jnidzwetzki, thanks for the quick answer! :)

I can confirm that this workaround works perfectly, thanks for that :) I'll leave it up to you whether to keep this issue open for further investigation or to close it now that a workaround has been found.