Missing values / Gap query

gjeusel commented 3 years ago

Is your feature request related to a problem? Please describe.

In the renewable energy area, we want to be conservative on missing values. Meaning, if a turbine didn't send any production for a certain period of time, we want to keep track of it.

This can lead to either missing indexes in the db, or indexes for which a column is set to Null or NaN.

It would be of interest to have a query to retrieve those gaps. The completeness of the timeseries might then be deduced, with the definition being the following ratio: (nrecords with value) / (nrecords with value + nrecords missing)

A wind turbine production with a 0.95 completness means 5% of the time it was in maintenance / stopped for any other reason.

Describe the solution you'd like

-- Sample data: mind the gaps in the timestamps
SELECT * FROM tsdata.power
+------------------------+------------+
| ts                     | power_kw   |
|------------------------+------------|
| 2019-12-31 23:00:00+00 | 743.0      |
| 2019-12-31 23:02:00+00 | 712.0      |
| 2019-12-31 23:03:00+00 | 696.0      |
| 2019-12-31 23:06:00+00 | 769.0      |
| 2019-12-31 23:07:00+00 | 840.0      |
| 2019-12-31 23:08:00+00 | 832.0      |
| 2019-12-31 23:09:00+00 | 850.0      |
| 2019-12-31 23:15:00+00 | 892.0      |
+------------------------+------------+

-- Select gaps that last 5 minutes
SELECT time_bucket_gaps('5 minute', ts, power_kw) FROM tsdata.power
+------------------------+
| time_bucket_gaps       |
|------------------------+
| 2019-12-31 23:10:00+00 |
+------------------------+

-- Select gaps that last 1 minutes
SELECT time_bucket_gaps('1 minute', ts, power_kw) FROM tsdata.power
+------------------------+
| time_bucket_gaps       |
|------------------------+
| 2019-12-31 23:01:00+00 |
| 2019-12-31 23:04:00+00 |
| 2019-12-31 23:05:00+00 |
| 2019-12-31 23:10:00+00 |
| 2019-12-31 23:11:00+00 |
| 2019-12-31 23:12:00+00 |
| 2019-12-31 23:13:00+00 |
| 2019-12-31 23:14:00+00 |
+------------------------+

-- Select gaps that last 2 minutes
SELECT time_bucket_gaps('2 minute', ts, power_kw) FROM tsdata.power
+------------------------+
| time_bucket_gaps       |
|------------------------+
| 2019-12-31 23:04:00+00 |
| 2019-12-31 23:10:00+00 |
| 2019-12-31 23:12:00+00 |
+------------------------+

Additional context

Often, the wind turbine captor is sending a value only when it measures a different value compared to previously, with a minimum interval of one minute.

This lead to unevenly-spaced timeseries:

                           power_kW
2020-01-01 00:00:00+01:00     743.0
2020-01-01 00:02:00+01:00     712.0
2020-01-01 00:03:00+01:00     696.0
2020-01-01 00:06:00+01:00     769.0
2020-01-01 00:07:00+01:00     840.0
2020-01-01 00:08:00+01:00     832.0
2020-01-01 00:09:00+01:00     850.0
2020-01-01 00:15:00+01:00     892.0

Or viewing it with NaNs:

                           power_kW
2020-01-01 00:00:00+01:00     743.0
2020-01-01 00:01:00+01:00       NaN
2020-01-01 00:02:00+01:00     712.0
2020-01-01 00:03:00+01:00     696.0
2020-01-01 00:04:00+01:00       NaN
2020-01-01 00:05:00+01:00       NaN
2020-01-01 00:06:00+01:00     769.0
2020-01-01 00:07:00+01:00     840.0
2020-01-01 00:08:00+01:00     832.0
2020-01-01 00:09:00+01:00     850.0
2020-01-01 00:10:00+01:00       NaN
2020-01-01 00:11:00+01:00       NaN
2020-01-01 00:12:00+01:00       NaN
2020-01-01 00:13:00+01:00       NaN
2020-01-01 00:14:00+01:00       NaN
2020-01-01 00:15:00+01:00     892.0

Hence the sample data used previously.

davidkohn88 commented 3 years ago

Hi! Have you looked at https://docs.timescale.com/latest/api#time_bucket_gapfill?

I think time_bucket_gapfill in a subquery followed by a count(*) filter (where power is null) + count(*) should give you the ratios you want?

Though I suppose that wouldn't work with continuous aggregates, which may be why you're asking more specifically about it.

ivaaaan commented 2 months ago

I was looking for something like this as well. Can be solved by this query:

select date(b), count(*) filter (where a.count is null) from 
   (select 
         time_bucket_gapfill('1m', bucket) as b, count(*) as count 
    from series_aggregegated_by_1m 
         where bucket between '2024-06-01 00:00:00' and now() and sensor_type = ... group by b) 
a group by date(b);

This one groups by day, but you can adjust it as you want.

timescale / timescaledb-toolkit

Missing values / Gap query #82