timescale / timescaledb

An open-source time-series SQL database optimized for fast ingest and complex queries. Packaged as a PostgreSQL extension.
https://www.timescale.com/

gap filling: maximum length #1402

Open zinji opened 5 years ago

zinji commented 5 years ago

Do the gap-filling functions have an option to limit the size of the gaps being filled, by setting a maximum gap length in number of time steps?

svenklemm commented 5 years ago

@zinji It currently does not, but that's an interesting idea. What's the use case you had in mind for this functionality?

zinji commented 5 years ago

@svenklemm This functionality is highly useful when computing statistics on time-series data, such as mean daily or monthly temperature, daily rainfall totals, etc.
Use case: from hourly rainfall totals, I want to compute daily rainfall totals to 9:00 am, using gap filling to fill any gaps shorter than 4 hours, while any gaps longer than 4 hours are left NULL.
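For context, a minimal sketch of that use case against the current gapfill API might look like the query below. The `hourly_rainfall` table, `mm` column and dates are made up for illustration, and the 9:00-to-9:00 daily boundary is glossed over; the point is only that `interpolate()` today fills every empty bucket, with no way to express "only fill gaps shorter than 4 hours":

```sql
-- Sketch only: hourly_rainfall(ts timestamptz, mm double precision) is a hypothetical table.
WITH hourly_filled AS (
    SELECT
        time_bucket_gapfill('1 hour', ts,
                            start  => '2019-06-01 09:00'::timestamptz,
                            finish => '2019-06-02 09:00'::timestamptz) AS hour,
        -- fills a 10-hour outage just as readily as a 1-hour one
        interpolate(sum(mm)) AS mm
    FROM hourly_rainfall
    GROUP BY hour
)
-- roll the filled hours up into daily totals
SELECT time_bucket('1 day', hour) AS day,
       sum(mm) AS daily_mm
FROM hourly_filled
GROUP BY day
ORDER BY day;
```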

moodgorning commented 5 years ago

I concur, we have a very similar use case at hand that would benefit very much from this. We have sensor readings that usually come in every 15 minutes. So we are ok with interpolating 1-2 missing values, but don't want to go further than that. Anything more would be treated as a gap in the data, i.e. NULL

moodgorning commented 5 years ago

I think I managed to achieve this with the following query:

```sql
-- double precision rather than numeric, since interpolate() is documented for int/float types
CREATE TABLE raw_measurements (
    ts    timestamp,
    value double precision
);

INSERT INTO raw_measurements VALUES
    ('2015-05-01 12:00:00.0', 1),
    ('2015-05-01 12:15:00.0', 2),
    ('2015-05-01 12:30:00.0', 3),
    ('2015-05-01 12:45:00.0', 4),
    ('2015-05-01 13:00:00.0', 5),
    ('2015-05-01 13:15:00.0', 6),
    ('2015-05-01 13:30:00.0', 7),
    ('2015-05-01 13:45:00.0', 8),
    ('2015-05-01 14:00:00.0', 9),
    ('2015-05-01 14:15:00.0', 10),
    -- 14:30 and 14:45 are missing: a 45-minute gap
    ('2015-05-01 15:00:00.0', 7),
    ('2015-05-01 15:15:00.0', 8),
    ('2015-05-01 15:30:00.0', 9),
    ('2015-05-01 15:45:00.0', 10);

WITH deltas AS (
    -- delta = seconds from this reading to the next one (a negative number)
    SELECT ts,
           value,
           extract(epoch from ts) - lead(extract(epoch from ts), 1) OVER (ORDER BY ts) AS delta
    FROM raw_measurements
),
interpolated AS (
    SELECT time_bucket_gapfill('5 minutes', ts,
                               start  => '2015-05-01 11:53:00',
                               finish => '2015-05-01 17:00:00') AS minutes_5,
           interpolate(avg(value)) AS avg_val,
           locf(last(delta, ts))   AS locf_delta,    -- carries the gap size forward into filled buckets
           last(delta, ts)         AS delta_actual   -- NULL in buckets that were gap-filled
    FROM deltas
    GROUP BY minutes_5
)
SELECT *
FROM interpolated
-- %s is the application-side placeholder for the maximum gap in seconds (900 here)
WHERE interpolated.locf_delta >= -%s
   OR delta_actual IS NOT NULL;
```

Not the most elegant way, and I'm sure there's lots of room for improvement, but the general idea is to interpolate over the whole dataset and generate a new column with the time to the next measurement, which can then be used to drop all interpolations that span a gap of more than x seconds (900 in this example). This will not fill the gap with NULL values, but you could probably achieve that by running another time_bucket_gapfill over the current result.
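To illustrate that last suggestion: a second gapfill pass over the surviving rows, with no `locf()`/`interpolate()` on the value column, should bring the dropped buckets back as NULL rows. An untested sketch, where the `kept_rows` CTE just stands in for whatever the gap-length filter above returned:

```sql
-- Standalone sketch: kept_rows mimics the output after over-long interpolations were dropped.
WITH kept_rows(ts, value) AS (
    VALUES ('2015-05-01 12:00'::timestamp, 1),
           ('2015-05-01 12:15'::timestamp, 2),
           -- 12:30 and 12:45 were dropped by the filter
           ('2015-05-01 13:00'::timestamp, 5)
)
SELECT time_bucket_gapfill('15 minutes', ts,
                           start  => '2015-05-01 12:00'::timestamp,
                           finish => '2015-05-01 13:15'::timestamp) AS bucket,
       avg(value) AS value   -- no locf()/interpolate(), so 12:30 and 12:45 come back as NULL rows
FROM kept_rows
GROUP BY bucket
ORDER BY bucket;
```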

svenklemm commented 5 years ago

@zinji OK, so you would want gapfill to still produce the row with a timestamp, but the locf/interpolate column would have NULL values?

zinji commented 5 years ago

@svenklemm Yes, the query should still output the applicable time steps, with NULL values, depending on the time bucket used.
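To make the desired semantics concrete (purely an illustration, with a hypothetical limit of two consecutive missing 15-minute buckets, not an existing option): given readings at 12:00, 12:15 and 14:00, the output might look like:

| bucket | interpolate(avg(value)) |
|---|---|
| 12:00 | 1 (real reading) |
| 12:15 | 2 (real reading) |
| 12:30 | NULL (gap longer than the limit) |
| ... | NULL |
| 13:45 | NULL |
| 14:00 | 9 (real reading) |

i.e. the rows themselves are still generated by the gapfill; only the filled values beyond the configured maximum gap become NULL.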

moodgorning commented 4 years ago

Any update on this? This would be a really useful feature to have.

iam4x commented 3 years ago

I'm looking for this as well 🥺