timescale / timescaledb

An open-source time-series SQL database optimized for fast ingest and complex queries. Packaged as a PostgreSQL extension.
https://www.timescale.com/

add an option for interpolate to ignore NULL values #1453

Open suiluj opened 5 years ago

suiluj commented 5 years ago

Hello, I am trying to use the time_bucket_gapfill function, but it does not work as expected.

I created a query like the one described in this blog article: https://blog.timescale.com/blog/sql-functions-for-time-series-analysis/

This is my query:

SELECT
  time_bucket_gapfill(
    '1 seconds',
    time,
    start => '2019-01-02 10:00:00+00',
    finish => '2019-01-02 16:00:00+00'
  ) AS time_1s,
  -- instead of avg(val)
  vin,
  avg(rtd_rtd) as rtd_rtd_no_interpolation,
  interpolate(avg(rtd_rtd)) as rtd_rtd_interpolate
FROM
  signals_continuous_numeric
WHERE
  time > '2019-01-02 10:00:00+00'
  AND time < '2019-01-02 16:00:00+00'
  AND vin = 'xxx'
GROUP BY time_1s, vin
ORDER BY time_1s, vin;

The column rtd_rtd_interpolate still contains NULL values:

time_1s                | vin | rtd_rtd_no_interpolation | rtd_rtd_interpolate
2019-01-02 10:00:00+00 | xxx |                          |
2019-01-02 10:00:01+00 | xxx | 0.18751000000000000000   | 0.18751
2019-01-02 10:00:02+00 | xxx |                          |
2019-01-02 10:00:03+00 | xxx | 0.18753000000000000000   | 0.18753
2019-01-02 10:00:04+00 | xxx |                          |
2019-01-02 10:00:05+00 | xxx | 0.20313000000000000000   | 0.20313

I thought I could interpolate the NULL values with time_bucket_gapfill and interpolate(avg(<column_to_interpolate>)). Perhaps I am using it wrong?

svenklemm commented 5 years ago

Gapfilling is not designed for interpolating NULL values but for actual missing data; if your data contains NULL values, interpolate will not fill them. There is an option to treat NULLs as missing values for locf, but not for interpolate. As a workaround you can add a WHERE clause that filters out the NULL values.
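For example, a minimal sketch of that workaround against the query above (same table and column names; adjust as needed):

SELECT
  time_bucket_gapfill(
    '1 seconds',
    time,
    start => '2019-01-02 10:00:00+00',
    finish => '2019-01-02 16:00:00+00'
  ) AS time_1s,
  vin,
  -- with the NULL readings filtered out below, the buckets they fell into
  -- are now truly empty, so interpolate() fills them
  interpolate(avg(rtd_rtd)) AS rtd_rtd_interpolate
FROM
  signals_continuous_numeric
WHERE
  time > '2019-01-02 10:00:00+00'
  AND time < '2019-01-02 16:00:00+00'
  AND vin = 'xxx'
  AND rtd_rtd IS NOT NULL  -- the workaround: drop NULL readings before bucketing
GROUP BY time_1s, vin
ORDER BY time_1s, vin;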

suiluj commented 5 years ago

@svenklemm Ah, thanks Sven. Oh, by the way, I am in Dresden, too. :)

I added the condition AND rtd_rtd IS NOT NULL and now it interpolates this column.

But the main reason for interpolation was to match lower-frequency measurements with the other columns of this table, which have a value every second. With that filter, the higher-frequency values from those other columns get lost, of course.

Are you planning to support interpolate() on NULL values too, or is there another way to get this result? time_bucket_gapfill in combination with interpolate() would be a really elegant solution for this kind of query. I would really like to query multiple columns at a specific frequency (using a continuous aggregate) and then interpolate all columns that contain NULL values because their measurement frequency is too low.

It would be great, because otherwise you have to implement this feature in the application layer.
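One possible workaround today, sketched as an assumption rather than an official recommendation (and subject to your TimescaleDB version's restrictions on where time_bucket_gapfill may appear in subqueries): gapfill each low-frequency column in its own subquery with the NULL filter, then join the results back together on the bucket. Here rtd_rtd stands in for the low-frequency column and some_fast_col is a hypothetical high-frequency column:

SELECT
  fast.time_1s,
  fast.vin,
  fast.some_fast_col,
  slow.rtd_rtd_interpolate
FROM (
  -- high-frequency column: one value per second, no interpolation needed
  SELECT
    time_bucket_gapfill('1 seconds', time,
      start => '2019-01-02 10:00:00+00',
      finish => '2019-01-02 16:00:00+00') AS time_1s,
    vin,
    avg(some_fast_col) AS some_fast_col
  FROM signals_continuous_numeric
  WHERE time > '2019-01-02 10:00:00+00'
    AND time < '2019-01-02 16:00:00+00'
    AND vin = 'xxx'
  GROUP BY time_1s, vin
) fast
JOIN (
  -- low-frequency column: filter its NULLs so interpolate() can fill the gaps
  SELECT
    time_bucket_gapfill('1 seconds', time,
      start => '2019-01-02 10:00:00+00',
      finish => '2019-01-02 16:00:00+00') AS time_1s,
    vin,
    interpolate(avg(rtd_rtd)) AS rtd_rtd_interpolate
  FROM signals_continuous_numeric
  WHERE time > '2019-01-02 10:00:00+00'
    AND time < '2019-01-02 16:00:00+00'
    AND vin = 'xxx'
    AND rtd_rtd IS NOT NULL
  GROUP BY time_1s, vin
) slow ON slow.time_1s = fast.time_1s AND slow.vin = fast.vin
ORDER BY fast.time_1s;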

FourSeventy commented 5 years ago

I have a similar use case to @suiluj and it would also help me to be able to interpolate null values. Thank you.

saasrack commented 3 years ago

:+1: would also like to see treat_null_as_missing on interpolation

goodkiller commented 1 year ago

+1, please implement it :)

maxhertrampf commented 1 year ago

We have a similar use case and would really appreciate it if interpolate() supported treat_null_as_missing like locf() does.
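For reference, locf() already accepts this flag today, while the interpolate() call below only shows the requested (hypothetical, not yet implemented) equivalent:

-- exists today: NULL readings are treated like missing rows
locf(avg(rtd_rtd), treat_null_as_missing => true)

-- requested (hypothetical): the same option for interpolate
interpolate(avg(rtd_rtd), treat_null_as_missing => true)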

moodgorning commented 3 months ago

+1 from my side. I just wasted a whole bunch of hours because interpolate() behaves very unexpectedly if you are not aware of this limitation.

anerg commented 2 months ago

These functions are a joke (minimum effort is the best you can say about them). Why would you not have treat_null_as_missing for interpolate? The WHERE-clause workaround only works for one column.

Also, why is there no cutoff for interpolate and locf, so that values stop being carried forward when no actual values are available over a specified interval?

Why is there no look-back at the beginning of the range to find the previous value?

They are basically only enough for the most trivial cases; beyond that, you immediately need to rewrite your code and do the data transformation after fetching the data. So annoying.