timescale / timescaledb

An open-source time-series SQL database optimized for fast ingest and complex queries. Packaged as a PostgreSQL extension.
https://www.timescale.com/
Other
17.83k stars 885 forks source link

Gap filling not working if no samples within range #2578

Open shaunsales opened 4 years ago

shaunsales commented 4 years ago

Hi,

We are big Timescale DB fans and have just migrated away from Redis Time Series to TSDB. However for some of our financial time series data, we're facing some issues displaying stocks that have zero daily trade volume.

For example, we want to pull today's prices for a stock with hourly time buckets. However that stock hasn't had any trades today, so we want to get the last price using locf and time_bucket_gapfill all buckets from start to end timestamp - basically just giving us a flat line of identical prices based on the previous close.

The problem we are having, is when there are no trades, time_bucket_gapfill and locf don't seem to work, and no time series data is returned. However if there is just one trade during the range, it works fine. Maybe we are missing something, but this has us scratching our heads. We're hoping for an elegant TSDB native solution rather than having to hack it in our code.

Below is the query we're using:

SELECT
time_bucket_gapfill(INTERVAL '1 hour', timestamp)  as bucket,
symbolid,
locf(LAST(price, timestamp), 
     (COALESCE((SELECT price FROM trades t2
      WHERE t2.symbolid = '01739423-0551-481a-9d53-d5db9b485ce6' AND t2.timestamp < '2020-01-01'
      ORDER BY timestamp DESC LIMIT 1), 0))) AS close
FROM trades
WHERE symbolid = '01739423-0551-481a-9d53-d5db9b485ce6'
AND timestamp >= '2020-01-01'
AND timestamp < '2020-01-02'
GROUP BY bucket, symbolid;

Any help would be much appreciated!

mfreed commented 4 years ago

Hi @shaunsales It sounds like you want the "prev" parameter in locf to "look back" out of the range:

https://docs.timescale.com/latest/api#locf

But from your second argument, it sounds like you are doing that? Do you see any different issues if you add the explicit start and finish argument in the gapfill function, and eliminate them from your WHERE clause?

shaunsales commented 4 years ago

Thanks @mfreed that actually got us most of the way there - oddly the TSDB docs recommend against this and suggest to only use the WHERE clause, and omit the gapfill params.

One other thing - when we use the method you suggested, we have an additional entry at the beginning of the data set that is the last entry prior to the specified range.

For example, if we ask for hourly samples from 2020-01-02 to 2020-01-03 we get 25 entries rather than 24. The first entry is the latest entry outside the range - i.e. 2020-01-01 01:00:00 and the remaining 24 entries are correct. Any idea what might be causing this?

mfreed commented 4 years ago

We don't typically recommend this because it's actually less efficient, as it can't do chunk exclusion at planning time...because you don't known a priori how "far back" you need to search to find the value preceding to the LOCF. So you can't exclude chunks "before" the WHERE clause during planning...because you might need to "look back" into these chunks.

So if you don't need to use the "look back", it's not recommended. But you are right that we should clarify the docs here.

Regarding your new issue, can you repost the specific query & output you are getting?

capogithub commented 4 years ago

Hi, I have the same issue, the following query return nothing.

SELECT id_time_series, 
       Time_bucket_gapfill('1 hour', MO.occurred_on) AS time_bucket, 
       Locf(First(MO.value, MO.occurred_on), (SELECT value 
                                              FROM   measurement_observation 
                                              WHERE  id_time_series = 5911 AND 
                                                     occurred_on < '2020-10-29' 
                                              ORDER  BY occurred_on DESC 
                                              LIMIT  1)) AS time_bucket_first 
FROM   measurement_observation MO 
WHERE  occurred_on >= '2020-10-29' AND 
       occurred_on <= '2020-10-30' AND 
       id_time_series = 5911 
GROUP  BY time_bucket, id_time_series 
ORDER  BY time_bucket 

But if i remove id_time_series from the SELECT and from the GROUP BY, the query return rows normally with the expected behaviour from locf second parameter

This query work as expected :

SELECT Time_bucket_gapfill('1 hour', MO.occurred_on) AS time_bucket, 
       Locf(First(MO.value, MO.occurred_on), (SELECT value 
                                              FROM   measurement_observation 
                                              WHERE  id_time_series = 5911 AND 
                                                     occurred_on < '2020-10-29' 
                                              ORDER  BY occurred_on DESC 
                                              LIMIT  1)) AS time_bucket_first 
FROM   measurement_observation MO 
WHERE  occurred_on >= '2020-10-29' AND 
       occurred_on <= '2020-10-30' AND 
       id_time_series = 5911 
GROUP  BY time_bucket
ORDER  BY time_bucket 

Any help would be much appreciated!

thoby1 commented 3 years ago

Hi, I have the same issue, the following query return nothing.

SELECT id_time_series, 
       Time_bucket_gapfill('1 hour', MO.occurred_on) AS time_bucket, 
       Locf(First(MO.value, MO.occurred_on), (SELECT value 
                                              FROM   measurement_observation 
                                              WHERE  id_time_series = 5911 AND 
                                                     occurred_on < '2020-10-29' 
                                              ORDER  BY occurred_on DESC 
                                              LIMIT  1)) AS time_bucket_first 
FROM   measurement_observation MO 
WHERE  occurred_on >= '2020-10-29' AND 
       occurred_on <= '2020-10-30' AND 
       id_time_series = 5911 
GROUP  BY time_bucket, id_time_series 
ORDER  BY time_bucket 

But if i remove id_time_series from the SELECT and from the GROUP BY, the query return rows normally with the expected behaviour from locf second parameter

This query work as expected :

SELECT Time_bucket_gapfill('1 hour', MO.occurred_on) AS time_bucket, 
       Locf(First(MO.value, MO.occurred_on), (SELECT value 
                                              FROM   measurement_observation 
                                              WHERE  id_time_series = 5911 AND 
                                                     occurred_on < '2020-10-29' 
                                              ORDER  BY occurred_on DESC 
                                              LIMIT  1)) AS time_bucket_first 
FROM   measurement_observation MO 
WHERE  occurred_on >= '2020-10-29' AND 
       occurred_on <= '2020-10-30' AND 
       id_time_series = 5911 
GROUP  BY time_bucket
ORDER  BY time_bucket 

Any help would be much appreciated!

Is there any traction on this issue? It's about a year now since this comment was posted. I'm experience the same issue with version 2.4.2 of TimescaleDB.

genedavis commented 2 years ago

Hi don't know if there's been more updates on this - I will just add that it would be worthwhile for time_bucket_gapfill to do its generation of time buckets without ANY data passed to it (other than a desired time range?), otherwise (as far as I can tell?) you have to use Postgres generate_series logic instead, which doesn't have the same "rounding" logic as time_bucket(_gapfill) has. For example, if I want 5-minute buckets, time_bucket (and _gapfill) will round to the hour, but generate_series does not do that automatically (at least without extra work).