miguelteixeiragomes opened 2 years ago
Hi,
Can you offer an example of what you think this query ought to look like and what kind of output you'd expect? And also an example of your workaround? I think I understand what you're after, but making it concrete would be a big help.
Thanks!
My workaround would be something like this:
create table tsdata_raw (
    device_id int8,
    time      timestamptz,
    value     float8
);

select create_distributed_hypertable('tsdata_raw', 'time', 'device_id');

create view tsdata_daily_almost with (timescaledb.continuous)
as select
    device_id,
    time_bucket('1 day', time) as day,
    min(time) as first_ts,
    max(time) as last_ts,
    argmin(time, value) as first_val, -- custom aggregation functions
    argmax(time, value) as last_val,
    average(time_weight('LOCF', time, value)) as value
from tsdata_raw
group by device_id, day;
create view tsdata_daily as select
    device_id,
    day,
    value*(last_ts - first_ts) + last_val*(day + '1 day' - last_ts) + (first_ts - day)*(
        select value
        from tsdata_raw tsr
        where tsr.device_id = tsd.device_id and tsr.time < tsd.first_ts
        order by tsr.time desc
        limit 1
    ) as value
from tsdata_daily_almost tsd;
I've omitted some extract(epoch from ...) and type conversions, but that is the idea. The problem I see with it is that you still have to fetch a lot of pages from the raw data/indices, and that data might even be compressed (I'd like to apply a compression policy to the raw data).
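To make the edge correction in the tsdata_daily view concrete, here is a small numeric sketch in plain Python. All the sample values are made up, times are in hours from the bucket start, and the division by the bucket length stands in for the omitted type conversions:

```python
# Hypothetical numbers for one daily bucket; none of these values are real data.
bucket_len = 24.0
first_ts, last_ts = 2.0, 20.0   # first/last raw sample inside the bucket
avg_inside = 15.0               # average(time_weight(...)) over [first_ts, last_ts]
last_val = 18.0                 # value at last_ts, carried to the bucket end (LOCF)
prev_val = 10.0                 # last raw value *before* the bucket (the extra fetch)

integral = (avg_inside * (last_ts - first_ts)    # observed span
            + last_val * (bucket_len - last_ts)  # tail, carried forward
            + prev_val * (first_ts - 0.0))       # head, carried in from before the bucket
corrected_avg = integral / bucket_len
print(round(corrected_avg, 3))  # → 15.083
```

Without the carry-in term the head of the bucket would contribute nothing, which is exactly the error the extra subquery is meant to fix.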
Ideally - I don't know whether it would be possible - time_weight should have an option to handle this missing data. Alternatively, a different workaround that does not require access to the raw data could also work.
I think this example also clarifies what output I'm looking for (to answer your first question), but if it's not clear, please tell me and I can maybe make some plots with the samples and bins to explain better.
Just a few minutes after posting the previous comment, I realized that a much better workaround would be to apply the same concept of creating a view, but do it entirely on the aggregate data: you can still find the previous value there.
The difference is that you may have to fetch several values until you get non-null data, since some buckets may end up with no data at all. I can write up that workaround later, but I have to test it.
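A rough, untested sketch of what I have in mind for that aggregate-only variant: walk back through earlier buckets of tsdata_daily_almost until a non-null last_val turns up (column names as in the view above).

```sql
-- Untested sketch: find the carry-in value using only the aggregate view,
-- walking back to the most recent earlier bucket that actually has data.
select tsd.device_id, tsd.day, prev.last_val as carry_in_val
from tsdata_daily_almost tsd
left join lateral (
    select p.last_val
    from tsdata_daily_almost p
    where p.device_id = tsd.device_id
      and p.day < tsd.day
      and p.last_val is not null
    order by p.day desc
    limit 1
) prev on true;
```

The lateral subquery touches only the (much smaller) aggregate, so it avoids the raw-data page fetches of the first workaround.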
In any case, I would still welcome some other ideas or feedback on these ones.
We have a proposed workaround that should work for this in #440.
I realize it's been a while, but we'd appreciate it if you could take a look at the issue and let us know whether that solution would have solved this issue.
@miguelteixeiragomes if you haven't noticed, we've just included a solution (see #440) in the latest 1.8.0 release to address the kind of issues that you were experiencing. We'd love to have you try it out and give us feedback 🙏
I'm currently working on a very similar issue and the query structure outlined in https://github.com/timescale/timescaledb-toolkit/issues/548#issuecomment-1284392569 seems to work very well.
Unfortunately, in the data I'm processing there is a special case where `null` values mark times when no data was available. As such, the intervals should be shortened accordingly.
Example: the bucket size is `1h`. Based on LOCF, the value 10 from the previous bucket is carried over. At 0:15 in the current bucket there is a `null` value and at 0:30 the value is 20; the duration with data is now 45 minutes and the interpolated time-weighted average is 16.66.
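For what it's worth, the arithmetic of that example sketched in plain Python, assuming a `null` gap is simply excluded from both the integral and the duration:

```python
# Segments of the 1h bucket after LOCF: (start_min, end_min, value_or_None).
segments = [
    (0, 15, 10.0),   # value 10 carried in from the previous bucket
    (15, 30, None),  # null at 0:15 -> no data until the next sample
    (30, 60, 20.0),  # value 20 from 0:30 to the end of the bucket
]

# Integrate only over segments that carry a value; the null gap drops out
# of both the numerator and the denominator.
acc = sum((end - start) * v for start, end, v in segments if v is not None)
duration = sum(end - start for start, end, v in segments if v is not None)
print(round(acc / duration, 2))  # → 16.67
```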
After some reading of the toolkit source code, it seems to me that this isn't really possible with the current data structures: the `locf` function, when invoked on its own, already has a `treat_null_as_missing` argument, which sounds related. But the `TimeWeightSummary` structure only contains `TSPoint`s, whose value must always be `f64` instead of the `Option<f64>` that would probably be required here.
I have a rough idea of how it could be implemented: extend the `TimeWeightSummary` type to support `Option<f64>` and also add an attribute for the duration during which there was data. This is (in a nutshell) how I implemented it in my own aggregate for testing purposes:

    fn state_transition(&mut self, elem: (i64, Option<f64>)) {
        self.update_state(&elem.0);
        self.last = elem;
    }

    fn update_state(&mut self, ts: &i64) {
        // Only accumulate when the previous point carried a value;
        // a None gap contributes neither to the integral nor the duration.
        if let Some(prev_val) = self.last.1 {
            let interval = duration_secs(&self.last.0, ts);
            self.duration += interval;
            self.acc += prev_val * interval;
        }
    }

Then, when calculating the integral, only `acc` is used, and when calculating the average, I can decide whether to divide by the duration of the whole range or only by the duration for which there was data.
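Fleshed out into a self-contained, runnable sketch (all names such as `NullAwareTimeWeight` are mine and not toolkit API), the accumulator reproduces the 1h-bucket example from the earlier comment:

```rust
// Hypothetical stand-alone version of the null-aware accumulator sketched
// above; none of these names come from the toolkit itself.
struct NullAwareTimeWeight {
    last: (i64, Option<f64>), // previous (timestamp in seconds, optional value)
    acc: f64,                 // integral of value over time with data
    duration: f64,            // seconds during which there was data
}

impl NullAwareTimeWeight {
    fn new(first: (i64, Option<f64>)) -> Self {
        Self { last: first, acc: 0.0, duration: 0.0 }
    }

    fn state_transition(&mut self, elem: (i64, Option<f64>)) {
        // Only accumulate when the previous point carried a value (LOCF);
        // a None gap contributes neither to the integral nor the duration.
        if let Some(prev_val) = self.last.1 {
            let interval = (elem.0 - self.last.0) as f64;
            self.duration += interval;
            self.acc += prev_val * interval;
        }
        self.last = elem;
    }

    fn average(&self) -> f64 {
        self.acc / self.duration
    }
}

fn main() {
    // The 1h example: 10 carried in at 0:00, null at 0:15 (900 s),
    // 20 at 0:30 (1800 s), bucket closed out at 1:00 (3600 s).
    let mut agg = NullAwareTimeWeight::new((0, Some(10.0)));
    agg.state_transition((900, None));
    agg.state_transition((1800, Some(20.0)));
    agg.state_transition((3600, Some(20.0)));
    println!("{:.2}", agg.average()); // → 16.67
}
```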
If possible, though, I'd like to avoid having to install my own aggregates in addition to the TimescaleDB-provided ones. So I wonder whether you'd be interested in discussing an enhancement here?
Maybe as an inspiration to @miguelteixeiragomes :
I'm using a `UNION ALL` construct to pull in the data from before the processed timeframe:

    WITH regular AS (
        SELECT ts, value
        FROM "data"
        WHERE "data"."ts" > '2022-08-19 11:00:00'
          AND "data"."ts" <= '2022-08-20 21:00:00'
    ),
    starting AS (
        SELECT ts, value
        FROM "data"
        WHERE "data"."ts" <= '2022-08-19 11:00:00'
        ORDER BY "data"."ts" DESC
        LIMIT 1
    ),
    relevant_data AS (
        SELECT * FROM starting UNION ALL SELECT * FROM regular
    )
(the rest of the query then follows the same pattern as in #548)
I'm not sure whether null-handling in interpolated aggregates is really the focus of this issue, so let me know if I should open a separate issue or discussion for this.
@milgner, thank you for the great feedback! It would be very helpful if you could open a separate issue for this as it would be an enhancement to what we've already developed.
What type of enhancement is this?
API improvement
What subsystems and features will be improved?
Continuous aggregate
What does the enhancement do?
When you have timeseries (time, value) pairs that need to be seen as a continuous value with LOCF interpolation, an aggregate based on time_weight does not produce the desired result, since it does not fetch the value before the beginning of the bucket. This is especially noticeable when the aggregation interval is relatively close to the larger intervals between samples.
Implementation challenges
The only workaround I can think of is to use continuous aggregates to produce the "nearly correct" result and then create a view on top of it that fetches the necessary data point from before the bucket. However - although I've not tested it - I suspect this will be a bit slow, since for every aggregate point I still have to go through a non-aggregate index and heap to find one more row.
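A tiny numeric illustration of the boundary effect described above (plain Python, made-up samples): with a single sample arriving late in the bucket, ignoring the carry-in value changes the bucket average substantially.

```python
# One 1h bucket (times in minutes); the only in-bucket sample arrives at 0:40.
bucket_len = 60
carry_in = 10.0           # last value before the bucket start (LOCF)
sample_t, sample_v = 40, 30.0

# Without the carry-in, the aggregate only sees [40, 60) at value 30, so the
# bucket average collapses to the sample itself.
naive = sample_v

# With the carry-in, value 10 covers [0, 40) and value 30 covers [40, 60).
corrected = (carry_in * sample_t + sample_v * (bucket_len - sample_t)) / bucket_len
print(f"{naive:.2f} vs {corrected:.2f}")  # → 30.00 vs 16.67
```

The sparser the samples relative to the bucket size, the larger this discrepancy becomes, which matches the observation above.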