timescale / timescaledb-toolkit

Extension for more hyperfunctions, fully compatible with TimescaleDB and PostgreSQL 📈
https://www.timescale.com

LTTB adapted to irregular sampling #501

Open Mat-pa opened 2 years ago

Mat-pa commented 2 years ago

Is your feature request related to a problem? Please describe. We have very irregularly sampled data; for example, a value can remain valid for hours while the corresponding asset is idle. With LTTB this can produce strange behaviour: instead of the last value before the idle period, LTTB sometimes picks a value slightly earlier, which causes a huge error when integrating over the idle period. Because LTTB only considers the error incurred at existing datapoints, the huge error during the idle period "goes unnoticed" from the algorithm's perspective.
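For reference, a minimal sketch of standard LTTB in Python (illustrative only, not the toolkit's actual Rust implementation). Note how a candidate's score is purely the area of the triangle it forms at the sampled locations, so error that only materialises between samples never enters the ranking:

```python
# Minimal sketch of standard LTTB (Largest-Triangle-Three-Buckets).
# A candidate's score is the area of the triangle it forms with the last
# selected point and the next bucket's average; how long a value remains
# in effect never enters the ranking.

def lttb(points, threshold):
    """points: list of (t, v) sorted by t; threshold: desired output size (>= 3)."""
    n = len(points)
    if threshold >= n:
        return list(points)

    sampled = [points[0]]                    # first point is always kept
    bucket_size = (n - 2) / (threshold - 2)  # interior points per bucket
    a = 0                                    # index of the last selected point

    for i in range(threshold - 2):
        start = int(i * bucket_size) + 1     # current bucket is [start, end)
        end = int((i + 1) * bucket_size) + 1
        nend = min(int((i + 2) * bucket_size) + 1, n)
        # the next bucket's plain average forms the third triangle vertex
        avg_t = sum(p[0] for p in points[end:nend]) / (nend - end)
        avg_v = sum(p[1] for p in points[end:nend]) / (nend - end)

        at, av = points[a]
        best, best_area = start, -1.0
        for j in range(start, end):
            t, v = points[j]
            # twice the area of triangle (last selected, candidate, average)
            area = abs((at - avg_t) * (v - av) - (at - t) * (avg_v - av))
            if area > best_area:
                best, best_area = j, area
        sampled.append(points[best])
        a = best

    sampled.append(points[-1])               # last point is always kept
    return sampled
```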

Describe the solution you'd like It would be nice to have an LTTB-like algorithm that weights the error of each datapoint by how long that datapoint is carried forward. This would probably ensure that the value which is valid during the idle period is actually shown.
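One possible reading of this proposal, sketched under the assumption of LOCF rendering: keep the LTTB bucketing, but scale each candidate's triangle area by its carry-forward duration, so the last point before a long idle period dominates its bucket. How exactly area and duration are blended is an open design choice; a plain product is the simplest option:

```python
# Hypothetical scoring change for the lttb sketch above: weight each
# candidate's triangle area by its carry-forward duration under LOCF.
# Replacing `area` in the inner loop with this score is the only change.

def duration_weighted_area(points, j, area):
    """Triangle area of points[j], scaled by how long its value stays valid."""
    t = points[j][0]
    next_t = points[j + 1][0] if j + 1 < len(points) else t
    # how to blend area and duration (product, sum, exponent...) is open;
    # a plain product lets a flat but long-lived point win its bucket
    return area * (next_t - t)
```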

Describe alternatives you've considered Use LOCF to resample before applying LTTB. This has two drawbacks:
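Setting those drawbacks aside, the workaround itself is straightforward; a minimal sketch assuming numeric timestamps and a fixed grid step (in SQL, time_bucket_gapfill with locf plays this role):

```python
# Sketch of LOCF resampling onto a regular grid before downsampling;
# `step` is the (hypothetical) grid spacing in the same units as t.

def locf_resample(points, step):
    """points: list of (t, v) sorted by t, with numeric timestamps."""
    out, i = [], 0
    t = points[0][0]
    while t <= points[-1][0]:
        # advance to the last observation at or before grid time t
        while i + 1 < len(points) and points[i + 1][0] <= t:
            i += 1
        out.append((t, points[i][1]))
        t += step
    return out

# e.g. lttb(locf_resample(points, step=60.0), threshold=500)
```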

Additional context

(Screenshot PastedGraphic-5: the downsampled series around the idle periods)

The shown time series should go back down to zero during the idle periods (marked in black). However, LTTB sometimes picks a value from the shutdown transient rather than the last value before the idle period (marked in red). As one progressively zooms in on these idle periods, the value suddenly jumps to zero at some point, because LTTB can then pick a larger share of the datapoints.

davidkohn88 commented 2 years ago

So, I spent a bit of time looking into this, and I think there are a couple of things to consider:

I think this is a bit against the spirit of the algorithm: it gives no extra weight to values based on how far apart they are in time, and it deliberately tries to preserve outliers, which, I think, leads to some of the artifacts you're seeing. So in order to do something like this we'd need to modify the algorithm. One simple modification would be to use the time-weighted average rather than the plain average for the bucket calculations... but I doubt that would be enough on its own.
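For concreteness, that tweak would swap the plain mean used for the third triangle vertex for a time-weighted mean. A sketch assuming LOCF weighting (the toolkit's time_weight aggregate also supports linear weighting, which would use the trapezoid rule instead):

```python
# Sketch: LOCF time-weighted mean of a bucket's values, to be used in
# place of the plain average when forming the next bucket's vertex.
# `end_t` closes the last point's interval (e.g. the bucket boundary).

def time_weighted_avg(points, end_t):
    total, weighted = 0.0, 0.0
    for (t, v), (next_t, _) in zip(points, points[1:] + [(end_t, 0.0)]):
        dt = next_t - t          # how long v is in effect under LOCF
        total += dt
        weighted += v * dt
    return weighted / total if total else points[0][1]
```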

Looking at the master's thesis that the LTTB algorithm comes from (https://skemman.is/bitstream/1946/15343/3/SS_MSthesis.pdf), there is a different algorithm that might work better for irregularly sampled data in general, called Largest Triangle Dynamic. We could think about building on that algorithm, potentially combined with the time-weighted average modification above. It does have some other problems and might be rather slow, so it would probably need modifications.
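A very rough sketch of the Largest Triangle Dynamic idea as I read the thesis: start from equal buckets, then repeatedly split the bucket whose points deviate most from a straight line and merge the adjacent pair that deviates least, so volatile regions end up with more buckets; point selection afterwards works as in LTTB, just over these unequal buckets. This glosses over the thesis's details and makes no performance claims:

```python
def bucket_sse(points):
    """Sum of squared residuals from a least-squares line through points."""
    n = len(points)
    if n < 3:
        return 0.0
    mt = sum(t for t, _ in points) / n
    mv = sum(v for _, v in points) / n
    var = sum((t - mt) ** 2 for t, _ in points)
    slope = sum((t - mt) * (v - mv) for t, v in points) / var if var else 0.0
    return sum((v - mv - slope * (t - mt)) ** 2 for t, v in points)

def dynamic_buckets(points, num_buckets, iterations=10):
    """Boundary indices over the interior points (first/last kept apart).
    Assumes len(points) - 2 >= num_buckets."""
    n = len(points)
    bounds = [1 + (i * (n - 2)) // num_buckets for i in range(num_buckets + 1)]
    for _ in range(iterations):
        sse = [bucket_sse(points[bounds[b]:bounds[b + 1]])
               for b in range(num_buckets)]
        hi = max(range(num_buckets), key=sse.__getitem__)
        if bounds[hi + 1] - bounds[hi] < 2:
            break                        # highest-SSE bucket cannot be split
        # merge the cheapest adjacent pair not involving the split bucket
        lo = min((b for b in range(num_buckets - 1) if b != hi and b + 1 != hi),
                 key=lambda b: sse[b] + sse[b + 1], default=None)
        if lo is None:
            break
        split_at = (bounds[hi] + bounds[hi + 1]) // 2
        bounds = sorted(set(bounds) - {bounds[lo + 1]} | {split_at})
    return bounds
```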

Another option would be to look for any gaps greater than the bucket size (or a configurable size threshold) and add dynamic buckets that contain just the points at the edges of each gap. There could be some edge cases, and we might go over the total number of samples we want, but if we're willing to take some license with that, I think it could be a workable option.
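A sketch of that option, reusing the lttb sketch from above and a hypothetical gap_threshold parameter: pin both edge points of every oversized gap, run LTTB over the stretches in between, and accept that the output may exceed the requested count when there are many gaps:

```python
def downsample_with_gaps(points, threshold, gap_threshold):
    """Pin the edge points of every gap wider than gap_threshold, then run
    the lttb sketch above over each contiguous stretch in between."""
    pinned = set()
    for i in range(len(points) - 1):
        if points[i + 1][0] - points[i][0] > gap_threshold:
            pinned.update((i, i + 1))    # both edges of the gap
    cuts = sorted(pinned | {0, len(points) - 1})
    out, budget = [], max(threshold - len(pinned), 2)
    for s, e in zip(cuts, cuts[1:]):
        segment = points[s:e + 1]
        # proportional share of the remaining budget; a crude heuristic
        share = max(3, round(budget * len(segment) / len(points)))
        piece = lttb(segment, share)
        out.extend(piece if not out else piece[1:])  # dedupe shared endpoints
    return out
```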

Another interesting option would be to weight data points by the square they form with the next data point rather than by the triangle... at least as a first pass, or potentially in combination with the LTTB algorithm. That might be a simple enough change, and, looking back, it may be exactly what you're proposing... very interesting.
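Concretely (assuming LOCF rendering, and reading "square" as the step rectangle between consecutive points): the cost of dropping a point is its carry-forward duration times the jump that ends it. A flat point that holds through a long idle period then scores high even though every triangle it forms is thin, which seems to be exactly the behaviour being asked for:

```python
def rectangle_weight(points, i):
    """Area of the LOCF step rectangle between points[i] and its successor:
    (time the value is carried forward) * (size of the jump that ends it)."""
    if i + 1 >= len(points):
        return 0.0
    t, v = points[i]
    next_t, next_v = points[i + 1]
    return (next_t - t) * abs(next_v - v)

# e.g. within each LTTB bucket, pick
#   max(range(start, end), key=lambda j: rectangle_weight(points, j))
# instead of (or blended with) the largest-triangle test
```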

In general, these algorithms need testing on some different data sets. Can you post a plot of the original data compared with the LTTB-downsampled version? And do you have some data sets we could test on? (We can do that privately if you don't want to share them on public GitHub.)

Mat-pa commented 2 years ago

Thank you for taking the time to look into this. We will provide our Customer Success Manager with a file containing the data sets; she can then forward it to you (we don't have your contact details).