influxdata / flux

Flux is a lightweight scripting language for querying databases (like InfluxDB) and working with data. It's part of InfluxDB 1.7 and 2.0, but can be run independently of those.
https://influxdata.com
MIT License
769 stars 153 forks source link

Implementation of the Ramer-Douglas-Peucker Algorithm for reducing data points without losing fidelity #4653

Closed gefaila closed 2 years ago

gefaila commented 2 years ago

Proposal: Provide a function to implement the RDP algorithm found here or the advanced derivatives listed here

Current behavior: The only way to reduce data points are with the standard aggregate functions (min, max, mean, mode, first, last). There are a number of significant problems for reducing timeseries data with these algorithms

Desired behavior:

Alternatives considered: all existing aggregate functions in InfluxDB

Use case: Any industrial IoT use case where data compression is as important as NOT losing key features of the data.

gefaila commented 2 years ago

Note that the RDP algorithm is iterative and so it's unlikely that it can be written in Flux.

nathanielc commented 2 years ago

@gefaila Thanks for this issue, I took a quick look at the algorithm and the request makes a lot of sense. This will be possible to implement in Flux and will be most efficient if implemented in Flux as opposed to doing so in the UI. I'll move this issue to the flux repo and present it to the team.

gefaila commented 2 years ago

Thanks @nathanielc , I'm really excited about this because it's crucial to a whole level of functionality where we integrate our InfluxDB with a dashboard that is very limited on the number of points it can display.

Do you have an ETA, weeks?, months? years? Q2 2022? :-)

nathanielc commented 2 years ago

Most likely months. We plan to start on this in about a month and then I expect it will take a few weeks to actually build. Obviously this timeline is subject to change, but this is the current plan.

nathanielc commented 2 years ago

Another request for something similar using this algorithm has come in https://pkg.go.dev/github.com/dgryski/go-lttb. Maybe we can investigate both? See https://github.com/influxdata/flux/issues/4787

gefaila commented 2 years ago

Another request for something similar using this algorithm has come in https://pkg.go.dev/github.com/dgryski/go-lttb. Maybe we can investigate both? See #4787

I think customers would need both because they excel in slightly different ways:

Clearly, algorithms that can see all the data before compressing are going to have some advantages over algorithms that only know current and historical data. Our choice was RDP. Swinging door wouldn't do as good a job on retaining the actual features in the data we value.

nathanielc commented 2 years ago

The RDP algorithm has been implemented see https://github.com/influxdata/flux/pull/5078 Closing this issue, the Largest Triangle Three Buckets algo issue will remain open and we can add it alongside RDP once its done.

nathanielc commented 2 years ago

Link to RDP docs https://docs.influxdata.com/flux/v0.x/stdlib/experimental/polyline/rdp/

gefaila commented 2 years ago

This is great ! I have been looking forward to having this for a while! Excellent. Thanks for taking a look and implementing it!

nathanielc commented 2 years ago

@gefaila Glad to hear it, let us know how it works out for you. I have been using it personally to great success as well.