timescale / timescaledb

An open-source time-series SQL database optimized for fast ingest and complex queries. Packaged as a PostgreSQL extension.
https://www.timescale.com/
Other
16.89k stars 853 forks source link

[Feature]: Simplified pivoting without crosstab #6584

Open wz2b opened 5 months ago

wz2b commented 5 months ago

What problem does the new feature solve?

In IoT/IIoT space, it is often important to have the data storage be dynamic to the addition of new devices and sensors. For this reason, it's often preferable to use a narrow time series table structure like:

| time | parameterId | value |

but for many data analytics it's desirable to pivot that table into a wide format. When you do that, you will generally do some kind of time bucket based resampling often with a gapfill() and different interpolation strategies depending on the signal. For example, in the same query I might want the max(temp) and avg(power).

Doing this with crosstab() is not very friendly. The biggest problem is you have to be explicit about the resulting schema. In practice, this means that when you add one parameter you have to touch your query in at least one, usually two places.

I would like to propose adding some kind of pivot() function to TimescaleDB to make this easier and more dynamic.

I have two main reasons for wanting this:

I have done a ton of this kind of work in flux, where I found select |> group |> interpolate |> ungroup |> pivot to be a lot less verbose, easier to read (if you don't know flux), and totally dynamic (you can add categories without having to touch your query - you just insert new data and it shows up as a new column in the pivoted result).

What does the feature do?

What I have in mind is something that doesn't require you to provide the output schema in advance. One example of a way this could perhaps be implemented is this (don't take this query literally, I mean it to be notional)

SELECT pivot(
  $$
     SELECT time_bucket_gapfill( whatever ) as t,
            parameterId,  -- think of this like a 'category' in more traditional sql queries
            locf(last(value))
     FROM
            someTable
     WHERE
          time > start AND time <= end
          AND (parameterId == 1 OR parameterId == 3 OR parameterid == 4) -- OPTIONAL
     GROUP BY
             parameterId
$$)

In some cases the inner query might be a little more complicated, especially when you want to use different aggregation functions depending on the signal. I'm not entirely sure how I would do that, but I'm open to suggestions. It happens often enough that I'm not keen on the idea of unioning or joining. The point really is to make it easier to turn narrow tsdb tables into a wide format without having to describe the output schema statically, allowing for different interpolations (gap filling) on a category-by-category basis,

Postgres's crosstab() is a pain point. The main goal here is to make this kind of boilerplate data wrangling easier, because it's something I have to do pretty frequently.

Implementation challenges

I think the hardest part of this would be to allow different interpolations for each category. I also don't know if there are limitations that preclude dynamic record generation out of Postgres.

Mat-pa commented 5 months ago

+1

mkindahl commented 5 months ago

+1

Suggest you also upvote the proposal since this is something that the Eon Bot actually captures.

moritz93 commented 2 weeks ago

+1