scicloj / tablecloth.time

Tools for the processing and manipulation of time-series data in Clojure.

Add "resampling" / change interval support #14

Closed ezmiller closed 3 years ago

ezmiller commented 3 years ago

Goal

Add (partial!) support for what Pandas calls "resampling", i.e. changing the frequency or interval of data.

Solution

This PR tries to solve this by adding a function:

adjust-interval: ([dataset index-column-name keys time-converter new-column-name])

This is the lowest-level function. As with others, we will likely ultimately not ask users to supply the name of the index column, but figure that out for them.

Nevertheless, the intended usage of this function is something like:

(-> dataset
    (adjust-interval :idx [:key] converters/->minutes :minutes))

The adjust-interval function is really just a thin layer over a group-by operation. It encapsulates two steps: 1) Using the supplied time-converter to generate a new data column and adding that to the dataset; 2) running a group-by on that new column as well as the columns specified in keys.
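
To make the two steps concrete, here is a minimal sketch on a plain sequence of row maps rather than a real dataset. The name adjust-interval-sketch, the sample rows, and the integer "seconds" index are assumptions made for illustration; this is not the code in the PR.

```clojure
;; Hypothetical stand-in for adjust-interval, operating on a seq of row maps.
;; It mirrors the argument order of the signature above but is not the
;; tablecloth.time implementation.
(defn adjust-interval-sketch
  [rows index-column-name ks time-converter new-column-name]
  (let [group-cols (cons new-column-name ks)]
    (->> rows
         ;; Step 1: derive the bucketed time column from the index column.
         (map (fn [row]
                (assoc row new-column-name
                       (time-converter (get row index-column-name)))))
         ;; Step 2: group on the new column together with the key columns.
         (group-by (fn [row] (mapv #(get row %) group-cols))))))

;; Illustrative data: :idx holds seconds as plain integers.
(def rows
  [{:idx 0  :key :a :value 1}
   {:idx 30 :key :a :value 2}
   {:idx 60 :key :a :value 3}
   {:idx 90 :key :b :value 4}])

;; Bucket each :idx into the start of its minute, then group by bucket and :key.
(def grouped
  (adjust-interval-sketch rows :idx [:key] #(* 60 (quot % 60)) :minute-start))
```

Here grouped is a map whose keys are vectors of the form [bucket key], so the first two rows share the group [0 :a] while the last two land in [60 :a] and [60 :b].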

Most of the "adjustment" here is actually performed by the new functions provided in the tablecloth.time.api.converters namespace. The functions in this namespace that are useful for adjust-interval, such as ->minutes, essentially take a datetime and return a new time reflecting the desired adjustment. We can think of these functions as "bucketing" the times in a column into the newly desired interval. Please take a look at the tests to get a better sense of how they transform time.
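
As a rough illustration of what "bucketing" means, the sketch below truncates a java.time.Instant to the start of its minute. The name minutes-sketch is hypothetical, and this only assumes that ->minutes behaves along these lines; the converters in the PR may handle more datetime types and may differ in detail.

```clojure
(import '(java.time Instant)
        '(java.time.temporal ChronoUnit))

;; Hypothetical analogue of a ->minutes style converter:
;; drop everything below minute precision.
(defn minutes-sketch
  [^Instant t]
  (.truncatedTo t ChronoUnit/MINUTES))

;; Every time within the same minute maps to the same bucket value.
(def bucketed
  (mapv (comp minutes-sketch #(Instant/parse %))
        ["1970-01-01T00:00:59Z"
         "1970-01-01T00:01:00Z"
         "1970-01-01T00:01:30Z"]))
```

The first time falls into the 00:00 bucket and the other two into the 00:01 bucket, which is the value a subsequent group-by would key on.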

One of the more interesting of these functions is ->every, which can be used to achieve more customized intervals. For example:

  (-> ds
      (adjust-interval :idx nil (converters/->every 5 :seconds) :every-5-seconds)
      tablecloth.api/ungroup)
;; => _unnamed [120 5]:
;;    |                 :idx |  :key-a |  :key-b |       :value |     :every-5-seconds |
;;    |----------------------|---------|---------|--------------|----------------------|
;;    | 1970-01-01T00:00:00Z |     FOO |     BAR | 186.91246519 | 1970-01-01T00:00:00Z |
;;    | 1970-01-01T00:00:01Z |     FOO |     BAR |  67.06385235 | 1970-01-01T00:00:00Z |
;;    | 1970-01-01T00:00:02Z |     FOO |     BAR |  99.35787614 | 1970-01-01T00:00:00Z |
;;    | 1970-01-01T00:00:03Z |     FOO |     BAR | 124.48492417 | 1970-01-01T00:00:00Z |
;;    | 1970-01-01T00:00:04Z |     FOO |     BAR | 144.47537739 | 1970-01-01T00:00:00Z |
;;    | 1970-01-01T00:00:05Z |     FOO |     BAR |  70.53844264 | 1970-01-01T00:00:05Z |
;;    | 1970-01-01T00:00:06Z |     FOO |     BAR | 100.57701589 | 1970-01-01T00:00:05Z |
;;    | 1970-01-01T00:00:07Z |     FOO |     BAR |  19.82559998 | 1970-01-01T00:00:05Z |
;;    | 1970-01-01T00:00:08Z |     FOO |     BAR |  19.69080540 | 1970-01-01T00:00:05Z |
;;    | 1970-01-01T00:00:09Z |     FOO |     BAR | 174.02116109 | 1970-01-01T00:00:05Z |
;;    | 1970-01-01T00:00:10Z |     FOO |     BAR | 108.18855562 | 1970-01-01T00:00:10Z |
;;    | 1970-01-01T00:00:11Z |     FOO |     BAR | 144.56180042 | 1970-01-01T00:00:10Z |
;;    | 1970-01-01T00:00:12Z |     FOO |     BAR | 100.19789533 | 1970-01-01T00:00:10Z |
;;    | 1970-01-01T00:00:13Z |     FOO |     BAR |  39.05449905 | 1970-01-01T00:00:10Z |
;;    | 1970-01-01T00:00:14Z |     FOO |     BAR | 188.76227780 | 1970-01-01T00:00:10Z |
;;    | 1970-01-01T00:00:15Z |     FOO |     BAR | 139.85187655 | 1970-01-01T00:00:15Z |
;;    | 1970-01-01T00:00:16Z |     FOO |     BAR |  25.51739398 | 1970-01-01T00:00:15Z |
;;    | 1970-01-01T00:00:17Z |     FOO |     BAR |  72.48937803 | 1970-01-01T00:00:15Z |
;;    | 1970-01-01T00:00:18Z |     FOO |     BAR |  46.10384505 | 1970-01-01T00:00:15Z |
;;    | 1970-01-01T00:00:19Z |     FOO |     BAR |  55.10977891 | 1970-01-01T00:00:15Z |
;;    | 1970-01-01T00:00:20Z |     FOO |     BAR | 170.36117511 | 1970-01-01T00:00:20Z |
;;    | 1970-01-01T00:00:21Z |     FOO |     BAR |  33.90174009 | 1970-01-01T00:00:20Z |
;;    | 1970-01-01T00:00:22Z |     FOO |     BAR | 173.89847616 | 1970-01-01T00:00:20Z |
;;    | 1970-01-01T00:00:23Z |     FOO |     BAR |  30.86616108 | 1970-01-01T00:00:20Z |
;;    | 1970-01-01T00:00:24Z |     FOO |     BAR |  56.88865863 | 1970-01-01T00:00:20Z |
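
One way to picture what (converters/->every 5 :seconds) does to each value is flooring the time to the nearest multiple of five seconds. The sketch below assumes buckets are anchored at the Unix epoch, which is consistent with the output above but is not confirmed by the PR; every-seconds-sketch is a hypothetical name and covers only the :seconds case.

```clojure
(import '(java.time Instant))

;; Hypothetical analogue of (->every n :seconds): returns a function that
;; floors an Instant to the start of its n-second bucket.
;; Assumption: buckets are counted from the Unix epoch.
(defn every-seconds-sketch
  [n]
  (fn [^Instant t]
    (let [secs (.getEpochSecond t)]
      ;; mod floors toward negative infinity, so pre-epoch times also round down.
      (Instant/ofEpochSecond (- secs (mod secs n))))))

(def every-5 (every-seconds-sketch 5))

;; 00:00:07 falls in the bucket that starts at 00:00:05.
(def example (every-5 (Instant/parse "1970-01-01T00:00:07Z")))
```

Returning a function from every-seconds-sketch keeps it usable in the same position as a fixed converter such as ->minutes.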

Work remaining

daslu commented 3 years ago

@ezmiller that all looks wonderful and so elegant to me.

One small thought: I think the function times-series that is used in the tests actually creates something different (a "time index" maybe)?

ezmiller commented 3 years ago

@daslu I updated the name.

cnuernber commented 3 years ago

This looks great to me. Really well designed and carefully coded. There are some auto-detection routines for datatype that rely on converting the first element. All I might add to that is you may want to convert the first non-missing element; what if your first element is a missing/null value?
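
A minimal sketch of that suggestion, assuming a column is a plain sequence and that missing values appear as nil (a real dataset library may track missing values separately). The names first-non-missing and detect-type-sketch are hypothetical and do not refer to functions in the PR.

```clojure
;; Skip leading missing values before inspecting an element.
;; Assumption: missing values are represented as nil.
(defn first-non-missing
  [column]
  (first (remove nil? column)))

;; Detect the type from the first usable element instead of the first element.
;; Returns nil when every value is missing, so callers can handle that case.
(defn detect-type-sketch
  [column]
  (some-> (first-non-missing column) class))

(def detected (detect-type-sketch [nil nil "1970-01-01T00:00:00Z"]))
```

Because remove is lazy, only the leading missing values are scanned before the first usable element is found.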

ezmiller commented 3 years ago

@cnuernber Thanks for making this point. I had not thought of that!