scicloj / tablecloth.time

Tools for the processing and manipulation of time-series data in Clojure.

Handle datasets with non-unique values in index column #13

Closed ezmiller closed 3 years ago

ezmiller commented 3 years ago

Problem

Currently, our index function make-index will build an index on a column that has non-unique values, which doesn't make sense. Here's an illustration:

(def tst
  (tablecloth/dataset {:year [#time/year "2015"
                              #time/year "2015"
                              #time/year "2016"
                              #time/year "2016"]
                       :symbol ["AAPL"
                                "MSFT"
                                "AAPL"
                                "MSFT"]
                       :price [2 3 10 23]}))
;; => #'tablecloth.examples.working-with-time-series-2/tst

(-> tst)
;; => _unnamed [4 3]:

| :year | :symbol | :price |
|-------|---------|--------|
|  2015 |    AAPL |      2 |
|  2015 |    MSFT |      3 |
|  2016 |    AAPL |     10 |
|  2016 |    MSFT |     23 |

(-> tst
    (idx/index-by :year)
    meta)
;; => {:name "_unnamed", :index {#time/year "2015" 1, #time/year "2016" 3}}

(-> tst
    (idx/index-by :year)
    (time/slice "2015" "2015"))
;; => _unnamed [1 3]:

| :year | :symbol | :price |
|-------|---------|--------|
|  2015 |    MSFT |      3 |

We get a slice that does not make sense: the index can only store one row position per key value, so when multiple rows share the same index value, all but one of them are silently dropped.
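A plausible reading of the index metadata above (an illustration only, not tablecloth.time's actual implementation): when a map from key to row position is built naively, each row writes its position under its key, so later duplicates overwrite earlier ones. In Python terms:

```python
# Hypothetical sketch: building a key -> row-position map over
# duplicate keys loses rows, because each assignment overwrites
# the previous position stored under that key.
rows = [("2015", 0), ("2015", 1), ("2016", 2), ("2016", 3)]

index = {}
for key, pos in rows:
    index[key] = pos  # a duplicate key silently replaces the earlier row

print(index)  # {'2015': 1, '2016': 3} -- only the last position per key survives
```

This matches the `{#time/year "2015" 1, #time/year "2016" 3}` metadata we see: only rows 1 and 3 remain reachable through the index.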

We should be able to index a dataset like this.
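At minimum, the indexing function could refuse to build an index when the chosen column has duplicates, or accept the duplicates when the index column plus a key column uniquely identifies rows (as R's tsibble does below). A hedged sketch of that pre-check, expressed in pandas terms for illustration:

```python
import pandas as pd

# Hypothetical uniqueness pre-check before building an index.
# Same shape as the Clojure example: :year alone has duplicates,
# but (:year, :symbol) combinations are distinct.
df = pd.DataFrame({
    "year":   [2015, 2015, 2016, 2016],
    "symbol": ["AAPL", "MSFT", "AAPL", "MSFT"],
    "price":  [2, 3, 10, 23],
})

index_only_ok = not df.duplicated(subset=["year"]).any()
index_plus_key_ok = not df.duplicated(subset=["year", "symbol"]).any()

print(index_only_ok)      # False: :year alone is not a valid index
print(index_plus_key_ok)  # True: :year plus :symbol identifies each row
```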

How Other Tooling Handles This

R

R checks whether the combination of the index and the key is unique:

> df1
        date var1 var2
1 2015-01-01 AAPL    2
2 2015-01-01 MSFT    4
3 2016-01-01 AAPL   10
4 2016-01-01 MSFT   20

> df1 %>% as_tsibble(index = date)
Error: A valid tsibble must have distinct rows identified by key and index.
ℹ Please use `duplicates()` to check the duplicated rows.
Run `rlang::last_error()` to see where the error occurred.

> df1 %>% as_tsibble(index = date, key = var1)
# A tsibble: 4 x 3 [365D]
# Key:       var1 [2]
  date       var1   var2
  <date>     <fct> <dbl>
1 2015-01-01 AAPL      2
2 2016-01-01 AAPL     10
3 2015-01-01 MSFT      4
4 2016-01-01 MSFT     20

So, if you can create a tsibble, it has an index, and you can then group_by_key, adjust the frequency, and summarize:

> df1 %>% 
+   as_tsibble(index = date, key = var1) %>%
+   group_by_key() %>%
+   index_by(year = ~ lubridate::year(.)) %>%
+   summarise(
+     var2 = sum(var2, na.rm = TRUE)
+   )
# A tsibble: 4 x 3 [1Y]
# Key:       var1 [2]
  var1   year  var2
  <fct> <dbl> <dbl>
1 AAPL   2015     6
2 AAPL   2016    10
3 MSFT   2015     9
4 MSFT   2016    20

Python et al

Python behaves differently. Here's code to explore this problem in pandas; you can create an index with duplicate values:

>>> import pandas as pd
>>> import numpy as np
>>> data = {
...     "symbol": ["AAPL", "MSFT", "AAPL", "MSFT"],
...     "price": [10, 20, 5, 3]
... }
>>> data
{'symbol': ['AAPL', 'MSFT', 'AAPL', 'MSFT'], 'price': [10, 20, 5, 3]}
>>> index =  pd.DatetimeIndex(["2015", "2015", "2016", "2016"])
>>> index
DatetimeIndex(['2015-01-01', '2015-01-01', '2016-01-01', '2016-01-01'], dtype='datetime64[ns]', freq=None)
>>> df = pd.DataFrame(data, index=index)
>>> df
           symbol  price
2015-01-01   AAPL     10
2015-01-01   MSFT     20
2016-01-01   AAPL      5
2016-01-01   MSFT      3

When you use the resample method, it performs the requested aggregation but blows away the keys, which seems less than ideal:

>>> df.resample('BA').sum()
            price
2015-12-31     30
2016-12-30      8
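If you want the keys to survive the aggregation (per-symbol yearly sums, analogous to the tsibble example above), the standard pandas pattern is to group by the key column together with the year component of the index; resample alone does not do this for you:

```python
import pandas as pd

# Same data as above: a DatetimeIndex with duplicate timestamps.
data = {"symbol": ["AAPL", "MSFT", "AAPL", "MSFT"],
        "price": [10, 20, 5, 3]}
index = pd.DatetimeIndex(["2015", "2015", "2016", "2016"])
df = pd.DataFrame(data, index=index)

# Group by the key and the year so the aggregation keeps both:
# AAPL/2015 -> 10, AAPL/2016 -> 5, MSFT/2015 -> 20, MSFT/2016 -> 3
yearly = df.groupby(["symbol", df.index.year])["price"].sum()
print(yearly)
```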

Slicing, however, works as you would hope: it pulls every row that falls within the specified slice range, duplicates included:

>>> df.loc["2015":"2015"]
           symbol  price
2015-01-01   AAPL     10
2015-01-01   MSFT     20