Closed: ezmiller closed this issue 3 years ago.
Problem

Currently, our index function `make-index` will build an index on a column that has non-unique values, and this doesn't make sense. Here's an illustration:

```clojure
(def tst (tablecloth/dataset {:year [#time/year "2015" #time/year "2015" #time/year "2016" #time/year "2016"]
                              :symbol ["AAPL" "MSFT" "AAPL" "MSFT"]
                              :price [2 3 10 23]}))
;; => #'tablecloth.examples.working-with-time-series-2/tst

(-> tst)
;; => _unnamed [4 3]:
;;    | :year | :symbol | :price |
;;    |-------|---------|--------|
;;    |  2015 |    AAPL |      2 |
;;    |  2015 |    MSFT |      3 |
;;    |  2016 |    AAPL |     10 |
;;    |  2016 |    MSFT |     23 |

(-> tst (idx/index-by :year) meta)
;; => {:name "_unnamed", :index {#time/year "2015" 1, #time/year "2016" 3}}

(-> tst (idx/index-by :year) (time/slice "2015" "2015"))
;; => _unnamed [1 3]:
;;    | :year | :symbol | :price |
;;    |-------|---------|--------|
;;    |  2015 |    MSFT |      3 |
```
We get a slice that does not make sense, because the index cannot meaningfully handle multiple rows that share the same index-key value. We should be able to index a dataset like this.
How Other Tooling Handles This

R

R checks whether the combination of the index and the keys is unique:
```r
> df1
        date var1 var2
1 2015-01-01 AAPL    2
2 2015-01-01 MSFT    4
3 2016-01-01 AAPL   10
4 2016-01-01 MSFT   20
> df1 %>% as_tsibble(index = date)
Error: A valid tsibble must have distinct rows identified by key and index.
ℹ Please use `duplicates()` to check the duplicated rows.
Run `rlang::last_error()` to see where the error occurred.
> df1 %>% as_tsibble(index = date, key = var1)
# A tsibble: 4 x 3 [365D]
# Key:       var1 [2]
  date       var1   var2
  <date>     <fct> <dbl>
1 2015-01-01 AAPL      2
2 2016-01-01 AAPL     10
3 2015-01-01 MSFT      4
4 2016-01-01 MSFT     20
```
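The distinctness rule tsibble enforces could be sketched as a standalone check. This is a hypothetical helper (`check_distinct` is not part of tablecloth, tsibble, or pandas), written in Python for illustration: rows must be uniquely identified by the index column, optionally combined with a key column.

```python
import pandas as pd

def check_distinct(df, index, key=None):
    """Raise if rows are not uniquely identified by the index column
    (plus an optional key column), mirroring tsibble's validity rule."""
    cols = [index] + ([key] if key else [])
    if df.duplicated(subset=cols).any():
        raise ValueError(
            f"A valid time-series table must have distinct rows identified by {cols}."
        )
    return df

df1 = pd.DataFrame({
    "date": pd.to_datetime(["2015-01-01", "2015-01-01", "2016-01-01", "2016-01-01"]),
    "var1": ["AAPL", "MSFT", "AAPL", "MSFT"],
    "var2": [2, 4, 10, 20],
})

# The index alone is not distinct -> rejected, like tsibble's error above.
try:
    check_distinct(df1, index="date")
except ValueError as e:
    print(e)

# Index plus key is distinct -> accepted.
check_distinct(df1, index="date", key="var1")
```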
So... if you can create a tsibble, it will have an index, and you can then `group_by_key`, adjust the frequency, and summarize:
```r
> df1 %>%
+   as_tsibble(index = date, key = var1) %>%
+   group_by_key() %>%
+   index_by(year = ~ lubridate::year(.)) %>%
+   summarise(
+     var2 = sum(var2, na.rm = TRUE)
+   )
# A tsibble: 4 x 3 [1Y]
# Key:       var1 [2]
  var1   year  var2
  <fct> <dbl> <dbl>
1 AAPL   2015     6
2 AAPL   2016    10
3 MSFT   2015     9
4 MSFT   2016    20
```
Python et al

Here's code to explore this problem in Python. Python behaves differently: you can add an index with duplicate values:
```python
>>> import pandas as pd
>>> import numpy as np
>>> data = {
...     "symbol": ["AAPL", "MSFT", "AAPL", "MSFT"],
...     "price": [10, 20, 5, 3]
... }
>>> data
{'symbol': ['AAPL', 'MSFT', 'AAPL', 'MSFT'], 'price': [10, 20, 5, 3]}
>>> index = pd.DatetimeIndex(["2015", "2015", "2016", "2016"])
>>> index
DatetimeIndex(['2015-01-01', '2015-01-01', '2016-01-01', '2016-01-01'], dtype='datetime64[ns]', freq=None)
>>> df = pd.DataFrame(data, index=index)
>>> df
           symbol  price
2015-01-01   AAPL     10
2015-01-01   MSFT     20
2016-01-01   AAPL      5
2016-01-01   MSFT      3
```
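Although pandas permits the duplicates, it does expose ways to detect them: `Index.is_unique` and `Index.duplicated()` are both part of the public pandas API, so a caller who wants tsibble-style strictness can check before indexing:

```python
import pandas as pd

index = pd.DatetimeIndex(["2015", "2015", "2016", "2016"])

print(index.is_unique)     # False: duplicate timestamps are allowed but detectable
print(index.duplicated())  # marks the second occurrence of each repeated timestamp
```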
When you use the `resample` method, it performs the aggregation you asked for but blows away the keys, which seems less than ideal:
```python
>>> df.resample('BA').sum()
            price
2015-12-31     30
2016-12-30      8
```
The slicing action, however, works as you would hope: it pulls the items in the specified slice range for both rows:
```python
>>> df.loc["2015":"2015"]
           symbol  price
2015-01-01   AAPL     10
2015-01-01   MSFT     20
```
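For what it's worth, pandas can produce a key-preserving annual aggregation, similar to tsibble's, by grouping on the key column together with the year of the index. This is ordinary `groupby` rather than `resample`, so it sidesteps the key-dropping behavior shown above (a sketch, using the same `df` as in the session above):

```python
import pandas as pd

df = pd.DataFrame(
    {"symbol": ["AAPL", "MSFT", "AAPL", "MSFT"], "price": [10, 20, 5, 3]},
    index=pd.DatetimeIndex(["2015", "2015", "2016", "2016"]),
)

# Group by the key column and the year extracted from the index;
# the key survives as a level of the resulting MultiIndex.
annual = df.groupby(["symbol", df.index.year]).sum()
print(annual)
```

Each (symbol, year) pair keeps its own row, e.g. `annual.loc[("AAPL", 2015), "price"]` is 10.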