Closed DavisVaughan closed 4 years ago
That's excellent! @mdancho84 and I spoke about adding something like this a while back. Would you like to do a PR and add some notes to rolling_origin and maybe this vignette? It might be good to have a good time series data set in the package to use for examples.
Sure I can:
rolling_origin()
rolling_origin(drinks_nested_yearly, initial = 20, assess = 1, cumulative = FALSE)
and explain the benefits of this with irregular time series (even though the drinks data set looks to be regular).
Would the drinks dataset from FRED be a good one to include? I can redownload it and include it as well.
(Random side note, also adding a few ideas to a better R pkg for FRED data, fredr)
@DavisVaughan this nesting trick is super cool. Any thoughts on how to extend this to respect gaps in non-adjacent forecasting horizons? If I want to use things I know this morning to predict tomorrow's closing price, I can't use yesterday's observations, since the response won't be known until after close today.
I was discussing this with Max a bit in #43, and included the code I use to do that now, but per your technique here nest() + rolling_origin() (+ filter() for selective sampling) could do everything but "respect the gap". My only thought is to avoid computing the response pre-sampling, and write a custom panel-data-compatible recipe (step_panel_lead()?) which drops rows when dplyr::lead() comes up NA. Any clues on a simpler approach?
Can you lay out a full example for me? I looked through the code in the vc_resample function but can't seem to piece together why it's useful (I'm sure it is). Beyond needing a full example, a few thoughts are:
If you are standing at 9:30am 2018-07-13, and predicting 4pm 2018-07-14, why can't you use data from yesterday, 2018-07-12? It seems like at 9:30am today you would have yesterday's close.
If you are standing at today, the current vc_resample function returns all the rows past the split date. This is kind of interesting, I definitely don't think rolling_origin allows this right now. Is this purposeful and useful? Are you really predicting that many days out?
First, the questions:
vc_resample isn't doing. It grabs all the after-observations so I can compare test observations across models trained at different times (split_dates); is my model actually learning from recent examples? How much? If last month's model isn't much worse than last week's model at predicting today, that tells me something. If you train model A and B one month apart, but for each only predict the month after training, you don't know how much of that is because of changes in model (controllable) vs. changes in data generation process (not controllable).
Suppose we're only tracking one thing over time, and we've got closing prices for every day this week (e.g. 4pm arrival) as well as some strange feature magic, that we know at 9:30 am each day.
We could have that in a tibble like
library(tidyverse)
tbl <- tibble(
date = c(as.Date("2018-07-13") - 4:0),
close = seq(100, by = 100, length.out = length(date)),
magic = seq_along(close)
)
tbl
#> # A tibble: 5 x 3
#> date close magic
#> <date> <dbl> <int>
#> 1 2018-07-09 100 1
#> 2 2018-07-10 200 2
#> 3 2018-07-11 300 3
#> 4 2018-07-12 400 4
#> 5 2018-07-13 500 5
We want to predict the percentage change in closing price between today's close and tomorrow's, so we compute that, then use rolling_origin to make some time-dependent splits.
tbl <- tbl %>%
mutate(pct_change = (lead(close) - close) / close) %>%
filter(complete.cases(.)) # drops last row only
library(rsample)
#> Loading required package: broom
#>
#> Attaching package: 'rsample'
#> The following object is masked from 'package:tidyr':
#>
#> fill
rset <- rolling_origin(tbl, initial = 1)
rset
#> # Rolling origin forecast resampling
#> # A tibble: 3 x 2
#> splits id
#> <list> <chr>
#> 1 <S3: rsplit> Slice1
#> 2 <S3: rsplit> Slice2
#> 3 <S3: rsplit> Slice3
Let's look at one of them
list(analysis(rset$splits[[1]]), assessment(rset$splits[[1]]))
#> [[1]]
#> # A tibble: 1 x 4
#> date close magic pct_change
#> <date> <dbl> <int> <dbl>
#> 1 2018-07-09 100 1 1
#>
#> [[2]]
#> # A tibble: 1 x 4
#> date close magic pct_change
#> <date> <dbl> <int> <dbl>
#> 1 2018-07-10 200 2 0.5
This violates the information barrier; if you could compute the response for 7/9, that means it's post-close on 7/10, so it's already much later (6.5 hours) than you would have wanted to make the prediction for the 7/10 observation. Or, vice versa, if you want to make a prediction for 7/10, it's that morning, so you don't have the close you'd need (7/10) to compute the response for the 7/9 observation.
Created on 2018-07-13 by the reprex package (v0.2.0).
vc_resample is one way to respect this gap, via custom rsample-ing. Another option would be to instead make a custom recipe where the recipe drops the appropriate training rows. A third option is to not pre-compute the response and have a custom recipe do the pct_change = (lead(close) - close) / close on each side of the split independently and drop NA's, so long as you make the assess arg of rolling_origin one longer, so 2 instead of the default 1 in this case.
Lemme know if that makes sense!
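The third option can be sketched without a recipe at all. This is only an illustration under assumptions: add_response() is a hypothetical helper (it is not part of rsample or recipes), and initial = 2 is chosen so that one training row survives once the last analysis row is dropped.

```r
library(dplyr)
library(rsample)

tbl <- tibble(
  date  = as.Date("2018-07-13") - 4:0,
  close = seq(100, by = 100, length.out = 5),
  magic = 1:5
)

# Hypothetical helper (not an rsample/recipes function): build the response
# using only the rows on one side of a split, then drop the row whose next
# close is not available on that side.
add_response <- function(df) {
  df %>%
    mutate(pct_change = (lead(close) - close) / close) %>%
    filter(!is.na(pct_change))
}

# No response is pre-computed on the full table. assess = 2 so that one
# assessment row is left after its last row is dropped.
rset <- rolling_origin(tbl, initial = 2, assess = 2, cumulative = TRUE)

train <- add_response(analysis(rset$splits[[1]]))
test  <- add_response(assessment(rset$splits[[1]]))

train$date  # last date whose response is known on the training side
test$date   # first date that can be assessed
```

In the first split the training side keeps only the 7/9 row and the assessment side keeps only the 7/11 row, so the 7/10 row, whose response depends on a close that sits on the other side of the split, is used by neither.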
1) Oh I see so you're saying you need today's closing price (gotten at 4pm) as the dependent variable that goes along with yesterday's independent variables. (Answering the question, are yesterday's features predictive of today's closing price?). On the other hand, you could just run today's model at 4:01pm today so you'd have that data point. But I guess if you need to make trading decisions today based on your forecast of tomorrow's price, that won't work. (The exception is if you are not using today's close in your model, see the example below)
2) I think rolling_origin() enforces the fact that your assessment set has to be of the same size for every slice. It would have to change to incorporate all of what you might want to do here (I'm not against the idea, just stating a fact). I think with a suitably long data set you could still do a fixed assessment size that is really long and be able to compare last month's model with last week's.
Is this reasoning below not correct in your mind? Are you using the close price in your model to predict tomorrow's pct change? That would complicate things, otherwise I think this reasoning is sound.
library(tidyverse)
library(rsample)
tbl <- tibble(
date = c(as.Date("2018-07-13") - 4:0),
close = seq(100, by = 100, length.out = length(date)),
magic = seq_along(close)
)
tbl_2 <- tbl %>%
# This is what actually happens tomorrow
mutate(pct_change = (close / lag(close) - 1)) %>%
# We back up what actually happens tomorrow so we can use today's info to predict it
mutate(pct_change_tomorrow = lead(pct_change)) %>%
# Let's remove pct_change now because that's confusing otherwise
select(-pct_change) %>%
slice(-nrow(.)) %>%
# If I'm standing at 9:30am today, I can use `magic` today to predict tomorrow's change.
# I cannot use `close` to predict tomorrow's change because I get that at 4pm,
# but that's irrelevant because it's not in the model; it's just a feature
# that calculates what I'll be predicting.
select(-close)
# At this point I'm not violating the info barrier?
# I just created the response variable using info I'll get at 4pm, but I'm not
# using that 4pm info in my model, so I'm OK.
tbl_2
#> # A tibble: 4 x 3
#> date magic pct_change_tomorrow
#> <date> <int> <dbl>
#> 1 2018-07-09 1 1
#> 2 2018-07-10 2 0.5
#> 3 2018-07-11 3 0.333
#> 4 2018-07-12 4 0.25
ro <- rolling_origin(tbl_2, 1, 1, cumulative = FALSE)
# This should be fine
analysis(ro$splits[[1]])
#> # A tibble: 1 x 3
#> date magic pct_change_tomorrow
#> <date> <int> <dbl>
#> 1 2018-07-09 1 1
assessment(ro$splits[[1]])
#> # A tibble: 1 x 3
#> date magic pct_change_tomorrow
#> <date> <int> <dbl>
#> 1 2018-07-10 2 0.5
In implementation, I'm standing at 2018-07-14 and I've got this trained model so when I get a new magic number at 9:30am I can say, ok this model tells me the prediction for the change in today's close (whatever that will be) to tomorrow's close (whatever that will be) is going to be XXX.
Another variant is that you can use the close from yesterday at 4pm.
Answering the question, are yesterday's features predictive of today's closing price?
That's right, and yes waiting until after 4pm/today's close is too late. The exception isn't quite right: when training, if I need today's close in the same row as today's magic, it doesn't make a difference if it's for a predictor or the outcome; if I need it, I need it. Yes, waiting until after close today is definitely too late. You're correct that needing today's close as a feature makes things worse, but that's on the prediction/assessment only; it would force me to wait until after close today, which is too late.
assess piece, which could be fixed by allowing something like assess = Inf or assess = -1. Making assess bigger isn't quite what I want; that cuts out both later-trained models and overlap across models further apart.
In the example, notice your tbl_2 is identical to my tbl and your first split is identical to mine; if there's a problem in one, there's a problem in both. I suspect your multi-step response construction is where you got confused; it's equivalent to what I did, but thinking of it as the also-equivalent lead(close) / close - 1 is perhaps even more clear.
To train on 7/9, I need 7/10 close to compute the response, so it must be after 4pm 7/10, which is too late to care about a prediction made with 7/10 features. After training through the 7/9 features, which I can do only after close on 7/10, the first prediction I could make in practice would be with the 7/11 features, since it would already be too late to care about the 7/10 prediction.
In your second example, you need to make initial = 2 so you don't start with that incomplete training set, but you're correct that using yesterday's close as a feature is no problem.
Maybe this would be more clear with a longer window: suppose I want to use today's magic to predict the close in two days. We can also simplify the response to just be that future close, without any of the percent-change stuff; for the issues at hand it doesn't matter what our formula is there, what matters is how far away the furthest close we need is.
library(tidyverse)
library(rsample)
#> Loading required package: broom
#>
#> Attaching package: 'rsample'
#> The following object is masked from 'package:tidyr':
#>
#> fill
tbl <- tibble(
date = c(as.Date("2018-07-13") - 4:0),
close = seq(100, by = 100, length.out = length(date)),
magic = seq_along(close)
)
tbl
#> # A tibble: 5 x 3
#> date close magic
#> <date> <dbl> <int>
#> 1 2018-07-09 100 1
#> 2 2018-07-10 200 2
#> 3 2018-07-11 300 3
#> 4 2018-07-12 400 4
#> 5 2018-07-13 500 5
window <- 2
tbl <- tbl %>%
mutate(
lead_date = lead(date, window),
response = lead(close, window)
) %>%
filter(complete.cases(.)) # drops <window> rows from end
tbl
#> # A tibble: 3 x 5
#> date close magic lead_date response
#> <date> <dbl> <int> <date> <dbl>
#> 1 2018-07-09 100 1 2018-07-11 300
#> 2 2018-07-10 200 2 2018-07-12 400
#> 3 2018-07-11 300 3 2018-07-13 500
ro <- rolling_origin(tbl, 1, 1, cumulative = FALSE)
analysis(ro$splits[[1]])
#> # A tibble: 1 x 5
#> date close magic lead_date response
#> <date> <dbl> <int> <date> <dbl>
#> 1 2018-07-09 100 1 2018-07-11 300
assessment(ro$splits[[1]])
#> # A tibble: 1 x 5
#> date close magic lead_date response
#> <date> <dbl> <int> <date> <dbl>
#> 1 2018-07-10 200 2 2018-07-12 400
Created on 2018-07-14 by the reprex package (v0.2.0).
Now I can't train on that analysis row until after close on 7/11, so the first thing I'd want in assessment would have a date of 7/12. Conversely, when I'm ready to make a prediction using magic from the morning of 7/10, the last close I could have trained on would be from 7/9, so the associated magic would be from 7/7 (ignoring weekend), even if training happened before today's magic came in (as it should). Thus two dates can't be trained on.
This is where long-term forecasting gets tricky; if you predict X days out, you also lose X days of training data (or X-1 if you don't have a time gap between features and responses, like waiting till post-close to predict), because you don't have their associated responses yet.
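To make the counting concrete, here is a small base R sketch. It treats every calendar day as a trading day (weekends ignored, as in the example above), and the names horizon, predict_on, trainable and censored are illustrative only.

```r
# One row of features per day; the response for a row dated d is the close
# on d + horizon, which is only known after that day's close.
dates      <- as.Date("2018-07-01") + 0:12
horizon    <- 2
predict_on <- as.Date("2018-07-10")   # features arrive this morning

response_date <- dates + horizon

# Rows whose response is already known on the morning of predict_on
trainable <- dates[response_date < predict_on]

# Past rows whose response is not known yet: these must be censored
censored <- dates[dates < predict_on & response_date >= predict_on]

max(trainable)    # last date that can be trained on
length(censored)  # number of past dates lost to the gap
```

With a two-day horizon the last trainable features are from 7/7, and the two dates in between (7/8 and 7/9) are censored, which matches the count in the text.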
We could make our 9:30/4pm more explicit & store features separate from prices, and would if using tools like flyingfox/zipline to really make sure we're never looking ahead, but the idea here is to avoid getting into all that.
Does that make the need for observation censoring any more clear?
Another thing about my vc_resample that can't be solved with nest + rolling_origin is accounting for different lengths of weeks (if not training each day). If I want to train only on weekends, skip can't handle last week only having 4 trading days instead of 5. That's why I operate on date/lead_date to make my splits; I can control my training intervals according to an irregular calendar. Post-split filtering of the rset object would do it too, but at the cost of generating way more than what I need.
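A rough base R sketch of splitting on date/lead_date follows. It is not the actual vc_resample code; make_split() and the holiday on 2018-07-04 are assumptions made for illustration. Training rows are those whose response date falls before the split date, and test rows are everything dated on or after it.

```r
# Weekdays over three weeks, with 2018-07-04 removed as an assumed holiday,
# so the first week has 4 trading days instead of 5.
days <- seq(as.Date("2018-07-02"), as.Date("2018-07-20"), by = "day")
days <- days[!format(days, "%u") %in% c("6", "7")]
days <- days[days != as.Date("2018-07-04")]

# lead_date is the next trading day, i.e. when the response becomes known
tbl <- data.frame(date = days, lead_date = c(days[-1], NA))
tbl <- tbl[!is.na(tbl$lead_date), ]

# Hypothetical helper: split on a calendar date rather than a row count
make_split <- function(data, split_date) {
  list(
    train = data[data$lead_date < split_date, ],
    test  = data[data$date >= split_date, ]
  )
}

# Train on weekends only (two Saturdays)
split_dates <- as.Date(c("2018-07-07", "2018-07-14"))
splits <- lapply(seq_along(split_dates),
                 function(i) make_split(tbl, split_dates[i]))

sapply(splits, function(s) nrow(s$train))
sapply(splits, function(s) nrow(s$test))
```

Because the split is defined by dates, the short first week needs no special handling, and the Friday row before each split date (whose response only arrives the following Monday) ends up in neither set.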
Conceptually, I think it'd make more sense to only worry about applying the right information barrier to a no-leads-included table with the resampling and then generate the response, lead/lag features, and drop rows in a recipe, but there's so much boilerplate involved it'd be even more code than I have here, and I understand rsample a bit better than recipes right now. Maybe I'll explore this more in the future.
FYI @ClaytonYJ I haven't forgotten about this, just been busy trying to get a new pkg on CRAN with my free time. Hopefully I'll get to it in the next few days
Hi, I have a very similar concern. It is about combining rolling origin forecast resampling and group v-fold cross-validation in rsample. I have asked the question on SO. In fact my example is training and assessing on whole months only, but the aim and the application that I think of is rather general (the group can be some factor, and nevertheless the time series structure should be preserved). I don't have the solution but the following was my example:
## generate some data
library(tidyverse)
library(lubridate)
library(rsample)
my_dates = seq(as.Date("2018/1/1"), as.Date("2018/8/20"), "days")
some_data = data_frame(dates = my_dates)
some_data$values = runif(length(my_dates))
some_data = some_data %>% mutate(month = as.factor(month(dates)))
This gives data of the following form
A tibble: 232 x 3
dates values month
<date> <dbl> <fctr>
1 2018-01-01 0.235 1
2 2018-01-02 0.363 1
3 2018-01-03 0.146 1
4 2018-01-04 0.668 1
5 2018-01-05 0.0995 1
6 2018-01-06 0.163 1
7 2018-01-07 0.0265 1
8 2018-01-08 0.273 1
9 2018-01-09 0.886 1
10 2018-01-10 0.239 1
Then we can e.g. produce samples that take 20 weeks of data and test on the following 5 weeks (the skip parameter skips some extra rows between resamples):
rolling_origin_resamples <- rolling_origin(
some_data,
initial = 7*20,
assess = 7*5,
cumulative = TRUE,
skip = 7
)
We can check the data with the following code and see no overlap:
rolling_origin_resamples$splits[[1]] %>% analysis %>% tail
# A tibble: 6 x 3
dates values month
<date> <dbl> <fctr>
1 2018-05-15 0.678 5
2 2018-05-16 0.00112 5
3 2018-05-17 0.339 5
4 2018-05-18 0.0864 5
5 2018-05-19 0.918 5
6 2018-05-20 0.317 5
### test data of first split:
rolling_origin_resamples$splits[[1]] %>% assessment
# A tibble: 6 x 3
dates values month
<date> <dbl> <fctr>
1 2018-05-21 0.912 5
2 2018-05-22 0.403 5
3 2018-05-23 0.366 5
4 2018-05-24 0.159 5
5 2018-05-25 0.223 5
6 2018-05-26 0.375 5
Alternatively we can split by months:
## sampling by month:
gcv_resamples = group_vfold_cv(some_data, group = "month", v = 5)
gcv_resamples$splits[[1]] %>% analysis %>% select(month) %>% summary
gcv_resamples$splits[[1]] %>% assessment %>% select(month) %>% summary
The solution by an SO user was a partial answer and did not use rsample. First, split the data into a list by month:
df <- split(some_data, some_data$month)
Then lapply along the list elements, defining train and test sets:
df <- lapply(seq_along(df)[-length(df)], function(x){
train <- do.call(rbind, df[1:x])
# use [[ ]] so that test is a data frame rather than a one-element list
test <- df[[x+1]]
return(list(train = train,
test = test))
})
The result df is a list of 7 elements, each containing a train and a test data frame.
@rwarnung if I understand correctly, you want to apply rolling-forward CV at the month level instead of the date level; per @DavisVaughan's solution above, we can achieve that with some nesting:
library(tidyverse)
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following object is masked from 'package:base':
#>
#> date
library(rsample)
#> Loading required package: broom
#>
#> Attaching package: 'rsample'
#> The following object is masked from 'package:tidyr':
#>
#> fill
# same data generation as before
my_dates = seq(as.Date("2018/1/1"), as.Date("2018/8/20"), "days")
some_data = data_frame(dates = my_dates)
some_data$values = runif(length(my_dates))
some_data = some_data %>% mutate(month = as.factor(month(dates)))
# need nest()
library(tidyr)
# nest by month, then resample
rset <- some_data %>%
group_by(month) %>%
nest() %>%
rolling_origin(initial = 1)
# doesn't show which month is which :(
rset
#> # Rolling origin forecast resampling
#> # A tibble: 7 x 2
#> splits id
#> <list> <chr>
#> 1 <S3: rsplit> Slice1
#> 2 <S3: rsplit> Slice2
#> 3 <S3: rsplit> Slice3
#> 4 <S3: rsplit> Slice4
#> 5 <S3: rsplit> Slice5
#> 6 <S3: rsplit> Slice6
#> 7 <S3: rsplit> Slice7
# only January (31 days)
analysis(rset$splits[[1]])$data
#> [[1]]
#> # A tibble: 31 x 2
#> dates values
#> <date> <dbl>
#> 1 2018-01-01 0.179
#> 2 2018-01-02 0.719
#> 3 2018-01-03 0.119
#> 4 2018-01-04 0.889
#> 5 2018-01-05 0.429
#> 6 2018-01-06 0.269
#> 7 2018-01-07 0.600
#> 8 2018-01-08 0.792
#> 9 2018-01-09 0.760
#> 10 2018-01-10 0.804
#> # ... with 21 more rows
# only February (28 days)
assessment(rset$splits[[1]])$data
#> [[1]]
#> # A tibble: 28 x 2
#> dates values
#> <date> <dbl>
#> 1 2018-02-01 0.645
#> 2 2018-02-02 0.233
#> 3 2018-02-03 0.321
#> 4 2018-02-04 0.0927
#> 5 2018-02-05 0.750
#> 6 2018-02-06 0.302
#> 7 2018-02-07 0.861
#> 8 2018-02-08 0.713
#> 9 2018-02-09 0.0454
#> 10 2018-02-10 0.656
#> # ... with 18 more rows
Created on 2018-08-24 by the reprex package (v0.2.0).
This does add some workflow overhead, as you have to "unpack" the splits ($data). It also has the downside of hiding information about what is in each split, but you could add a mutate step to extract some info from each split (e.g. month of assessment set) and add it as a new column.
I should also point out that factor is probably a bad way to store month in this case; I'd suggest either letting it be an integer, or using floor_date so each is the first date of the month. This also makes it easier to follow my suggestion above to pull data into a new column of the rset.
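As a sketch of that suggestion, assuming month is stored with floor_date() and that mutate() on an rset behaves as in the rsample vignettes, the months covered by each split can be pulled into new columns. The column names train_through and assess_month are made up for illustration.

```r
library(dplyr)
library(tidyr)
library(lubridate)
library(rsample)

some_data <- tibble(
  dates = seq(as.Date("2018-01-01"), as.Date("2018-08-20"), by = "day")
) %>%
  mutate(
    values = runif(n()),
    month  = floor_date(dates, "month")  # first day of each month, as a Date
  )

# Nest by month, then resample over the 8 monthly rows
rset <- some_data %>%
  group_by(month) %>%
  nest() %>%
  ungroup() %>%
  rolling_origin(initial = 1)

# Label each split with the last training month and the assessed month.
# do.call(c, ...) keeps the Date class when combining the per-split values.
rset <- rset %>%
  mutate(
    train_through = do.call(c, lapply(splits, function(s) max(analysis(s)$month))),
    assess_month  = do.call(c, lapply(splits, function(s) assessment(s)$month))
  )

rset
```

The daily rows are still reached through the nested data column of analysis() and assessment(), but each slice now shows which month it is assessing.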
(@DavisVaughan the nest thing is all you, feel free to go take his bounty on SO)
Thank you two! (@DavisVaughan and @ClaytonJY) this nesting technique seems to be the solution. I will play some more with it but this is really elegant. Please go ahead to collect the bounty! thank you!
This is not an issue, more of an example idea for rolling_origin().
I've been working closely with the tsibble team on some of their rolling functions, and we worked on rolling over irregular calendar periods. I think this could be a neat addition for rolling_origin() and luckily it's already built in. This could be a neat example for the docs somewhere, and I think the idea overall is pretty powerful, especially for time series modeling work where you want to ensure you are using calendar windows rather than fixed windows of say 5 periods.
Created on 2018-07-10 by the reprex package (v0.2.0).