Closed PathosEthosLogos closed 3 years ago
Could you please turn this into a self-contained reprex (short for minimal reproducible example)? It will help us help you if we can be sure we're all working with/looking at the same stuff.
If you've never heard of a reprex before, you might want to start by reading the tidyverse.org help page.
You can install reprex by running (you may already have it, though, if you have the tidyverse package installed):
install.packages("reprex")
Notably, the above example is not reproducible for us because you call sample() before setting the seed, so we can't generate the exact df column you have here. Below is a reproducible example that actually runs fine; the line you pointed out doesn't error with this seed. If you want to try changing the seed to see whether you can make it error, that might work.
My guess is that sliding_period() may have generated an assessment set with 0 rows in it (this would depend on the seed), and that is causing an issue somewhere along the way. I don't have any proof, though. That wouldn't be an rsample bug, since an empty assessment set is perfectly valid.
library(tidyverse)
library(rsample)
library(lubridate)
library(tidymodels)
library(stacks)
library(baguette)

set.seed(123)

df <- data.frame(y = sample(5000000:120000000, 1000, replace = TRUE),
                 yearr = sample(2015:2021, 1000, replace = TRUE),
                 monthh = sample(1:12, 1000, replace = TRUE),
                 dayy = sample(1:29, 1000, replace = TRUE)) |>
  mutate(weekk = week(ymd(paste(yearr, monthh, dayy))),
         datee = ymd(paste(yearr, monthh, dayy))) |>
  filter(!is.na(datee)) |>
  arrange(-desc(datee))

folds = df |>
  sliding_period(lookback = Inf, # if Inf, then it's chain
                 #assess_stop = 4,
                 index = datee,
                 period = 'week',
                 every = sqrt(sqrt(nrow(df))) + 1)

rec_df =
  recipe(y ~ ., data = df)

metric = metric_set(rmse) # mae, or accuracy and roc_auc for classifications

grid_ctrl = control_stack_grid()
res_ctrl = control_stack_resamples()

### Bag Mars
spec_bag_mars = bag_mars() |>
  set_mode('regression') |>
  set_engine('earth')

rec_bag_mars = rec_df |> # Recipe
  #step_dummy(all_nominal()) |>
  step_zv(all_numeric(), all_outcomes()) |>
  step_normalize(all_numeric(), -all_outcomes())
  #step_zv(all_predictors(), skip = TRUE) |>
  #step_normalize(all_numeric(), skip = TRUE)

wf_bag_mars = workflow() |>
  add_model(spec_bag_mars) |>
  add_recipe(rec_bag_mars)

resample_bag_mars = fit_resamples(wf_bag_mars,
                                  resamples = folds,
                                  metrics = metric,
                                  control = res_ctrl)
###

### Random forest
spec_random_forest = rand_forest(mtry = 3, min_n = 4) |>
  set_engine('randomForest') |>
  set_mode('regression')

rec_random_forest = rec_df |> # Recipe
  step_dummy(all_nominal()) |>
  step_zv(all_numeric(), all_outcomes()) |>
  step_normalize(all_numeric(), -all_outcomes())

wf_random_forest = workflow() |>
  add_model(spec_random_forest) |>
  add_recipe(rec_random_forest)

resample_random_forest = fit_resamples(wf_random_forest,
                                       resamples = folds,
                                       metrics = metric,
                                       control = res_ctrl)

stack_enet = stacks() |>
  add_candidates(resample_bag_mars) |>
  add_candidates(resample_random_forest)

model_stack = stack_enet |>
  blend_predictions() # Check weights

autoplot(model_stack)
autoplot(model_stack, type = 'members')
autoplot(model_stack, type = 'weights')

model_stack = model_stack |>
  fit_members()

model_stack |>
  collect_parameters('resample_bag_mars')
#model_stack |>
#  collect_parameters('resample_prophet_boost')
model_stack |>
  collect_parameters('resample_random_forest')

df_pred =
  df |>
  bind_cols(predict(model_stack, df)) |>
  #df_test |>
  #bind_cols(predict(model_stack, df_test)) |>
  rename(y_hat = .pred)

df_pred |>
  ggplot() +
  geom_line(aes(y = y_hat,
                x = ymd(datee)),
            colour = 'red',
            size = 1.5) +
  geom_line(aes(y = y,
                x = ymd(datee))) +
  ggtitle('First model stacking attempt')

df_pred |>
  ggplot() +
  geom_point(aes(x = y,
                 y = y_hat)) +
  coord_obs_pred()

# Performance
member_preds =
  df |>
  select(y) |>
  bind_cols(predict(model_stack, df, members = TRUE))
Confirmed that this works! Thanks
I find it strange that setting the seed before or after df was generated matters. I feel like I understand the gist of why, but maybe not completely.
Sorry about the reprex situation. I previously tried to use it for another issue posted on GitHub and tried again today, but it didn't work for me. Sorry that I didn't include the libraries used for this example.
Case closed
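On the seed question raised above, a minimal base R sketch may help. It is not part of the original example; the variable names are made up for illustration. It shows that set.seed() only affects random draws made after it is called, so sampling first and seeding afterwards leaves the sampled data dependent on whatever state the session was in.

```r
# Seed set BEFORE each draw: both draws start from the same RNG state,
# so they return the same values.
set.seed(123)
first_draw <- sample(1:100, 5)

set.seed(123)
second_draw <- sample(1:100, 5)

identical(first_draw, second_draw)  # TRUE

# Seed set AFTER the draw: this draw used whatever RNG state the session
# happened to be in, so another session will generally get different values.
unseeded_draw <- sample(1:100, 5)
set.seed(123)  # too late to affect `unseeded_draw`
```

In the reproducible example above, set.seed(123) comes before the data.frame() call for that reason.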
With a new dataset, just changing the sample size from 1000 to 100:
df <- data.frame(y = sample(5000000:120000000, 100, replace = TRUE),
                 yearr = sample(2015:2021, 100, replace = TRUE),
                 monthh = sample(1:12, 100, replace = TRUE),
                 dayy = sample(1:29, 100, replace = TRUE)) |>
  mutate(weekk = week(ymd(paste(yearr, monthh, dayy))),
         datee = ymd(paste(yearr, monthh, dayy))) |>
  filter(!is.na(datee)) |>
  arrange(-desc(datee))
It throws
> resample_bag_mars = fit_resamples(wf_bag_mars,
+ resamples = folds,
+ metrics = metric,
+ control = res_ctrl)
x Slice01: preprocessor 1/1, model 1/1: Error: Input must be a vector, not NULL.
...
and
> resample_random_forest = fit_resamples(wf_random_forest,
+ resamples = folds,
+ metrics = metric,
+ control = res_ctrl)
! Slice01: preprocessor 1/1, model 1/1: The response has five or fewer unique values. Are you sure you want to do reg...
...
I'm not sure why it made so many slices.
What you mentioned here:
My guess is that sliding_period() may have generated an assessment set with 0 rows in it
I believe this is what the error message is referring to, correct?
Unfortunately it is very hard to debug this without a reprex
Ah sorry, I forgot to add the df change. It's just reduced from 1000 -> 100.
Edit: tried reprex a few times, hasn't worked for me :/ but if you just copy/paste everything here, it's reproducible as a whole.
If you are having trouble with reprex, you might take a look at reading this article or watching this video. It really is a huge step toward people being able to help you out.
It looks like Davis is correct, and that resampling strategy results in empty assessment sets:
library(tidyverse)
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#>
#> date, intersect, setdiff, union
library(rsample)
df <- data.frame(y = sample(5000000:120000000, 100, replace = TRUE),
                 yearr = sample(2015:2021, 100, replace = TRUE),
                 monthh = sample(1:12, 100, replace = TRUE),
                 dayy = sample(1:29, 100, replace = TRUE)) |>
  mutate(weekk = week(ymd(paste(yearr, monthh, dayy))),
         datee = ymd(paste(yearr, monthh, dayy))) |>
  filter(!is.na(datee)) |>
  arrange(-desc(datee))
#> Warning: 1 failed to parse.
#> Warning: 1 failed to parse.
folds = df |>
  sliding_period(lookback = Inf, # if Inf, then it's chain
                 #assess_stop = 4,
                 index = datee,
                 period = 'week',
                 every = sqrt(sqrt(nrow(df))) + 1)
folds
#> # Sliding period resampling
#> # A tibble: 58 x 2
#> splits id
#> <list> <chr>
#> 1 <split [2/2]> Slice01
#> 2 <split [4/1]> Slice02
#> 3 <split [5/0]> Slice03
#> 4 <split [6/0]> Slice04
#> 5 <split [7/1]> Slice05
#> 6 <split [8/1]> Slice06
#> 7 <split [9/1]> Slice07
#> 8 <split [10/0]> Slice08
#> 9 <split [11/1]> Slice09
#> 10 <split [12/2]> Slice10
#> # … with 48 more rows
Created on 2021-06-23 by the reprex package (v2.0.0)
This isn't a bug in rsample, though; it's a result of how the sliding periods are set up. You'll need to set up your resampling strategy so that you have analysis and assessment sets that are useful for your modeling, for example, in the workflow sets you want to evaluate.
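One way to check a resampling object for this before fitting anything is to count the rows in each split. The sketch below is an illustration under assumptions, not code from this thread: the toy data frame and its column values are hypothetical, and only the rsample functions sliding_period(), analysis(), and assessment() are used.

```r
library(rsample)

# Hypothetical toy data: 40 rows on irregular dates within one year,
# so some calendar weeks contain no rows at all.
set.seed(123)
toy <- data.frame(
  datee = sort(as.Date("2021-01-01") + sample(0:364, 40, replace = TRUE)),
  y = rnorm(40)
)

folds <- sliding_period(toy, index = datee, period = "week", lookback = Inf)

# Number of rows in each analysis and assessment set
n_analysis   <- vapply(folds$splits, function(s) nrow(analysis(s)), integer(1))
n_assessment <- vapply(folds$splits, function(s) nrow(assessment(s)), integer(1))

# Ids of the slices whose assessment set is empty (possibly none)
folds$id[n_assessment == 0]
```

If that last line returns any ids, those slices have nothing to predict on, and the period, every, or assess_stop settings probably need to change so that every assessment window contains data.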
If you are having trouble with reprex, you might take a look at reading this article or watching this video. It really is a huge step toward people being able to help you out.
I saw the first one already actually. I will have to try to see if anything changes by following the second one. I think it will just come down to reinstalling R/R Studio, but after I finish a couple things first.
It does indeed seem like empty assessment sets are the root of the problem.
I was able to get a more reasonable number of sliding period resamples (58 resamples from a sample size of 100 already sounded suspicious) alongside a reasonable number of assessment sets. I wasn't sure how the workflow would affect the number of assessment sets, so I didn't touch the workflow. However, I did tweak the arguments of sliding_period(), particularly assess_stop and every. I set assess_stop = 2 and every = 15 and got more reasonable results. It still seems to come up with a random number of assessment sets. It seems that I misunderstood the documentation on what the every argument does (I still don't fully understand it, among other arguments), so I'll be searching for clarification online, especially on how to keep the number of assessment sets from being random.
Thanks
@PathosEthosLogos A good place to ask for help, for example on how to get every to do what you want in your particular situation, would be RStudio Community.
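As a rough illustration of what every does: the rsample documentation describes it as a single positive integer giving the number of periods to group together. The sketch below compares every = 1L with every = 4L on weekly periods. The daily data frame is hypothetical and has a row on every day, so no window is empty. This is an illustration, not a recommendation for the data in this thread.

```r
library(rsample)

# Hypothetical data: one row per day for a year
daily <- data.frame(
  datee = as.Date("2021-01-01") + 0:364,
  y = seq_len(365)
)

# One slice boundary per week
weekly_folds <- sliding_period(daily, index = datee, period = "week",
                               lookback = Inf, every = 1L)

# Weeks grouped four at a time, so each "period" spans four weeks
grouped_folds <- sliding_period(daily, index = datee, period = "week",
                                lookback = Inf, every = 4L)

nrow(weekly_folds)
nrow(grouped_folds)  # fewer slices than weekly_folds
```

Grouping more weeks per period gives fewer slices and wider assessment windows, which makes an empty assessment set less likely when the dates are sparse.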
This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.
This is where it gets stuck. The debug message I get is:
My (stretched) guess is that it has something to do with sliding_period() at the beginning, when each cross-validation fold or window is being created, with the first few and last few being ignored.