tidymodels / rsample

Classes and functions to create and summarize resampling objects
https://rsample.tidymodels.org

Sliding window resampled number of observations mismatched? #247

Closed PathosEthosLogos closed 3 years ago

PathosEthosLogos commented 3 years ago
df <- data.frame(y  = sample(5000000:120000000, 1000, replace = TRUE),
                 yearr = sample(2015:2021, 1000, replace = TRUE),
                 monthh = sample(1:12, 1000, replace = TRUE),
                 dayy = sample(1:29, 1000, replace = TRUE)) |>
  mutate(weekk = week(ymd(paste(yearr, monthh, dayy))),
         datee = ymd(paste(yearr, monthh, dayy))) |>
  filter(!is.na(datee)) |>
  arrange(-desc(datee))

set.seed(1)
folds = df |>
  sliding_period(lookback = Inf, # if Inf, then it's chain
                 #assess_stop = 4,
                 index = datee,
                 period = 'week',
                 every = sqrt(sqrt(nrow(df))) + 1)

rec_df = 
  recipe(y ~ ., data = df)

metric = metric_set(rmse) # mae, or accuracy and roc_auc for classifications

grid_ctrl = control_stack_grid()
res_ctrl = control_stack_resamples()

### Bag Mars
spec_bag_mars = baguette::bag_mars() |>
  set_mode('regression') |>
  set_engine('earth')

rec_bag_mars = rec_df |> # Recipe
  #step_dummy(all_nominal()) |>
  step_zv(all_numeric(), all_outcomes()) |>
  step_normalize(all_numeric(), -all_outcomes())
#step_zv(all_predictors(), skip = TRUE) |>
#step_normalize(all_numeric(), skip = TRUE)

wf_bag_mars = workflow() |>
  add_model(spec_bag_mars) |>
  add_recipe(rec_bag_mars)

resample_bag_mars = fit_resamples(wf_bag_mars,
                                  resamples = folds,
                                  metrics = metric,
                                  control = res_ctrl)

###

### Random forest

spec_random_forest = rand_forest(mtry = 3, min_n = 4) |>
  set_engine('randomForest') |>
  set_mode('regression')

rec_random_forest = rec_df |> # Recipe
  step_dummy(all_nominal()) |>
  step_zv(all_numeric(), all_outcomes()) |>
  step_normalize(all_numeric(), -all_outcomes())

wf_random_forest = workflow() |>
  add_model(spec_random_forest) |>
  add_recipe(rec_random_forest)

resample_random_forest = fit_resamples(wf_random_forest,
                                       resamples = folds,
                                       metrics = metric,
                                       control = res_ctrl)

stack_enet = stacks() |>
  add_candidates(resample_bag_mars) |>
  add_candidates(resample_random_forest)

model_stack = stack_enet |>
  blend_predictions() # Check weights

autoplot(model_stack)
autoplot(model_stack, type = 'members')
autoplot(model_stack, type = 'weights')

model_stack = model_stack |>
  fit_members()

model_stack |>
  collect_parameters('resample_bag_mars')
#model_stack |>
#  collect_parameters('resample_prophet_boost')
model_stack |>
  collect_parameters('resample_random_forest')

df_pred = 
  df |>
  bind_cols(predict(model_stack, df)) |>
  #df_test |>
  #bind_cols(predict(model_stack, df_test)) |>
  rename(y_hat = .pred)

df_pred |>
  ggplot() +
  geom_line(aes(y = y_hat,
                x = ymd(datee)),
            colour = 'red',
            size = 1.5) +
  geom_line(aes(y = y,
                x = ymd(datee))) +
  ggtitle('First model stacking attempt')

df_pred |>
  ggplot() +
  geom_point(aes(x = y,
                 y = y_hat)) + 
  coord_obs_pred()

# Performance
member_preds =
  df |>
  select(y) |>
  bind_cols(predict(model_stack, df, members = TRUE))

This is where it gets stuck. The debug message I get is:

Error: Can't recycle `..1` (size 999) to match `..2` (size 0).

My (admittedly stretched) guess is that it has something to do with sliding_period() at the beginning: when each cross-validation fold or window is created, the first few and last few are being ignored.
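
For readers unfamiliar with this message: it is the kind of error dplyr::bind_cols() raises when its inputs have different numbers of rows. The sketch below reproduces a similar failure with hypothetical stand-in tables (observed and empty_preds are illustrative names, not objects from the post). It only shows one way such a message can arise; it does not establish what predict() returned in the original session.

```r
library(dplyr)

# Hypothetical stand-ins: 999 observed outcomes and a prediction table with 0 rows.
observed <- tibble(y = seq_len(999))
empty_preds <- tibble(.pred = numeric(0))

# bind_cols() needs inputs with the same number of rows (only size-1 inputs
# are recycled), so combining 999 rows with 0 rows fails.
result <- tryCatch(
  bind_cols(observed, empty_preds),
  error = function(e) e
)

inherits(result, "error")   # TRUE when the call failed
conditionMessage(result)    # the text of the size-mismatch error
```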

DavisVaughan commented 3 years ago

Could you please turn this into a self-contained reprex (short for minimal reproducible example)? It will help us help you if we can be sure we're all working with/looking at the same stuff.

If you've never heard of a reprex before, you might want to start by reading the tidyverse.org help page.

You can install reprex by running (you may already have it, though, if you have the tidyverse package installed):

install.packages("reprex")

Notably, the above example is not reproducible for us because you call sample() before setting the seed, so we can't generate the exact df you have here. Below I have a reproducible example that actually runs fine; the line you pointed out doesn't error with this seed. If you want to have a go at changing the seed to see if you can make it error, that might work.
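
A minimal base R sketch of why the position of set.seed() matters. The variable names are illustrative only. Setting the seed fixes the state of the random number generator, so only draws made after that call can be recreated by someone else.

```r
# Seed set before the draw: the result is reproducible.
set.seed(123)
first_draw <- sample(1:100, 5)

set.seed(123)
second_draw <- sample(1:100, 5)

identical(first_draw, second_draw)  # TRUE: same seed, same draw

# Seed set only after the draw: the values came from whatever state the
# generator was in at the time, so nobody else can regenerate them.
unseeded_draw <- sample(1:100, 5)
set.seed(123)
```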

My guess is that sliding_period() may have generated an assessment set with 0 rows in it (this would be dependent on the seed)? And that is causing an issue somewhere along the way. I don't have any proof though. That wouldn't be an rsample bug, since that is perfectly valid.
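
One way to check this guess is to count the rows in the analysis and assessment set of every slice. The sketch below uses a small hypothetical data frame (toy) with deliberate gaps between dates instead of the df from the post; the column name datee is borrowed from the thread, everything else is an assumption for illustration.

```r
library(rsample)

# Hypothetical toy data: seven rows with gaps of several weeks between dates.
toy <- data.frame(
  datee = as.Date("2021-01-01") + c(0, 1, 8, 9, 30, 31, 45),
  y = 1:7
)

toy_folds <- sliding_period(toy, index = datee, period = "week", lookback = Inf)

# Row counts of the analysis and assessment set for each slice.
n_analysis <- vapply(toy_folds$splits, function(s) nrow(analysis(s)), integer(1))
n_assess <- vapply(toy_folds$splits, function(s) nrow(assessment(s)), integer(1))

# Slices whose n_assess is 0 have nothing to predict on.
data.frame(id = toy_folds$id, n_analysis, n_assess)
```

The same two vapply() calls can be run on the folds object from the post to see whether any slice has an empty assessment set.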

library(tidyverse)
library(rsample)
library(lubridate)
library(tidymodels)
library(stacks)
library(baguette)

set.seed(123)

df <- data.frame(y  = sample(5000000:120000000, 1000, replace = TRUE),
                 yearr = sample(2015:2021, 1000, replace = TRUE),
                 monthh = sample(1:12, 1000, replace = TRUE),
                 dayy = sample(1:29, 1000, replace = TRUE)) |>
  mutate(weekk = week(ymd(paste(yearr, monthh, dayy))),
         datee = ymd(paste(yearr, monthh, dayy))) |>
  filter(!is.na(datee)) |>
  arrange(-desc(datee))

folds = df |>
  sliding_period(lookback = Inf, # if Inf, then it's chain
                 #assess_stop = 4,
                 index = datee,
                 period = 'week',
                 every = sqrt(sqrt(nrow(df))) + 1)

rec_df = 
  recipe(y ~ ., data = df)

metric = metric_set(rmse) # mae, or accuracy and roc_auc for classifications

grid_ctrl = control_stack_grid()
res_ctrl = control_stack_resamples()

### Bag Mars
spec_bag_mars = bag_mars() |>
  set_mode('regression') |>
  set_engine('earth')

rec_bag_mars = rec_df |> # Recipe
  #step_dummy(all_nominal()) |>
  step_zv(all_numeric(), all_outcomes()) |>
  step_normalize(all_numeric(), -all_outcomes())
#step_zv(all_predictors(), skip = TRUE) |>
#step_normalize(all_numeric(), skip = TRUE)

wf_bag_mars = workflow() |>
  add_model(spec_bag_mars) |>
  add_recipe(rec_bag_mars)

resample_bag_mars = fit_resamples(wf_bag_mars,
                                  resamples = folds,
                                  metrics = metric,
                                  control = res_ctrl)

###

### Random forest

spec_random_forest = rand_forest(mtry = 3, min_n = 4) |>
  set_engine('randomForest') |>
  set_mode('regression')

rec_random_forest = rec_df |> # Recipe
  step_dummy(all_nominal()) |>
  step_zv(all_numeric(), all_outcomes()) |>
  step_normalize(all_numeric(), -all_outcomes())

wf_random_forest = workflow() |>
  add_model(spec_random_forest) |>
  add_recipe(rec_random_forest)

resample_random_forest = fit_resamples(wf_random_forest,
                                       resamples = folds,
                                       metrics = metric,
                                       control = res_ctrl)

stack_enet = stacks() |>
  add_candidates(resample_bag_mars) |>
  add_candidates(resample_random_forest)

model_stack = stack_enet |>
  blend_predictions() # Check weights

autoplot(model_stack)
autoplot(model_stack, type = 'members')
autoplot(model_stack, type = 'weights')

model_stack = model_stack |>
  fit_members()

model_stack |>
  collect_parameters('resample_bag_mars')
#model_stack |>
#  collect_parameters('resample_prophet_boost')
model_stack |>
  collect_parameters('resample_random_forest')

df_pred = 
  df |>
  bind_cols(predict(model_stack, df)) |>
  #df_test |>
  #bind_cols(predict(model_stack, df_test)) |>
  rename(y_hat = .pred)

df_pred |>
  ggplot() +
  geom_line(aes(y = y_hat,
                x = ymd(datee)),
            colour = 'red',
            size = 1.5) +
  geom_line(aes(y = y,
                x = ymd(datee))) +
  ggtitle('First model stacking attempt')

df_pred |>
  ggplot() +
  geom_point(aes(x = y,
                 y = y_hat)) + 
  coord_obs_pred()

# Performance
member_preds =
  df |>
  select(y) |>
  bind_cols(predict(model_stack, df, members = TRUE))

PathosEthosLogos commented 3 years ago

Confirmed that this works! Thanks

I find it strange that setting the seed before or after the df was generated matters. I feel like I understand the gist of why, but maybe not completely.

Sorry about the reprex situation. Previously, I tried to use it for another issue posted on GitHub and I tried again today, but it didn't work for me. Sorry that I didn't include the libraries used for this example.

Case closed

PathosEthosLogos commented 3 years ago

With a new dataset, just changing the sample size from 1000 to 100:

df <- data.frame(y  = sample(5000000:120000000, 100, replace = TRUE),
                 yearr = sample(2015:2021, 100, replace = TRUE),
                 monthh = sample(1:12, 100, replace = TRUE),
                 dayy = sample(1:29, 100, replace = TRUE)) |>
  mutate(weekk = week(ymd(paste(yearr, monthh, dayy))),
         datee = ymd(paste(yearr, monthh, dayy))) |>
  filter(!is.na(datee)) |>
  arrange(-desc(datee))

It throws

> resample_bag_mars = fit_resamples(wf_bag_mars,
+                                   resamples = folds,
+                                   metrics = metric,
+                                   control = res_ctrl)
x Slice01: preprocessor 1/1, model 1/1: Error: Input must be a vector, not NULL.
...

and

> resample_random_forest = fit_resamples(wf_random_forest,
+                                        resamples = folds,
+                                        metrics = metric,
+                                        control = res_ctrl)
! Slice01: preprocessor 1/1, model 1/1: The response has five or fewer unique values.  Are you sure you want to do reg...
...

I'm not sure why it made so many slices.

What you mentioned here:

My guess is that sliding_period() may have generated an assessment set with 0 rows in it

I believe this is what the error message is referring to, correct?

DavisVaughan commented 3 years ago

Unfortunately it is very hard to debug this without a reprex

PathosEthosLogos commented 3 years ago

Unfortunately it is very hard to debug this without a reprex

Ah sorry, I forgot to add the df change. It's just reduced from 1000 -> 100.

Edit: tried reprex a few times, hasn't worked for me :/ but if you just copy/paste everything here, it's reproducible as a whole.

juliasilge commented 3 years ago

If you are having trouble with reprex, you might take a look at reading this article or watching this video. It really is a huge step toward people being able to help you out.

It looks like Davis is correct, and that resampling strategy results in empty assessment sets:

library(tidyverse)
library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#> 
#>     date, intersect, setdiff, union
library(rsample)

df <- data.frame(y  = sample(5000000:120000000, 100, replace = TRUE),
                 yearr = sample(2015:2021, 100, replace = TRUE),
                 monthh = sample(1:12, 100, replace = TRUE),
                 dayy = sample(1:29, 100, replace = TRUE)) |>
   mutate(weekk = week(ymd(paste(yearr, monthh, dayy))),
          datee = ymd(paste(yearr, monthh, dayy))) |>
   filter(!is.na(datee)) |>
   arrange(-desc(datee))
#> Warning: 1 failed to parse.

#> Warning: 1 failed to parse.

folds = df |>
   sliding_period(lookback = Inf, # if Inf, then it's chain
                  #assess_stop = 4,
                  index = datee,
                  period = 'week',
                  every = sqrt(sqrt(nrow(df))) + 1)

folds
#> # Sliding period resampling 
#> # A tibble: 58 x 2
#>    splits         id     
#>    <list>         <chr>  
#>  1 <split [2/2]>  Slice01
#>  2 <split [4/1]>  Slice02
#>  3 <split [5/0]>  Slice03
#>  4 <split [6/0]>  Slice04
#>  5 <split [7/1]>  Slice05
#>  6 <split [8/1]>  Slice06
#>  7 <split [9/1]>  Slice07
#>  8 <split [10/0]> Slice08
#>  9 <split [11/1]> Slice09
#> 10 <split [12/2]> Slice10
#> # … with 48 more rows

Created on 2021-06-23 by the reprex package (v2.0.0)
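
The "failed to parse" warnings in the output above are consistent with the way the dates are built: dayy is sampled from 1:29 and monthh from 1:12, so a row can land on 29 February of a non-leap year. That is also a plausible reason the original error reported 999 rows and not 1000, although the thread does not confirm it. A minimal sketch of the parsing behaviour:

```r
library(lubridate)

# 2015 is not a leap year, so 29 February does not exist and ymd() returns NA
# (with a "failed to parse" warning, suppressed here).
bad <- suppressWarnings(ymd(paste(2015, 2, 29)))

# 2016 is a leap year, so the same month/day combination parses.
good <- ymd(paste(2016, 2, 29))

is.na(bad)   # TRUE: this row would be dropped by filter(!is.na(datee))
is.na(good)  # FALSE
```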

This isn't a bug in rsample, though; it's a result of how the sliding periods are set up. You'll need to set up your resampling strategy so that you have analysis and assessment sets that are useful for your modeling, for example, in the workflow sets you want to evaluate.
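
As one possible illustration of that advice (not something proposed in the thread), slices with an empty assessment set can be dropped and the remaining splits rebuilt into a resampling object with rsample::manual_rset(). The toy data and the names keep and usable_folds are hypothetical. Adjusting the sliding_period() arguments so that empty sets do not occur in the first place may well be the better fix.

```r
library(rsample)

# Hypothetical toy data with gaps, so some weeks contain no rows.
toy <- data.frame(
  datee = as.Date("2021-01-01") + c(0, 1, 8, 9, 30, 31, 45),
  y = 1:7
)
toy_folds <- sliding_period(toy, index = datee, period = "week", lookback = Inf)

# Flag the slices that have at least one assessment row.
n_assess <- vapply(toy_folds$splits, function(s) nrow(assessment(s)), integer(1))
keep <- n_assess > 0

# Rebuild a resampling object from the kept splits and their ids.
usable_folds <- manual_rset(toy_folds$splits[keep], toy_folds$id[keep])
usable_folds
```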

PathosEthosLogos commented 3 years ago

If you are having trouble with reprex, you might take a look at reading this article or watching this video. It really is a huge step toward people being able to help you out.

I saw the first one already, actually. I will have to see if anything changes by following the second one. I think it will just come down to reinstalling R/RStudio, but only after I finish a couple of things first.

It does indeed seem like empty assessment sets are the root of the problem.

I was able to get a more reasonable number of sliding period resamples (with a sample size of 100, 58 resamples already sounded suspicious) alongside a reasonable number of assessment sets. I wasn't sure how the workflow would affect the number of assessment sets, so I didn't touch the workflow. However, I did tweak the arguments of sliding_period(), particularly assess_stop and every. I set assess_stop = 2 and every = 15 and got more reasonable results. It still seems to come up with a random number of assessment sets. It seems that I misunderstood the documentation on what the every argument does (I still don't fully understand it, among other arguments), so I'll be searching for some clarification online, especially on how to keep the number of assessment sets from being random.
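
A rough sketch of how every and assess_stop change the slices, using a hypothetical gap-free daily series (daily) in place of the randomly sampled dates from the thread. every is the number of periods grouped into one block, and it is documented as a positive integer; the value sqrt(sqrt(nrow(df))) + 1 used earlier is not whole, and the thread does not show how it gets coerced. With randomly sampled dates and no seed, which periods contain rows changes between runs, which would explain the varying number of slices.

```r
library(rsample)

# Hypothetical daily series: one row per day for 140 days, no gaps.
daily <- data.frame(
  datee = as.Date("2021-01-01") + 0:139,
  y = seq_len(140)
)

# every = 1: each week is its own period.
weekly <- sliding_period(daily, index = datee, period = "week",
                         lookback = Inf, every = 1L)

# every = 4: four consecutive weeks form one period, giving fewer, larger slices.
four_weekly <- sliding_period(daily, index = datee, period = "week",
                              lookback = Inf, every = 4L)

nrow(weekly)
nrow(four_weekly)

# assess_stop = 2: each assessment set spans the next two periods, not one.
two_ahead <- sliding_period(daily, index = datee, period = "week",
                            lookback = Inf, every = 4L, assess_stop = 2L)
nrow(assessment(two_ahead$splits[[1]]))
```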

Thanks

juliasilge commented 3 years ago

@PathosEthosLogos A good place to ask for help, for example on how to get every to do what you want in your particular situation, would be RStudio Community.

github-actions[bot] commented 3 years ago

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.