juliasilge / juliasilge.com

My blog, built with blogdown and Hugo :link:
https://juliasilge.com/

Bagging with tidymodels and #TidyTuesday astronaut missions | Julia Silge #21

Open utterances-bot opened 3 years ago

utterances-bot commented 3 years ago

Bagging with tidymodels and #TidyTuesday astronaut missions | Julia Silge

Learn how to use bootstrap aggregating to predict the duration of astronaut missions.

https://juliasilge.com/blog/astronaut-missions-bagging/

mwilson19 commented 3 years ago

QQ Julia - do you have any resources or examples of tuning bagged trees with fit_resamples()? Like, how do you optimize some of the hyperparameters? Thank you!

juliasilge commented 3 years ago

@mwilson19 You can tune the hyperparameters pretty much like you do any other model in tidymodels; here is a small example to get you started.
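
For anyone reading along, here is a minimal sketch in that spirit (not the linked example itself; bag_tree() comes from the baguette package, and training_data is an assumed training set):

library(tidymodels)
library(baguette)  # provides bag_tree()

# mark hyperparameters for tuning, just like any other parsnip model
bag_spec <-
  bag_tree(cost_complexity = tune(), min_n = tune()) %>%
  set_engine("rpart", times = 25) %>%
  set_mode("regression")

bag_wf <-
  workflow() %>%
  add_formula(outcome ~ .) %>%
  add_model(bag_spec)

set.seed(123)
boots <- bootstraps(training_data, times = 5)

# every candidate from the grid is tried out on every resample
bag_res <- tune_grid(bag_wf, resamples = boots, grid = 5)
show_best(bag_res, metric = "rmse")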

mwilson19 commented 3 years ago

Great, thank you for sharing!! So does the tuning optimization still work? I.e., in your small example, it's not bootstrapping the decision trees (times = 25) within each tuning bootstrap, or is it? Seems like that could take a long time: in your case 5 x 25 bootstraps, or 125 fits, then multiplied by however many candidates are in the tuning grid.

I thought I remembered that tune uses the OOB samples from bootstraps for optimization with something like random forest.

Thank you!

juliasilge commented 3 years ago

Oh, it is fitting the bagged tree to each bootstrap resample, which does take a little while! Certainly it would for a more realistic data set. Often a bagged tree can do better than, say, an xgboost or similar model even without tuning (here is an example where that happened) but if you want to tune those hyperparameters, you do need to try them out on different resamples. You can of course use the "normal" tricks to tune faster.
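
For concreteness, two common examples of those "normal" tricks, sketched against the illustrative bag_wf and boots from above: a parallel backend, and racing from the finetune package, which abandons clearly poor candidates early instead of fitting every candidate on every resample.

library(tidymodels)

# trick 1: fit the resamples in parallel
library(doParallel)
registerDoParallel(cores = 4)

# trick 2: racing scores all candidates on a few resamples first,
# then keeps fitting only the ones that are still plausibly best
library(finetune)
race_res <- tune_race_anova(bag_wf, resamples = boots, grid = 10)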

hardin47 commented 2 years ago

In the code you link to (two comments up), you get "folds" from bootstrapping, and in your tune_grid() you use those bootstrap folds.

Why not tune_grid() using OOB? Is there anything in tidymodels that will let you tune parameters using OOB? Can I hack some of the caret functionality to do some OOB work?

It seems like double the (computational) work to do extra bootstrapping instead of letting the free OOB values provide model information.

thank you!!!

juliasilge commented 2 years ago

@hardin47 We don't support getting those OOB samples out super fluently because we believe it is better practice to tune using a nested scheme, but if you want to see if it works out OK in your particular setting, you might want to check out this article for handling some of the objects/approaches involved.
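
For anyone who wants to poke at those objects, the general tidymodels mechanism is the extract argument of the control functions, which can pull the underlying fitted bagger out of each resample; where any OOB information lives from there depends on the engine. A sketch, reusing the illustrative bag_wf and boots from above:

library(tidymodels)

# keep the fitted engine object from each candidate/resample fit
ctrl <- control_grid(extract = extract_fit_engine)

bag_res <- tune_grid(bag_wf, resamples = boots, grid = 5, control = ctrl)

# each element of .extracts is a tibble holding the fitted bagger objects
bag_res$.extracts[[1]]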

hardin47 commented 2 years ago

@juliasilge The nested stuff is very cool, indeed. One might even say that it has advantages over simple cross validation, too. But it is going to be hard to understand the nested mechanism without first understanding cross validation (and, dare I say, OOB errors). Not having any OOB error analysis in tidymodels will make the package less useful in the classroom, and I worry that the disconnect will have negative fallout in a variety of ways. Just my two cents... although I'm going to make a feature request as a tidymodels issue. :)

juliasilge commented 2 years ago

@hardin47 I'm so glad you posted the issue; thank you 🙌

conlelevn commented 2 years ago

Hi Julia, even though you explained in the video why we don't use step_log() on the outcome variable, I still feel confused here: does it make any difference if we use step_log() rather than log() the outcome beforehand?

juliasilge commented 2 years ago

@conlelevn It's not different in the sense that you are log-transforming the outcome either way. It does make a difference in that you can run into problems when predicting on new data or tuning if you preprocess the outcome using feature engineering that is suited/designed for processing predictors.
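
A sketch of the difference, assuming a data frame df with a positive numeric outcome column:

library(tidymodels)

# recommended: transform the outcome once, before splitting or resampling
df <- df %>% mutate(outcome = log(outcome))

split <- initial_split(df)
rec <- recipe(outcome ~ ., data = training(split))
# ...feature engineering steps for the predictors go here...

# rather than putting the outcome transformation inside the recipe:
# rec <- recipe(outcome ~ ., data = training(split)) %>%
#   step_log(outcome)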

conlelevn commented 2 years ago

@juliasilge Hmmm, it still sounds weird to me, since in the Modeling GDPR violations screencast you also use step_log() on the outcome variable rather than log() it beforehand.

juliasilge commented 2 years ago

@conlelevn Yes, we have realized that it is a bad idea to recommend that folks use a recipe to process the outcome. We have this in some early blogs and teaching materials, but it causes problems for many users and we no longer recommend it. You can read more about this topic here.

mrguyperson commented 1 year ago

I'm having issues with bag_mars() and dummy variables. I was under the impression that bag_mars() requires categorical variables to be converted to dummy variables using recipe(), but when I try to run it, all models fail due to an error that not all variables are present in the supplied training set. An example with fake data:

library(tidymodels)
library(baguette)  # for bag_mars()

# set up fake data frame
length <- 1000
outcome1_prob <- 0.8
weight <- c("heavy", "light")

data_outcome_1 <- tibble(
  outcome = rnorm(n = length/2, mean = 3),
  weight = sample(weight, size = length/2, replace = TRUE, prob = c(outcome1_prob, 1 - outcome1_prob)),
  length = rnorm(n = length / 2, mean = 10)
)

data_outcome_2 <- tibble(
  outcome = rnorm(n = length/2, mean = 1),
  weight = sample(weight, size = length/2, replace = TRUE, prob = c(1 - outcome1_prob, outcome1_prob)),
  length = rnorm(n = length / 2, mean = 6)
)

data <- data_outcome_1 %>% bind_rows(data_outcome_2)

# train/test split
split <- initial_split(data, prop = 0.8)

training_data <- training(split)
testing_data <- testing(split)

# recipe for data
rec <- recipe(outcome ~ ., data = training_data) %>%
  step_dummy(all_nominal())

juiced <-
  rec %>%
  prep() %>%
  juice()

folds <-
  juiced %>%
  vfold_cv(v = 10)

# model specification
mars_spec <-
  bag_mars() %>%
  set_engine("earth", times = 25) %>%
  set_mode("regression")

# workflow
tune_wf <-
  workflow() %>%
  add_recipe(rec) %>%
  add_model(mars_spec)

# fit model with cross validation
res <- tune_wf %>%
  fit_resamples(
    resamples = folds,
    control = control_resamples(save_pred = TRUE, verbose = TRUE)
  )

It seems to run fine without step_dummy() but the documentation indicates that I should still use it. Any advice? Thank you.

juliasilge commented 1 year ago

@mrguyperson I think the problem is that you created folds from preprocessed data, not raw data. In tidymodels, you want to include both preprocessing (feature engineering) and model estimation together in a workflow(), and then apply it to your raw data, like folds <- vfold_cv(training_data). There is no need to use prep() because the workflow() takes care of it for you. You may want to look at:
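
Concretely, the fix to the example above is to make the folds from the raw training data and let the workflow handle the recipe (sketch):

# resample the raw training data, not the juiced data
folds <- vfold_cv(training_data, v = 10)

# the workflow preps and applies the recipe within each fold,
# so no prep()/juice() is needed
res <- tune_wf %>%
  fit_resamples(
    resamples = folds,
    control = control_resamples(save_pred = TRUE, verbose = TRUE)
  )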

mrguyperson commented 1 year ago

@juliasilge oh my gosh, that was it. Thank you so much!