juliasilge / juliasilge.com

My blog, built with blogdown and Hugo :link:
https://juliasilge.com/

Bootstrap resampling with #TidyTuesday beer production data | Julia Silge #41


utterances-bot commented 2 years ago

Bootstrap resampling with #TidyTuesday beer production data | Julia Silge

I’ve been publishing screencasts demonstrating how to use the tidymodels framework, from first steps in modeling to how to tune more complex models. Today, I’m using this week’s #TidyTuesday dataset on beer production to show how to use bootstrap resampling to estimate model parameters.

https://juliasilge.com/blog/beer-production/

austinwpearce commented 2 years ago

Your tutorials, as well as those on tidymodels.org, have been really helpful.

One problem I encounter, though, is when working with nonlinear models. When I try to apply an nls() formula in the bootstrapping framework, I don't know how to handle a fit that fails to converge. The only workaround I tried creates objects of class try-error, which functions like tidy() and augment() can't handle.

For example, https://www.tidymodels.org/learn/statistics/bootstrap/

I hope you don't mind me asking my question here. If this question is off topic, you can delete my comment and I'll ask it elsewhere.

juliasilge commented 2 years ago

@austinwpearce If you are going to use purrr::map() for this, I think the most natural way to handle failures is with the "adverbs" like possibly() and safely(). I think this is a nice explanation of how to set up that kind of error handling.
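To make that concrete, here is a sketch of the possibly() approach, using the hypothetical nls() model from the linked tidymodels bootstrap article (mpg ~ k / wt + b on mtcars); the helper names are placeholders:

```r
library(tidyverse)
library(rsample)
library(broom)

# Wrap nls() so a non-converging fit returns NULL instead of throwing an error
fit_nls <- function(split) {
  nls(mpg ~ k / wt + b, data = analysis(split), start = list(k = 1, b = 0))
}
safe_fit <- possibly(fit_nls, otherwise = NULL)

boots <- bootstraps(mtcars, times = 100)

boot_models <- boots %>%
  mutate(
    model     = map(splits, safe_fit),
    coef_info = map(model, ~ if (is.null(.x)) NULL else tidy(.x))
  )

# NULL entries drop out when unnesting, so tidy()/augment() never
# see a failed fit
boot_coefs <- boot_models %>% unnest(coef_info)
```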

austinwpearce commented 2 years ago

Thank you! I had no idea!

Using possibly(..., otherwise = NULL) provides a result for failed models that is acceptable to subsequent steps like tidy(), augment(), etc.

conlelevn commented 2 years ago

Hi julia,

In this dataset, you use an lm model to examine the relationship between two variables, malt_and_malt_products and sugar_and_syrups. However, as far as I can see, the malt_and_malt_products variable may not be normally distributed. Don't you think that if we don't transform this variable, the coefficient estimates will be biased and unreliable, making the model flawed?

juliasilge commented 2 years ago

One great thing about the bootstrap is that it does not depend on those kinds of assumptions of a parametric model.

conlelevn commented 2 years ago

Hi Julia,

That's new to me, thanks for the information. I have another question: in this example, you used the map() function to find the coefficient with the lowest RMSE value. Can we call that tuning? And could we use tune_grid() to do it instead of map(), like in another example you did previously?

juliasilge commented 2 years ago

I believe you can use fit_resamples() together with the extract argument to the control object to pull out the coefficients, similar to what I show here. The one thing I can't remember is whether it works with the special formula sugar_and_syrups ~ 0 + malt_and_malt_products (no intercept).
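A sketch of that approach, assuming the beer_boot bootstraps from the post and side-stepping the no-intercept question by using the regular formula (get_coefs is a placeholder name):

```r
library(tidymodels)

lm_spec <- linear_reg() %>% set_engine("lm")

lm_wf <- workflow() %>%
  add_model(lm_spec) %>%
  add_formula(sugar_and_syrups ~ malt_and_malt_products)

# the extract function receives each fitted workflow;
# return whatever you want to keep per resample
get_coefs <- function(x) tidy(extract_fit_engine(x))

lm_res <- fit_resamples(
  lm_wf,
  resamples = beer_boot,
  control = control_resamples(extract = get_coefs)
)

# each row of .extracts holds a nested tibble, so unnest twice
boot_coefs <- lm_res %>%
  select(id, .extracts) %>%
  unnest(.extracts) %>%
  unnest(.extracts)
```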

conlelevn commented 2 years ago

Thanks Julia,

I have a few last questions about this lm model:

  1. Let's say I want tidymodels to show me the final model in the form y = a + b1*x1 + b2*x2 + ... + e, with all the basic information like p-values, standard errors, and so on. How can I do that?
  2. And if I would like to check whether the lm assumptions hold (e.g. homoskedastic, uncorrelated errors), does tidymodels support that?

juliasilge commented 2 years ago

@conlelevn You should look into the broom functions like tidy() and glance(), which give you basic information about a model fit. If you would like to use the regular plot() method for a linear model to see the usual diagnostic plots, you can call extract_fit_engine() on your tidymodels result and then plot it. Alternatively, you can use something like the autoplot() method from ggfortify (also after using extract_fit_engine()).
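A sketch of both routes, assuming a parsnip linear model fit on the brewing_materials data from the post:

```r
library(tidymodels)
library(broom)

lm_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(sugar_and_syrups ~ malt_and_malt_products, data = brewing_materials)

tidy(lm_fit)    # per-term estimates, std. errors, p-values
glance(lm_fit)  # model-level stats: R-squared, sigma, AIC, ...

# base-R diagnostic plots (residuals vs fitted, Q-Q plot, etc.)
lm_fit %>% extract_fit_engine() %>% plot()

# or the ggplot2 versions via ggfortify
library(ggfortify)
lm_fit %>% extract_fit_engine() %>% autoplot()
```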

conlelevn commented 2 years ago

Thanks Julia I will definitely look at it

stephanbitterwolf commented 1 year ago

Hi Julia,

This is such a great guide for using the lm model. In other tutorials we are taught to create recipes that center and scale the data before it is used for prediction. How could I bootstrap with models built with recipes? Thanks for any assistance you can provide.

Example:

model_recipe <- model_data %>%
  recipe(y ~ x + z + a) %>%
  step_center(all_numeric_predictors()) %>%
  step_scale(all_numeric_predictors()) %>%
  prep()

lm_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(y ~ x + z + a, data = bake(model_recipe, new_data = NULL))

stephanbitterwolf commented 1 year ago

I found this: https://rsample.tidymodels.org/articles/Applications/Recipes_and_rsample.html

Would you recommend following that process?

juliasilge commented 1 year ago

@stephanbitterwolf If you have a whole workflow containing a recipe together with a model, then you probably want to use fit_resamples(). If your goal is to estimate model parameters like in this blog post, then you'll want to set up a special way to "extract" the quantities you are interested in.
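A sketch of that setup, using the hypothetical model_data and column names from the question above:

```r
library(tidymodels)

model_recipe <- recipe(y ~ x + z + a, data = model_data) %>%
  step_center(all_numeric_predictors()) %>%
  step_scale(all_numeric_predictors())

# no prep()/bake() needed: the workflow handles that inside each resample,
# so the centering/scaling is re-estimated on every bootstrap analysis set
lm_wf <- workflow() %>%
  add_recipe(model_recipe) %>%
  add_model(linear_reg() %>% set_engine("lm"))

boots <- bootstraps(model_data, times = 1000)

boot_res <- fit_resamples(lm_wf, resamples = boots)
collect_metrics(boot_res)
```

To keep per-resample quantities such as coefficients, pass an extract function via control_resamples(extract = ...) as described above.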

acarpignani commented 7 months ago

Hi Julia, sorry for asking. I was trying to redo the following with tidyverse:

beer_models <- beer_boot %>%
  mutate(
    model = map(splits, ~ lm(sugar_and_syrups ~ 0 + malt_and_malt_products, data = .)),
    coef_info = map(model, tidy)
  )

but it gives me some error. I have attempted to do it this way:

lm_spec <- linear_reg() |> 
    set_engine("lm")

beer_wf <- workflow() |> 
    add_model(lm_spec) |> 
    add_formula(sugar_and_syrups ~ 0 + malt_and_malt_products)

beer_models <- beer_boot |> 
    mutate(
        model = map(splits, ~ fit(lm_spec, .)),
        coef_info = map(model, tidy)
    )

but it doesn't work. My aim is to replicate the graph with all the fitted regression lines, so I don't think simply changing the mutate statement to model = map(splits, ~ fit(lm_spec, analysis(.))) would help when it gets to augment().

Would you be able to help me out with this?

juliasilge commented 7 months ago

@acarpignani Because we are using zero intercept here, you'll need to pass in that model formula separately:

beer_wf <- workflow() |> 
    add_model(lm_spec, formula = sugar_and_syrups ~ 0 + malt_and_malt_products) |> 
    add_formula(sugar_and_syrups ~ malt_and_malt_products)

And then you would pass in a custom extract to the control argument like in this article:

fit_resamples(
  beer_wf,
  beer_boot,
  control = control_resamples(extract = my_custom_func)
)
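Here my_custom_func is a placeholder; one possible definition (a sketch) that keeps the tidied engine coefficients per resample:

```r
my_custom_func <- function(x) {
  tidy(extract_fit_engine(x))
}
```

The results then come back in the .extracts list-column of the fit_resamples() output.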

Alternatively, you might consider using reg_intervals(sugar_and_syrups ~ 0 + malt_and_malt_products, brewing_materials, keep_reps = TRUE) from rsample, which is new since this blog post. You won't be able to use augment() since you get the coefficients but not the model object, but since this is just a linear model with a single predictor you can multiply.
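A sketch of the reg_intervals() route, assuming the brewing_materials data from the post:

```r
library(rsample)

boot_ints <- reg_intervals(
  sugar_and_syrups ~ 0 + malt_and_malt_products,
  data = brewing_materials,
  times = 1000,
  keep_reps = TRUE
)

# boot_ints holds the interval columns plus a .replicates list-column with
# the slope from every bootstrap fit; since there is no intercept, each
# fitted line is just estimate * malt_and_malt_products
```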

acarpignani commented 7 months ago

@juliasilge thank you very much! I think I got stuck on wanting at all costs to build beer_models with the mutate() and map() functions. In the end, the fit_resamples() function does the same thing!

WUJINGSHU0914 commented 2 months ago

Hi Julia,

Thank you sooo much for such a great guide which helps me a lot.

I'm new to tidyverse and bootstrap methods, and I have two questions about calculating a bootstrap AUC for a multivariable logistic regression model:

  1. Are there two approaches to computing a bootstrap AUC: one that corrects for optimism in the parameter estimates only (i.e., fitting the same logistic regression with a fixed set of predictors in each bootstrap sample), and another that corrects for optimism in both variable selection and parameter estimation (i.e., performing variable selection and parameter estimation anew in each bootstrap sample)?
  2. Can both types of bootstrap AUC be computed with the tidyverse packages (just like in this blog)? For the first type, could I simply modify the code from your post to calculate the AUC? For the second type, is it possible to use methods other than automatic variable selection (such as stepwise) during model building?

Thank you very much for your valuable advice, and I apologize if my questions are a bit detailed.

juliasilge commented 2 months ago

@WUJINGSHU0914 Generally, we highly recommend that you consider any feature engineering (like supervised feature selection) as part of your model, to be evaluated via resampling together with model estimation.