juliasilge / juliasilge.com

My blog, built with blogdown and Hugo :link:
https://juliasilge.com/

#TidyTuesday hotel bookings and recipes | Julia Silge #26

Open utterances-bot opened 3 years ago

utterances-bot commented 3 years ago

TidyTuesday hotel bookings and recipes | Julia Silge

Last week I published my first screencast showing how to use the tidymodels framework for machine learning and modeling in R. Today, I’m using this week’s #TidyTuesday dataset on hotel bookings to show how to use recipes, one of the tidymodels packages, with some simple models!

https://juliasilge.com/blog/hotels-recipes/

jstello commented 3 years ago

Thank you for sharing these amazing techniques! I loved the skim() function in particular. I got stuck on the GGally part though; I wasn't able to install it by running library(devtools); install_github("ggobi/ggally").

I'm new to RStudio, but I hope to learn more from your amazing videos. Cheers,

juliasilge commented 3 years ago

@jstello Try installing it straight from CRAN via install.packages("GGally")
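
A minimal sketch of that suggestion; the ggpairs() call on mtcars is only illustrative, not code from the original post:

install.packages("GGally")

library(GGally)

# pairwise plot matrix, the kind of exploratory plot the GGally part of the post builds
ggpairs(mtcars[, c("mpg", "wt", "hp")])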

ntihemuka commented 3 years ago

Hey Julia, how do you get your code to look so neat and formatted? Is there an RStudio functionality that helps format your code as you type?

ntihemuka commented 3 years ago

Error: The first argument to [fit_resamples()] should be either a model or workflow.

I don't know how to shake this error, even when I copy your code exactly.

juliasilge commented 3 years ago

@ntihemuka I do make heavy use of one of the RStudio shortcuts to reindent lines, which helps a lot with how code looks. I select all (command-A on a Mac) and then reindent (command-I). You can see lots of shortcuts here. The other thing I do is try to follow tidyverse style most of the time, but I'm not perfect on that.

This blog post is older and predates a change in tune where the first argument to functions like tune_grid() or fit_resamples() now needs to be a model or a workflow; be sure to put that first now. If you want to see an updated version of this analysis, check out this Get Started article on tidymodels.org.
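
A minimal sketch of the newer argument order, assuming the knn_spec, hotel_rec, and validation_splits objects from the post:

library(tidymodels)

knn_res <- fit_resamples(
  knn_spec,           # the model spec (or a workflow) now goes first
  hotel_rec,          # then the preprocessor: a recipe or formula
  validation_splits,  # then the resamples
  control = control_resamples(save_pred = TRUE)
)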

ntihemuka commented 3 years ago

thanks!


gunnergalactico commented 2 years ago

Hi Dr. Silge,

I tried this example from the website https://www.tidymodels.org/start/case-study/ and noticed an issue with the engine arguments. It appears you can't pass engine-specific arguments like num.threads or importance = "impurity" with the new workflow syntax; it does work with the old set_engine() syntax.

gunnergalactico commented 2 years ago

hotel_stays

juliasilge commented 2 years ago

@gunnergalactico That is correct and as expected; you can only set engine-specific arguments within set_engine().
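
A minimal sketch of where those arguments go, using the ranger arguments mentioned above (the trees and threads values are just placeholders):

library(tidymodels)

rf_spec <-
  rand_forest(trees = 1000) %>%
  set_engine("ranger", num.threads = 4, importance = "impurity") %>%  # engine-specific args live here
  set_mode("classification")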

nguyenlovesrpy commented 2 years ago

Hi, I think that kNN is only for classification on the training data, and it shouldn't be used to predict on a new dataset (testing data). What do you think about it? Thank you and best regards

juliasilge commented 2 years ago

@nguyenlovesrpy A nearest neighbor model can definitely be used to predict for a new dataset; check out examples here for both regression and classification.
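
A minimal sketch of predicting on new data with a nearest neighbor model; the iris split is only illustrative (not from the post) and requires the kknn package:

library(tidymodels)

set.seed(123)
iris_split <- initial_split(iris, strata = Species)
iris_train <- training(iris_split)
iris_test  <- testing(iris_split)

knn_fit <-
  nearest_neighbor() %>%
  set_engine("kknn") %>%
  set_mode("classification") %>%
  fit(Species ~ ., data = iris_train)

# predictions on data the model never saw
predict(knn_fit, new_data = iris_test)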

Cidree commented 1 year ago

Hello. First of all, thank you for all these videos; they are really helpful!

I have a question about the outcome in the confusion matrix. What are we evaluating exactly? When I sum the observations in the confusion matrix there are 22,900 observations, whereas the test set has 18,792 and the training set has 56,374. Why is this?

Cidree commented 1 year ago

Hello again. I think I figured it out. It is because of the Monte Carlo CV, which in this case uses 10% of the data as the validation set 25 times, so we end up with 250% of the observations of the training set.
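
A minimal sketch of how that resampling is set up, assuming the prepped hotel_rec recipe from the post; mc_cv() defaults to times = 25, so prop = 0.9 yields 25 assessment sets that each hold out 10% of the rows passed in:

library(tidymodels)

validation_splits <- mc_cv(juice(hotel_rec), prop = 0.9, strata = children)
validation_splits  # 25 Monte Carlo resamples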

juliasilge commented 1 year ago

Yep, the predictions used in the confusion matrix come from the 25 Monte Carlo resamples, where the predictions are made on the held-out (or "assessment") observations in each resample. You may be interested in trying out the conf_mat_resampled() function.
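
A minimal sketch of both options, assuming knn_res was fit with control_resamples(save_pred = TRUE):

library(tidymodels)

# average the confusion matrix cells across the 25 resamples
conf_mat_resampled(knn_res)

# or pool every assessment-set prediction into one confusion matrix
collect_predictions(knn_res) %>%
  conf_mat(truth = children, estimate = .pred_class)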

ghost commented 1 year ago

Hi Julia, how does the kNN model estimate the correct number of neighbors k? Does the model use a default value?

juliasilge commented 1 year ago

@rcientificos You can check out details like that in the documentation for nearest_neighbor().
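
A minimal sketch of setting that argument yourself instead of relying on the engine default, or marking it for tuning:

library(tidymodels)

knn_spec <-
  nearest_neighbor(neighbors = 5) %>%       # explicit k
  set_engine("kknn") %>%
  set_mode("classification")

knn_tune_spec <-
  nearest_neighbor(neighbors = tune()) %>%  # let tune_grid() choose k
  set_engine("kknn") %>%
  set_mode("classification")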

ghost commented 1 year ago

Thank you! What is the alternative for step_downsample() in recipes? Or do I have to use the themis package?

juliasilge commented 1 year ago

@rcientificos Yes, that's right. The step_downsample() function moved from recipes to themis.
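
A minimal sketch of the updated usage, assuming the hotel_train data from the post; the step drops into a recipe the same way as before:

library(recipes)
library(themis)  # step_downsample() now lives here

hotel_rec <- recipe(children ~ ., data = hotel_train) %>%
  step_downsample(children)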

RaymondBalise commented 5 months ago

Hello Julia,

I noticed that you use the juiced data when you make the resamples in this vlog:

mc_cv(juice(hotel_rec), prop = 0.9, strata = children)

Am I correct that, to avoid leakage caused by step_normalize() in the recipe, it would be best to feed mc_cv() the unprocessed hotel_train data and then use the recipe when you fit the resamples?

It is a small point, but I think this is the modern, simple example code:

# libraries needed (hotel_train comes from the earlier initial_split in the post)
library(tidymodels)
library(themis)  # step_downsample() lives here now

# I changed the juiced, prepped data to be the full untrained training data
validation_splits <- mc_cv(hotel_train, prop = 0.9, strata = children)

knn_spec <- nearest_neighbor() %>%
  set_engine("kknn") %>%
  set_mode("classification")

hotel_rec <- recipe(children ~ ., data = hotel_train) %>%
  step_downsample(children) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_zv(all_numeric()) %>%
  step_normalize(all_numeric()) 

# use full recipe and unprocessed resampled data  
knn_res <- fit_resamples(
  knn_spec,
  hotel_rec,  # use full recipe here vs just children ~ .,
  validation_splits,  # not pre-baked splits
  control = control_resamples(save_pred = TRUE)
)  

Do I have this right?

juliasilge commented 5 months ago

Yes @RaymondBalise that's right. You can see that the article here, which uses the same hotel data, takes an approach more like what you describe than what I have here.