juliasilge / juliasilge.com

My blog, built with blogdown and Hugo :link:
https://juliasilge.com/

Use racing methods to tune xgboost models and predict home runs | Julia Silge #42

utterances-bot opened this issue 3 years ago

utterances-bot commented 3 years ago

Use racing methods to tune xgboost models and predict home runs | Julia Silge

Models like xgboost have many tuning hyperparameters, but racing methods can help identify parameter combinations that are not performing well.

https://juliasilge.com/blog/baseball-racing/

daver787 commented 3 years ago

I thought my computer was fast but tune_race_anova() showed me otherwise.

JunaidMB commented 3 years ago

Hi Julia,

When I run the tune_race_anova function I get the following error:

Creating pre-processing data to finalize unknown parameter: mtry
Racing will minimize the mn_log_loss metric.
Resamples are analyzed in a random order.
Error: There were no valid metrics for the ANOVA model.
Run `rlang::last_error()` to see where the error occurred.
In addition: Warning message:
All models failed. See the `.notes` column. 

What am I doing wrong? I've followed the tutorial step by step so far, so I suspect there is an issue with dependencies here?

juliasilge commented 3 years ago

@JunaidMB Hmmmmm, there are two things that come to mind: I know I was using the development version of dials from GitHub and there was a very recent version of finetune released to CRAN. I'd check to make sure you have both of those installed. I really have got to start adding session info to my blog posts. 😬
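
Something along these lines should get the right versions (a sketch; adjust as needed):

# A sketch: the development version of dials from GitHub plus the latest CRAN finetune
# install.packages("remotes")
remotes::install_github("tidymodels/dials")
install.packages("finetune")
packageVersion("dials")
packageVersion("finetune")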

NickyDy commented 3 years ago

Hi Julia, it's a very useful tutorial. However, I wanted to point out that you've missed a "scales::" in the second code chunk, just before "percent" in the fourth line. :)

kamaulindhardt commented 3 years ago

Hi @juliasilge and @JunaidMB,

I also experienced the exact same error in my workflow set tuning, and I don't understand why.

wflwset_setup <- workflow_set(
  preproc = list(
    normalized = recipe_normal,
    rm_corr = recipe_corr, 
    rm_unbalan = recipe_remove, 
    impute_mean = recipe_impute_mean, 
    impute_knn = recipe_impute_knn
  ),
  models = list(
    lm = lm_model.wf,
    glm = glm_model.wf,
    spline = spline_model.wf,
    knn = knn_model.wf,
    svm = svm_model.wf,
    RF = rf_model.wf,
    XGB = xgb_model.wf,
    CatB = catboost_model.wf
  ),
  cross = TRUE
)
set.seed(579)

if (exists("wflwset_tune_results_cv")) rm("wflwset_tune_results_cv")

# Initializing parallel processing 
doParallel::registerDoParallel()

# Workflowset tuning

wflwset_tune_results_cv <- wflwset_setup %>%
  workflowsets::workflow_map(
    fn        = "tune_race_anova",
    resamples = cv.fold.wf,
    grid      = 15,
    metrics   = multi.metric.wf,
    verbose   = TRUE
  )

# Terminating parallel session
parallelStop()
i   No tuning parameters. `fit_resamples()` will be attempted
i  1 of 35 resampling: normalized_lm
Warning: All models failed. See the `.notes` column.
x  1 of 35 resampling: normalized_lm failed with preprocessor 1/1, model 1/1: Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...): 0 (non-NA) cases
i  2 of 35 tuning:     normalized_glm
Warning: All models failed. See the `.notes` column.
x  2 of 35 tuning:     normalized_glm failed with: There were no valid metrics for the ANOVA model.
i   No tuning parameters. `fit_resamples()` will be attempted
i  3 of 35 resampling: normalized_knn
Warning: All models failed. See the `.notes` column.
x  3 of 35 resampling: normalized_knn failed with preprocessor 1/1, model 1/1: Error in best[1, 2]: subscript out of bounds
i   No tuning parameters. `fit_resamples()` will be attempted
i  4 of 35 resampling: normalized_svm
Warning: All models failed. See the `.notes` column.
x  4 of 35 resampling: normalized_svm failed with preprocessor 1/1, model 1/1: Error in if (any(co)) {: missing value where TRUE/FALSE needed
i  5 of 35 tuning:     normalized_RF
i Creating pre-processing data to finalize unknown parameter: mtry
juliasilge commented 3 years ago

@kamaulindhardt It looks like your models are failing to fit in the first place (which is why you can't then fit an ANOVA model on the results). I would try fitting some of those workflows individually outside of the workflow set, to debug which one is the problem and why.
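
One way to do that (a sketch; extract_workflow() pulls a single workflow out of a workflow set, and the training data name here is a placeholder):

# Debug one workflow at a time instead of the whole set
wf <- extract_workflow(wflwset_setup, id = "normalized_lm")
fit(wf, data = your_training_data)  # placeholder: use your actual training data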

kamaulindhardt commented 3 years ago

Thank you @juliasilge,

I am trying to fit the individual models separately and find it difficult to interpret the issue. The error messages are, for example here with my knn model: "Error: Problem with `mutate()` column `.row`. ℹ `.row = orig_rows`. ℹ `.row` must be size 37 or 1, not 40." What does that mean? I cannot find information about it online.

From the recipe:

base_recipe <- 
  recipe(formula = logRR ~ ., data = af.train.wf) %>%
  update_role(Latitude,
              Longitude,
              new_role = "sample ID") %>% 
  step_zv(all_predictors(), skip = TRUE) %>% # remove any columns with a single unique value
  step_normalize(all_numeric_predictors(), skip = TRUE) # normalize numeric data: standard deviation of one and a mean of zero.

filter_recipe <- 
   base_recipe %>% 
   step_corr(all_numeric_predictors(), threshold = 0.8, skip = TRUE)

Model spec

knn_spec <- 
   nearest_neighbor(neighbors = tune(), 
                    weight_func = tune()) %>% 
   set_engine("kknn") %>% 
   set_mode("regression")

Model tuning with tune_grid()

knn_fit <- tune_grid(knn_spec,
              preprocessor = filter_recipe,
              resamples = cv.fold.wf,
              metrics = multi.metric.wf)

knn_fit

Error(s):

Warning: This tuning result has notes. Example notes on model fitting include:
preprocessor 1/1, model 5/10 (predictions): Error: Problem with `mutate()` column `.row`.
ℹ `.row = orig_rows`.
ℹ `.row` must be size 37 or 1, not 40.
preprocessor 1/1, model 1/10 (predictions): Error: Problem with `mutate()` column `.row`.
ℹ `.row = orig_rows`.
ℹ `.row` must be size 37 or 1, not 40.
preprocessor 1/1, model 2/10 (predictions): Error: Problem with `mutate()` column `.row`.
ℹ `.row = orig_rows`.
ℹ `.row` must be size 39 or 1, not 40.
# Tuning results
# 10-fold cross-validation 
juliasilge commented 3 years ago

It's hard to say without a reprex but I am guessing your problem is using skip = TRUE here, where you are not applying some steps to new data. You can check out this discussion of what skipping steps for new data means.
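
For example, a version of the recipe above without the skip arguments (a sketch), so the steps are also applied when the recipe is baked on assessment or new data:

base_recipe <-
  recipe(formula = logRR ~ ., data = af.train.wf) %>%
  update_role(Latitude, Longitude, new_role = "sample ID") %>%
  step_zv(all_predictors()) %>%             # now also applied to new data
  step_normalize(all_numeric_predictors())  # now also applied to new data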

kamaulindhardt commented 3 years ago

I now added an imputation step step_impute_mean(all_predictors()) to the recipe, and that seems to work:

base_recipe <- 
  recipe(formula = logRR ~ ., data = af.train.wf) %>%
  step_impute_mean(all_predictors())
  update_role(Latitude,
              Longitude,
              new_role = "sample ID") %>% 
  step_zv(all_predictors(), skip = TRUE) %>% # remove any columns with a single unique value
  step_normalize(all_numeric_predictors(), skip = TRUE) # normalize numeric data: standard deviation of one and a mean of zero.

filter_recipe <- 
   base_recipe %>% 
   step_corr(all_numeric_predictors(), threshold = 0.8, skip = TRUE)

How come Random Forest and kNN models cannot cope with missing values? I thought at least RF was designed to deal with missing values. On the other hand, my XGBoost models don't seem to be bothered by them (?)

Thank you!

juliasilge commented 3 years ago

@kamaulindhardt Again, it's hard to say without a reprex, but now it looks to me like you aren't using anything past step_impute_mean() because you don't have a %>% at the end of that line. This model is probably succeeding because you are no longer trying to use the skip = TRUE steps; using skip = TRUE for steps like step_normalize() is a pretty bad idea. I suggest reading through the sections I linked above to understand what skipping steps for new data means.

I also recommend creating a small, self-contained reproducible example to ask for help. Truly, people are just guessing if you don't do this. I know that creating a reprex can feel like a lot of work, but we have found that it is really the only way for someone who needs help online to reliably get the right answer. If you ask a question online without a reprex, think of yourself as just blindly flailing in the dark; when you ask a question with a reprex that demonstrates your problem, think of yourself as having given people the tools to help you.

data-datum commented 2 years ago

Hi Julia, I would like to know how to unfold the folds created with vfold_cv(), to better inspect which samples are in each fold. Thanks

juliasilge commented 2 years ago

@data-datum You might find it helpful to use the tidy() method, or to check out this article on handling rset objects for examples on how to call analysis(). Or you can manually get the indices out; they are in in_id:

library(tidyverse)
library(rsample)

car_folds <- vfold_cv(mtcars, v = 3)
map(car_folds$splits, "in_id")
#> [[1]]
#>  [1]  1  2  3  5  9 11 12 14 15 16 17 18 21 22 23 24 25 26 27 31 32
#> 
#> [[2]]
#>  [1]  1  2  4  6  7  8  9 10 11 12 13 14 17 19 20 22 23 28 29 30 32
#> 
#> [[3]]
#>  [1]  3  4  5  6  7  8 10 13 15 16 18 19 20 21 24 25 26 27 28 29 30 31

Created on 2021-10-28 by the reprex package (v2.0.1)
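
The tidy() and analysis() approaches look something like this (a sketch, reusing car_folds from above):

# analysis() returns the actual data frame used for fitting in a given fold
head(analysis(car_folds$splits[[1]]))

# tidy() labels every row of every fold as Analysis or Assessment
tidy(car_folds)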

tsengj commented 2 years ago

I too have the same issue when using racing to tune a few models

race_ctrl <-
  control_race(
    save_pred = TRUE,
    parallel_over = "everything",
    save_workflow = TRUE,
    verbose = TRUE,
    pkgs = c('stringr')
  )

race_results_time <-
  system.time(
    race_results <-
      all_workflows %>%
      workflow_map(
        "tune_race_anova",
        seed = 1503,
        resamples = vfolds,
        grid = 25,
        verbose = TRUE,
        control = race_ctrl
      )
  )
i 1 of 8 tuning:     pca_norm_recipe_RF
i Creating pre-processing data to finalize unknown parameter: mtry
*** recursive gc invocation
Warning: stack imbalance in 'lapply', 154 then 152
x 1 of 8 tuning:     pca_norm_recipe_RF failed with: There were no valid metrics for the ANOVA model.
i 2 of 8 tuning:     pca_norm_recipe_boosting

It's only successful when I switch from racing to the standard tune_grid():

grid_ctrl <-
  control_grid(
    save_pred = TRUE,
    parallel_over = "everything",
    save_workflow = TRUE,
    pkgs = c('stringr')
  )

full_results_time <-
  system.time(
    grid_results <-
      all_workflows %>%
      workflow_map(
        seed = 1503,
        resamples = vfolds,
        grid = 25,
        control = grid_ctrl,
        verbose = TRUE
      )
  )

i 1 of 8 tuning:     pca_norm_recipe_RF
i Creating pre-processing data to finalize unknown parameter: mtry
v 1 of 8 tuning:     pca_norm_recipe_RF (21m 29.6s)
i 2 of 8 tuning:     pca_norm_recipe_boosting
juliasilge commented 2 years ago

Wow @tsengj I have not seen a garbage collection error from these functions. Can you create a reprex (a minimal reproducible example) for this and post it on the finetune repo? The goal of a reprex is to make it easier for us to recreate your problem so that we can understand it and/or fix it.

If you've never heard of a reprex before, you may want to start with the tidyverse.org help page. You may already have reprex installed (it comes with the tidyverse package), but if not you can install it with:

install.packages("reprex")

Thanks! 🙌

tsengj commented 2 years ago

@juliasilge Turns out that removing the line pkgs = c('stringr') from control_race() fixed the error above. The stringr package was needed for a simple step_mutate() recipe step, which does postcode = as.numeric(str_sub(suburb, -4, -1)). Excluding that from the recipe resolved the issue above. I haven't had the opportunity to raise a reprex in the finetune repo. It doesn't appear as though finetune supports loading packages yet. I utilise parallel processing (doParallel).
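
(For what it's worth, one workaround I might try, just a sketch rather than something I've tested, is to namespace-qualify the call inside step_mutate() so the workers don't need stringr attached:)

# A sketch: qualify str_sub() instead of relying on pkgs = c('stringr')
step_mutate(postcode = as.numeric(stringr::str_sub(suburb, -4, -1)))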

wdkeyzer commented 2 years ago

Hi Julia,

Thank you for your valued contributions! When I run the tune_race_anova() function on a workflow containing an xgboost model, I also get the following error: min_preproc_xgboost failed with: There were no valid metrics for the ANOVA model. All other models are OK. I've been able to run xgboost on the same machine using the approach below, and it worked fine then. I have a hard time debugging this one; do you have any ideas about what might cause this error? I've made a reprex using the diamonds dataset and session info (I hope it's done correctly, as this is my first reprex).

Any help is much appreciated.

library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#>   method                   from   
#>   required_pkgs.model_spec parsnip
library(tidyverse)
library(here)
#> here() starts at /private/var/folders/pw/540tsbnx2r3gtmk605nm1fsc0000gn/T/RtmpTkNsqu/reprex-381939e26d6f-sand-viper
library(baguette)
library(rules)
#> 
#> Attaching package: 'rules'
#> The following object is masked from 'package:dials':
#> 
#>     max_rules
library(finetune)
library(dials)

options(tidymodels.dark = TRUE)
doParallel::registerDoParallel()

carat <- diamonds %>% 
  select(price, cut, carat, clarity)

## Build models

set.seed(123)
carat_split <- initial_split(carat, strata = price)
carat_train <- training(carat_split)
carat_test <- testing(carat_split)

set.seed(234)
carat_folds <- vfold_cv(carat_train, strata = price)
carat_folds
#> #  10-fold cross-validation using stratification 
#> # A tibble: 10 × 2
#>    splits               id    
#>    <list>               <chr> 
#>  1 <split [36405/4048]> Fold01
#>  2 <split [36406/4047]> Fold02
#>  3 <split [36407/4046]> Fold03
#>  4 <split [36408/4045]> Fold04
#>  5 <split [36408/4045]> Fold05
#>  6 <split [36408/4045]> Fold06
#>  7 <split [36408/4045]> Fold07
#>  8 <split [36409/4044]> Fold08
#>  9 <split [36409/4044]> Fold09
#> 10 <split [36409/4044]> Fold10

ranger_spec <-
  rand_forest(trees = 1e3, min_n = tune(), mtry = tune()) %>%
  set_engine("ranger") %>%
  set_mode("regression")

xgb_spec <- boost_tree(tree_depth = tune(), learn_rate = tune(), loss_reduction = tune(), 
                       min_n = tune(), sample_size = tune(), trees = tune()) %>% 
  set_engine("xgboost") %>% 
  set_mode("regression")

cubist_spec <- cubist_rules(committees = tune(), neighbors = tune()) %>% 
  set_engine("Cubist") %>% 
  set_mode("regression")

base_rec <- recipe(formula  = price ~ carat + cut + clarity,
                   data = carat_train) %>% 
  step_string2factor(cut, clarity)

min_pre_proc <- 
  workflow_set(
    preproc = list(min_preproc = base_rec), 
    models = list(RF = ranger_spec, xgboost = xgb_spec, Cubist = cubist_spec)
  )

## Evaluate models

race_ctrl <-
  control_race(
    save_pred = TRUE,
    parallel_over = "everything",
    save_workflow = TRUE
  )

race_results_carat <- 
  min_pre_proc %>% 
  workflow_map("tune_race_anova",
               seed = 1503, 
               resamples = carat_folds,
               grid = 25, 
               control = race_ctrl, 
               verbose = TRUE)
#> i 1 of 3 tuning:     min_preproc_RF
#> i Creating pre-processing data to finalize unknown parameter: mtry
#> ✓ 1 of 3 tuning:     min_preproc_RF (4m 25.9s)
#> i 2 of 3 tuning:     min_preproc_xgboost
#> Warning: All models failed. See the `.notes` column.
#> x 2 of 3 tuning:     min_preproc_xgboost failed with: There were no valid metrics for the ANOVA model.
#> i 3 of 3 tuning:     min_preproc_Cubist
#> ✓ 3 of 3 tuning:     min_preproc_Cubist (3m 46.4s)

Created on 2022-01-31 by the reprex package (v2.0.1)

Session info

``` r
sessionInfo()
#> R version 4.1.2 (2021-11-01)
#> Platform: aarch64-apple-darwin20 (64-bit)
#> Running under: macOS Monterey 12.1
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] nl_BE.UTF-8/nl_BE.UTF-8/nl_BE.UTF-8/C/nl_BE.UTF-8/nl_BE.UTF-8
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] Cubist_0.3.0       lattice_0.20-44    xgboost_1.5.0.2    ranger_0.13.1     
#>  [5] vctrs_0.3.8        rlang_0.4.12       finetune_0.1.0     rules_0.1.2       
#>  [9] baguette_0.1.1     here_1.0.1         forcats_0.5.1      stringr_1.4.0     
#> [13] readr_2.1.1        tidyverse_1.3.1    yardstick_0.0.9    workflowsets_0.1.0
#> [17] workflows_0.2.4    tune_0.1.6         tidyr_1.1.4        tibble_3.1.6      
#> [21] rsample_0.1.1      recipes_0.1.17     purrr_0.3.4        parsnip_0.1.7     
#> [25] modeldata_0.1.1    infer_1.0.0        ggplot2_3.3.5      dplyr_1.0.7       
#> [29] dials_0.0.10       scales_1.1.1       broom_0.7.11       tidymodels_0.1.4  
#> 
#> loaded via a namespace (and not attached):
#>  [1] minqa_1.2.4        colorspace_2.0-2   ellipsis_0.3.2     class_7.3-19      
#>  [5] rprojroot_2.0.2    fs_1.5.2           rstudioapi_0.13    listenv_0.8.0     
#>  [9] furrr_0.2.3        earth_5.3.1        mvtnorm_1.1-3      prodlim_2019.11.13
#> [13] fansi_1.0.2        lubridate_1.8.0    xml2_1.3.3         codetools_0.2-18  
#> [17] splines_4.1.2      doParallel_1.0.16  libcoin_1.0-9      knitr_1.37        
#> [21] Formula_1.2-4      jsonlite_1.7.3     nloptr_1.2.2.3     pROC_1.18.0       
#> [25] dbplyr_2.1.1       compiler_4.1.2     httr_1.4.2         backports_1.4.1   
#> [29] assertthat_0.2.1   Matrix_1.3-4       fastmap_1.1.0      cli_3.1.1         
#> [33] prettyunits_1.1.1  htmltools_0.5.2    tools_4.1.2        partykit_1.2-15   
#> [37] gtable_0.3.0       glue_1.6.0         reshape2_1.4.4     Rcpp_1.0.8        
#> [41] cellranger_1.1.0   DiceDesign_1.9     nlme_3.1-152       iterators_1.0.13  
#> [45] inum_1.0-4         timeDate_3043.102  gower_0.2.2        xfun_0.29         
#> [49] globals_0.14.0     lme4_1.1-27.1      rvest_1.0.2        lifecycle_1.0.1   
#> [53] future_1.23.0      MASS_7.3-54        ipred_0.9-12       hms_1.1.1         
#> [57] parallel_4.1.2     yaml_2.2.1         C50_0.1.5          TeachingDemos_2.12
#> [61] rpart_4.1-15       stringi_1.7.6      highr_0.9          plotrix_3.8-2     
#> [65] foreach_1.5.1      lhs_1.1.3          boot_1.3-28        hardhat_0.1.6     
#> [69] lava_1.6.10        pkgconfig_2.0.3    evaluate_0.14      tidyselect_1.1.1  
#> [73] parallelly_1.30.0  plyr_1.8.6         magrittr_2.0.1     R6_2.5.1          
#> [77] generics_0.1.1     DBI_1.1.2          pillar_1.6.4       haven_2.4.3       
#> [81] withr_2.4.3        survival_3.2-13    nnet_7.3-16        future.apply_1.8.1
#> [85] modelr_0.1.8       crayon_1.4.2       utf8_1.2.2         tzdb_0.2.0        
#> [89] rmarkdown_2.11     grid_4.1.2         readxl_1.3.1       data.table_1.14.2 
#> [93] plotmo_3.6.1       reprex_2.0.1       digest_0.6.29      GPfit_1.0-8       
#> [97] munsell_0.5.0     
```
juliasilge commented 2 years ago

@wdkeyzer xgboost models require all numeric predictors; they can't handle factor predictors like diamonds$clarity or diamonds$cut. You can check out this appendix for more info on the preprocessing needed for different models.

Also, if you ever run into trouble with a workflow set like this, I recommend trying to just plain fit the workflow on your training data, or use tune_grid(). You will likely get a better understanding of where the problems are.
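
For the reprex above, a sketch of one possible fix (an assumption on my part, not something I've run here) is to create indicator variables in the recipe so xgboost only sees numeric predictors:

base_rec <- recipe(price ~ carat + cut + clarity, data = carat_train) %>%
  step_dummy(all_nominal_predictors())  # turn cut/clarity into numeric dummy variables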

wdkeyzer commented 2 years ago

Thank you @juliasilge for your help! I'd come across the appendix before but didn't think of that. Regarding a plain fit and tune_grid(), that's a pro tip that should improve my problem solving in the future. Thank you for pointing this out.

pspangler1 commented 2 years ago

Hi Julia, in the section where you say "Let's use last_fit() to fit one final time to the training data and evaluate one final time on the testing data": what in the code demonstrates that the model is being used on the test set? For example:

collect_predictions(xgb_last) %>% mn_log_loss(is_home_run, .pred_HR)

juliasilge commented 2 years ago

@pspangler1 It's this code, where we use last_fit():

xgb_last <- xgb_wf %>%
  finalize_workflow(select_best(xgb_rs, "mn_log_loss")) %>%
  last_fit(bb_split)

If you look at the number of predictions that are coming out of collect_predictions(xgb_last) you'll notice it is the number of observations in the test set.
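
You can check this directly; for example (a sketch, assuming the objects from the post):

# last_fit() predictions come from the held-out test set
nrow(collect_predictions(xgb_last))  # same as nrow(testing(bb_split))
collect_metrics(xgb_last)            # metrics computed on the test set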

cseibold47 commented 2 years ago

Is there a way to also get predictions for the training set?

juliasilge commented 2 years ago

@cseibold47 We recommend against repredicting the training set for most typical use cases but you can use predict() with a fitted model on any data, which could include the training set.
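
For example (a sketch; the object names assume the setup from this post):

# Pull the fitted workflow out of the last_fit() result
fitted_wf <- extract_workflow(xgb_last)

# predict() works on any data, including (with the caveat above) the training set
predict(fitted_wf, new_data = training(bb_split))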

jtag04 commented 1 year ago

Hi Julia, would you be able to tell me: in the tune_race_anova() step, I know you say it's doing ANOVA to determine which parameter combinations aren't likely to be winners, but is it comparing the models using roc_auc or mn_log_loss?

juliasilge commented 1 year ago

@jtag04 You can read more about this in the docs but the default is to use the first entry in the default metrics() for your model. You can instead specify a different metric to use via the metrics argument.
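
For example (a sketch; the resamples and grid values here are placeholders, not from the post):

xgb_rs <- tune_race_anova(
  xgb_wf,
  resamples = bb_folds,                       # placeholder: your resamples object
  grid = 20,
  metrics = metric_set(mn_log_loss, roc_auc)  # racing compares on the first metric
)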

jtag04 commented 1 year ago

Thanks Julia, that's a big help


jtag04 commented 1 year ago

Hi Julia, I think I've run into a bug using finetune::tune_sim_anneal() https://github.com/tidymodels/dials/issues/258 Is this something you've encountered before?

juliasilge commented 1 year ago

@jtag04 Hmmm, I haven't seen that before. Opening an issue was the right call, and it would be definitely helpful if you could create a reprex (a minimal reproducible example) for that issue. The goal of a reprex is to make it easier for people to recreate your problem so that they can understand it and/or fix it. If you've never heard of a reprex before, you may want to start with the tidyverse.org help page.

jtag04 commented 1 year ago

Yeah, totally; creating a reprex is going to take a little bit of doing, as the model/workflow contains sensitive data. I'll totally give it a shot if I don't hear from Max Kuhn in the coming days. I was hoping I might get lucky and someone would recognise what was going on. It has got me miffed. Cheers


jtag04 commented 1 year ago

Hi Julia, Have added a reprex to that Dials package issue I've logged. Hopefully that's some help. Cheers, Julian


Tadge-Analytics commented 1 year ago

Hey @juliasilge, I do recognise that we're in "open-source world"... but is there any special way of getting some attention to that dials issue I've raised?

juliasilge commented 1 year ago

@Tadge-Analytics I don't believe there are any more steps to take, no. If I were to offer any advice, it would be to try to make a smaller reprex to share, with only the smallest amount of code that generates your problem. You can read more about that here. Also, once you have a minimal reprex, you could post on RStudio Community, to see if anyone there has seen this problem.

jtag04 commented 1 year ago

Thanks for your help in getting that issue of mine some attention, Julia. So glad I was able to work out a solution.

I was also just wondering if you've ever made use of the metaflow r package in your machine learning travels?

The concept sounds excellent, but it doesn't look like it has all that much of a following.


juliasilge commented 1 year ago

@jtag04 I've experimented with metaflow a little, just to see what it does, but I haven't used it in any real applications. I'm not sure it's a good fit for the type of approach I usually take.

QizhiSu commented 1 year ago

Dear all, I have experienced the same problem.

I have noticed two things:

  1. When I used step_pca() and wanted to tune the threshold parameter, tune_race_anova() failed (but tune_grid() is fine); if I didn't tune the threshold parameter, it was fine.
  2. Using steps from the recipeselectors package failed all the time (Error in test_parameters_gls(res, control$alpha): There were no valid metrics for the ANOVA model), regardless of whether I tuned the threshold parameter or not.

I hope these findings are helpful in tackling this problem.
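
For context, the tunable step_pca() setup described in point 1 looks something like this (a sketch with placeholder names):

pca_rec <- recipe(outcome ~ ., data = dat) %>%  # placeholder formula and data
  step_normalize(all_numeric_predictors()) %>%
  step_pca(all_numeric_predictors(), threshold = tune())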

jrwalker-projects commented 3 months ago

Thanks for doing this - most instructive. A small tweak to the hex diagram shows the difference between left- and right-handed hitting zones:

train_raw %>%
  mutate(lefty_batter = if_else(is_batter_lefty == 1, "Lefty", "Righty")) %>%
  ggplot(aes(plate_x, plate_z, z = is_home_run)) +
  stat_summary_hex(alpha = 0.8, bins = 10) +
  scale_fill_viridis_c() +
  labs(fill = "% home runs") +
  facet_grid(cols = vars(lefty_batter))