forecastingEDs opened this issue 1 year ago
Sorry for the delayed response - did you try the 'tidy' method on the extracted recipe object - like:
prepped |> tidy(id = "my_step_id", type = "scores")
Hello @stevenpawley, please help me!
This code does not produce the 'variable' and 'scores' columns containing the variable names and their information gain scores; I can't extract this information after training my recipe.
I will provide a reproducible example with my data, but if you prefer, show me an example of how to extract these variables and their respective information gain scores using the modeldata package.
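For reference, such an example might look like the sketch below, using the `cells` dataset from modeldata. This is an untested sketch that assumes colino's `step_select_infgain()` and `tidy()` interfaces as shown elsewhere in this thread; the `number` passed to `tidy()` must match the position of the selection step within the recipe.

```r
library(recipes)
library(colino)      # provides step_select_infgain()
library(modeldata)

data(cells)

rec <- recipe(class ~ ., data = cells) %>%
  step_rm(case) %>%                       # drop the train/test indicator column
  step_select_infgain(all_predictors(), outcome = "class",
                      top_p = 10, scores = TRUE) %>%
  prep()

# Which variables the selection step kept or removed
tidy(rec, number = 2)

# The information gain score assigned to each variable
tidy(rec, number = 2, type = "scores")
```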
reprex
# Load the following R packages ----
library(remotes)
remotes::install_github("business-science/modeltime", dependencies = TRUE)
remotes::install_github("business-science/modeltime.ensemble")
remotes::install_github("tidymodels/recipes")
remotes::install_github("tidymodels/bonsai")
remotes::install_github("curso-r/treesnip")
library(devtools)
devtools::install_github("stevenpawley/recipeselectors")
devtools::install_github("stevenpawley/colino")
library(recipes)
library(tune)
library(keras)
library(modeltime.ensemble)
library(tidymodels)
library(modeltime)
library(lubridate)
library(tidyverse)
library(timetk)
library(tidyquant)
library(yardstick)
library(reshape)
library(plotly)
library(xgboost)
library(rsample)
library(targets)
library(LiblineaR)
library(parsnip)
library(ranger)
library(kknn)
library(readxl)
library(lifecycle)
library(skimr)
library(bonsai)
library(lightgbm)
library(treesnip)
library(rio)
library(colino)
library(recipeselectors)
library(FSelectorRcpp)
library(care)
library(Boruta)
library(praznik)
library(parallel)
library(foreach)
library(doParallel)
library(RcppParallel)
data_tbl <- datasets %>%
select(id, Date, attendences, average_temperature, min, max, sunday, monday, tuesday, wednesday, thursday, friday, saturday, Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec) %>%
set_names(c("id", "date", "value","tempe_verage", "tempemin", "tempemax", "sunday", "monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"))
data_tbl
full_data_tbl <- datasets %>%
select(id, Date, attendences, average_temperature, min, max, sunday, monday, tuesday, wednesday, thursday, friday, saturday, Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec) %>%
set_names(c("id", "date", "value","tempe_verage", "tempemin", "tempemax", "sunday", "monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")) %>%
group_by(id) %>%
future_frame(
.date_var = date,
.length_out = "45 days",
.bind_data = TRUE
) %>%
ungroup() %>%
mutate(id = fct_drop(id))
data_prepared_tbl <- full_data_tbl %>%
filter(!is.na(value))
future_tbl <- full_data_tbl %>%
filter(is.na(value))
data_prepared_tbl %>% glimpse()
# Summary diagnostics ----
# Check the regularity of all time series in the training set with timetk::tk_summary_diagnostics()
data_prepared_tbl %>%
group_by(id) %>%
tk_summary_diagnostics()
emergency_tscv <- data_prepared_tbl %>%
  time_series_cv(
    date_var    = date,
    assess      = "45 days",
    skip        = "30 days",
    cumulative  = TRUE,
    slice_limit = 5
  )
emergency_tscv
# Information gain feature selection ----
recipe_spec <- recipe(value ~ ., data = training(emergency_tscv$splits[[1]])) %>%
  step_timeseries_signature(date) %>%
  step_rm(matches("(.iso$)|(.xts$)|(.lbl$)|(hour)|(minute)|(second)|(am.pm)|(date_year$)")) %>%
  step_normalize(date_index.num, date_mday7, date_week4, date_week3, date_week2, date_week,
                 date_mweek, date_yday, date_qday, date_mday, date_wday, date_day,
                 date_month, date_quarter, date_half,
                 tempe_verage, tempemin, tempemax, -all_outcomes()) %>%
  step_select_infgain(all_predictors(), scores = TRUE, top_p = 17, outcome = "value") %>%
  step_mutate(data = factor(value, ordered = TRUE)) %>%
  step_dummy(all_nominal(), one_hot = TRUE)
recipe_spec %>% prep() %>% juice() %>% glimpse()
wflw_fit_lightgbm <- workflow() %>%
  add_model(
    boost_tree("regression",
               min_n          = tune(),
               mtry           = tune(),
               trees          = tune(),
               tree_depth     = tune(),
               learn_rate     = tune(),
               loss_reduction = tune(),
               sample_size    = tune()) %>%
      set_engine("lightgbm", num.threads = 20)
  ) %>%
  add_recipe(recipe_spec %>% step_rm(date)) %>%
  tune_grid(resamples = emergency_tscv,
            grid      = 30,
            control   = control_grid(verbose = TRUE, parallel_over = "resamples",
                                     allow_par = TRUE),  # parallel_over = "everything"
            metrics   = metric_set(rmse))
wflw_fit_lightgbm_best_IG_45 <- workflow() %>%
add_model(
boost_tree("regression", min_n = 3,
mtry = 472,
trees = 1724,
tree_depth = 12,
learn_rate = 0.060244791,
loss_reduction = 0.030219957,
sample_size = 0.864302848) %>% set_engine("lightgbm")
) %>%
add_recipe(recipe_spec %>% step_rm(date)) %>%
fit(training(emergency_tscv$splits[[1]]))
wflw_fit_lightgbm_best_IG_45 %>%
extract_fit_parsnip() %>%
pull_importances()
No method for pulling feature importances is defined for _lgb.Booster

Ah, thanks - that's probably because LightGBM as a model isn't supported yet for extracting feature importances. XGBoost is supported, but I haven't added the method for LightGBM yet. You could try the vip package on the extracted model object, although I'm not sure whether it supports LightGBM yet either. I tend to still use XGBoost in R. That said, it should be easy to add a method and I'll take a look.
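For reference, trying vip (or LightGBM's own importance function) on the extracted engine object might look like the sketch below. This is untested: whether vip handles an lgb.Booster depends on the installed vip version, and the fallback assumes lightgbm's lgb.importance() accepts the raw engine fit.

```r
library(vip)

# Attempt 1: vip on the parsnip fit (may fail if no lgb.Booster method exists)
wflw_fit_lightgbm_best_IG_45 %>%
  extract_fit_parsnip() %>%
  vip()

# Attempt 2: fall back to lightgbm's native importance on the raw engine fit
wflw_fit_lightgbm_best_IG_45 %>%
  extract_fit_parsnip() %>%
  purrr::pluck("fit") %>%
  lightgbm::lgb.importance()
```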
Hi,
Yes, LightGBM is not vip-enabled in your package, but that's not what I need. The information gain (IG) selection step should show the selected variables and the respective scores assigned by IG via the scores function, but I cannot extract this information. To demonstrate the results of the IG variable selection, I need to find the table with the variables and scores generated by IG. Can you provide a reprex? The vip package generates variable importance by the vip method, not by IG variable selection.
Note: for the other feature selection methods, like Boruta, MRMR, etc., I also can't see the scores the method assigned to each variable, except via vip, which has a pull_importances function for that.
Grateful
Hello, @stevenpawley @topepo Can you please help me with this question?
Hello, I'm taking a look now. A few things immediately stand out: you shouldn't load both recipeselectors and colino, which replaces it, because that might really mess with things. The same applies to treesnip and bonsai - the latter replaces the former.
Aside from the lack of a pull_importances method that supports LightGBM, you can see which variables were removed by using:
wflw_fit_lightgbm_best_IG_45 %>%
extract_recipe() %>%
tidy(number = 4)
And if you want the scores:
wflw_fit_lightgbm_best_IG_45 %>%
extract_recipe() %>%
tidy(number = 4, type = "scores")
Hi @stevenpawley! Thank you for your time, it helped me a lot! I was able to generate the IG scores, but it is strange that 7 of the variables retained by IG obtained scores of 0.00000... Can you tell me if this is correct? Shouldn't the IG algorithm select only variables that obtained positive scores?
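One possible explanation, stated as an assumption rather than confirmed package behaviour: top_p = 17 asks the step to retain a fixed number of the highest-scoring predictors, so if fewer than 17 predictors have positive information gain, zero-scoring ones are kept to fill the quota. A sketch of filtering the score table down to positive scores (assuming the tidy output has variable and scores columns, and that the selection step is step number 4 as above):

```r
wflw_fit_lightgbm_best_IG_45 %>%
  extract_recipe() %>%
  tidy(number = 4, type = "scores") %>%
  dplyr::filter(scores > 0)
```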
Dear @stevenpawley,
I can't see the scores table with the 'variable' and 'scores' columns containing the variable names and their scores. Could you help me find these values?
# Information gain feature selection ----
After training the model with the above preprocessing step, the scores should be calculated, but I can't find them. Please, can you help me? An example of how to view the scores with the scores function would suffice.