stevenpawley / colino

Recipes Steps for Supervised Filter-Based Feature Selection
https://stevenpawley.github.io/colino/

How to view table with score and selected variables with information gain feature selection? #5

Open forecastingEDs opened 1 year ago

forecastingEDs commented 1 year ago

Dear @stevenpawley,

I can't find the scores table, i.e. the one with the 'variable' and 'scores' columns containing the variable names and their information gain scores. Could you help me find these values?

Information gain feature selection ----

recipe_spec <- recipe(value ~ ., 
                      data = training(emergency_tscv$splits[[1]])) %>%
  step_timeseries_signature(date) %>%
  step_rm(matches("(.iso$)|(.xts$)|(.lbl$)|(hour)|(minute)|(second)|(am.pm)|(date_year$)")) %>%
  step_normalize(date_index.num, tempe_verage, tempemin, tempemax, -all_outcomes()) %>%
  step_select_infgain(all_predictors(), top_p = 25, outcome = "value") %>%
  step_mutate(data = factor(value, ordered = TRUE)) %>%
  step_dummy(all_nominal(), one_hot = TRUE)
# Model 1: Xgboost ----
wflw_fit_xgboost <- workflow() %>%
  add_model(
    boost_tree("regression") %>% set_engine("xgboost") 
  ) %>%
  add_recipe(recipe_spec %>% step_rm(date)) %>%
  fit(training(emergency_tscv$splits[[1]]))

After training the model with the above preprocessing step, the scores should have been calculated, but I can't find them. Please, can you help me? An example of how to view the scores would suffice.

stevenpawley commented 1 year ago

Sorry for the delayed response - did you try the 'tidy' method on the extracted recipe object - like:

prepped |> tidy(id = "my_step_id", type = "scores")
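For example, a minimal sketch (reference the step either by the id you gave it or by its position; in your recipe, step_select_infgain is the fourth step):

prepped <- prep(recipe_spec)
tidy(prepped)                               # lists every step with its number and id
tidy(prepped, number = 4, type = "scores")  # scores for the infgain step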
forecastingEDs commented 1 year ago

Hello @stevenpawley, please help me!

This code does not provide the 'variable' and 'scores' columns containing the variable names and their information gain scores. I can't extract this information after training my recipe.

I will provide a reproducible example with my data, but if you prefer, show me an example of how to extract these variables and their information gain scores using the modeldata data.

reprex

Link to download the database used:

https://github.com/forecastingEDs/Forecasting-of-admissions-in-the-emergency-departments/blob/131bd23723a39724ad4f88ad6b8e5a58f42a7960/datasets.xlsx

Reprex (reproducible example)

*** Load the following R packages ----

library(remotes)
remotes::install_github("business-science/modeltime", dependencies = TRUE)
remotes::install_github("business-science/modeltime.ensemble")
remotes::install_github("tidymodels/recipes")
library(recipes)
library(tune)
library(keras)
library(modeltime.ensemble)
library(tidymodels)
library(modeltime)
library(lubridate)
library(tidyverse)
library(timetk)
library(tidyquant)
library(yardstick)
library(reshape)
library(plotly)
library(xgboost)
library(rsample)
library(targets)
library(LiblineaR)
library(parsnip)
library(ranger)
library(kknn)
library(readxl)
library(lifecycle)
library(skimr) 
remotes::install_github("tidymodels/bonsai")
library(bonsai)
library(lightgbm)
remotes::install_github("curso-r/treesnip")
library(treesnip)
library(rio)
library(devtools) 
devtools::install_github("stevenpawley/recipeselectors")
devtools::install_github("stevenpawley/colino")
library(colino)
library(recipeselectors)
library(FSelectorRcpp)
library(care)
library(Boruta)
library(praznik)
library(parallel) 
library(foreach)
library(doParallel)
library(RcppParallel)

Preparing data for preprocessing with recipe

data_tbl <- datasets %>%
  select(id, Date, attendences, average_temperature, min, max,  sunday, monday, tuesday, wednesday, thursday, friday, saturday, Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec) %>%
  set_names(c("id", "date", "value","tempe_verage", "tempemin", "tempemax", "sunday", "monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"))

data_tbl

Full = Training + Forecast Datasets

full_data_tbl <- datasets %>%
  select(id, Date, attendences, average_temperature, min, max, sunday, monday, tuesday, wednesday, thursday, friday, saturday, Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec) %>%
  set_names(c("id", "date", "value", "tempe_verage", "tempemin", "tempemax", "sunday", "monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")) %>%
  # Apply group-wise time series manipulations
  group_by(id) %>%
  future_frame(
    .date_var   = date,
    .length_out = "45 days",
    .bind_data  = TRUE
  ) %>%
  ungroup() %>%
  # Consolidate IDs
  mutate(id = fct_drop(id))

Training Data

data_prepared_tbl <- full_data_tbl %>%
  filter(!is.na(value))

Forecast Data

future_tbl <- full_data_tbl %>%
  filter(is.na(value))

data_prepared_tbl %>% glimpse()

** Summary diagnostics. Let us check the regularity of all time series with timetk::tk_summary_diagnostics()

** Check the summary of the time series data for the training set

data_prepared_tbl %>%
  group_by(id) %>%
  tk_summary_diagnostics()

Data Splitting ----

Now we set aside the future data (we will only need it later, when we make the forecast)

and focus on the training data

* 4.1 Panel Data Splitting ----

Split the dataset into analysis/assessment sets

emergency_tscv <- data_prepared_tbl %>%
  time_series_cv(
    date_var    = date, 
    assess      = "45 days",
    skip        = "30 days",
    cumulative  = TRUE,
    slice_limit = 5
  )

emergency_tscv
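As a quick optional check of the resampling plan, timetk can flatten and plot the splits:

emergency_tscv %>% tk_time_series_cv_plan()

emergency_tscv %>%
  plot_time_series_cv_plan(date, value, .interactive = FALSE)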

Feature Selection and preprocessing ----

# Information gain feature selection ----
recipe_spec <- recipe(value ~ ., 
                      data = training(emergency_tscv$splits[[1]])) %>%
  step_timeseries_signature(date) %>%
  step_rm(matches("(.iso$)|(.xts$)|(.lbl$)|(hour)|(minute)|(second)|(am.pm)|(date_year$)")) %>%
  step_normalize(date_index.num, date_mday7, date_week4, date_week3, date_week2, date_week, date_mweek, date_yday, date_qday, date_mday, date_wday, date_day, date_month, date_quarter, date_half, tempe_verage, tempemin, tempemax, -all_outcomes()) %>%
  step_select_infgain(all_predictors(), scores = TRUE, top_p = 17, outcome = "value") %>%
  step_mutate(data = factor(value, ordered = TRUE)) %>%
  step_dummy(all_nominal(), one_hot = TRUE)

recipe_spec %>% prep() %>% juice() %>% glimpse()

Model 1: grid search LightGBM ----

wflw_fit_lightgbm <- workflow() %>%
  add_model(
    boost_tree("regression", min_n = tune(),
               mtry = tune(),
               trees = tune(),
               tree_depth = tune(),
               learn_rate = tune(),
               loss_reduction = tune(),
               sample_size = tune()) %>% set_engine("lightgbm", num_threads = 20)
  ) %>%
  add_recipe(recipe_spec %>% step_rm(date)) %>%
  tune_grid(resamples = emergency_tscv, grid = 30,
            control = control_grid(verbose = TRUE, parallel_over = "resamples", allow_par = TRUE),  # or parallel_over = "everything"
            metrics = metric_set(rmse))

Best model candidate selected

wflw_fit_lightgbm_best_IG_45 <- workflow() %>%
  add_model(
    boost_tree("regression", min_n = 3,
               mtry = 472,
               trees = 1724,
               tree_depth = 12,
               learn_rate = 0.060244791,
               loss_reduction = 0.030219957,
               sample_size = 0.864302848) %>% set_engine("lightgbm")
  ) %>%
  add_recipe(recipe_spec %>% step_rm(date)) %>%
  fit(training(emergency_tscv$splits[[1]]))
wflw_fit_lightgbm_best_IG_45 %>% 
  extract_fit_parsnip() %>% 
  pull_importances()

I tried extracting with the pull_importances() function, but that throws an error:

No method for pulling feature importances is defined for _lgb.Booster

After training the model with the above preprocessing step, the scores should have been calculated, but I can't find them. Please, can you help me? An example of how to view the scores would suffice.

stevenpawley commented 1 year ago

Ah, thanks. That's probably because LightGBM isn't supported yet for extracting feature importances. XGBoost is supported, but I haven't added the method for LightGBM yet. You could try the vip package on the extracted model object, although I'm not sure whether it supports LightGBM yet either. I tend to still use XGBoost in R. That said, it should be easy to add a method and I'll take a look.
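In the meantime, something along these lines should work, since lightgbm ships its own importance function (an untested sketch; extract_fit_engine() should return the underlying lgb.Booster):

booster <- wflw_fit_lightgbm_best_IG_45 %>% extract_fit_engine()
lightgbm::lgb.importance(booster)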

forecastingEDs commented 1 year ago

Hi,

Yes, LightGBM is not supported by vip or by your package, but that's not what I need. The information gain (IG) selection step should show the selected variables and the scores IG assigned to them, but I cannot extract this information. To demonstrate the results of the IG feature selection, I need the table of variables and scores generated by the IG step. Can you provide a reprex? The vip package would give variable importances from the fitted model, not the IG selection scores. Note: for the other feature selection methods, like Boruta and MRMR, I also can't see the scores the method assigned to each variable; pull_importances() only covers model-based importances.

Grateful

forecastingEDs commented 11 months ago

Hello, @stevenpawley @topepo Can you please help me with this question?

stevenpawley commented 11 months ago

Hello, I'm taking a look now. A few things immediately stand out: you shouldn't load both recipeselectors and colino, because colino replaces it and loading both might really mess with things. The same applies to treesnip and bonsai; the latter replaces the former.

Aside from the lack of a pull_importances method that supports LightGBM, you can see which variables were removed by using:

wflw_fit_lightgbm_best_IG_45 %>% 
  extract_recipe() %>% 
  tidy(number = 4)

And if you want the scores:

wflw_fit_lightgbm_best_IG_45 %>% 
  extract_recipe() %>% 
  tidy(number = 4, type = "scores")
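Here number = 4 refers to the position of step_select_infgain in your recipe; if you are unsure of the position, listing the steps first shows each step's number and id:

wflw_fit_lightgbm_best_IG_45 %>%
  extract_recipe() %>%
  tidy()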
forecastingEDs commented 11 months ago

Hi @stevenpawley! Thank you for your time, it helped me a lot! I was able to generate the IG scores, but it is strange that 7 of the variables retained by IG obtained scores of 0.00000... Can you tell me if this is correct? Shouldn't the IG algorithm select only the variables that obtained positive scores?