Bootstrapping Cubist and caret-tuned Cubist models fails.

snvv commented 2 months ago

Hello I am having some problems with bootstrapping plain Cubist models and tuning them using caret. Below are the models and the corresponding errors.


## Model 1: Simple Cubist Model

```r
m.cubist.1 <- cubist(
  x = subset(trainData, select = -pESG), 
  y = trainData$pESG
)

Model 2: Tuned Cubist Model with `caret`

# Define training control
fitControl <- trainControl(
  method = "repeatedcv", 
  number = 10, 
  repeats = 3, 
  search = "grid", 
  allowParallel = TRUE
)

# Define parameter grid for tuning
cubistGrid <- expand.grid(
  committees = c(5, 10, 20, 30, 50, 80, 90, 100), 
  neighbors = c(0, 1, 3, 5, 6, 7, 8, 9)
)

# Set seed for reproducibility
set.seed(seed)

# Train the model with `caret`
m.cubist.2 <- train(
  x = subset(trainData, select = -pESG),
  y = trainData$pESG,
  method = "cubist", 
  trControl = fitControl, 
  tuneGrid = cubistGrid
)

Error Messages

When running the models, the following errors were encountered:

Errors for Model 1 & 2:

9% | Creating and analyzing models ETA:  1m
Progress interrupted by purrr_error_indexed condition: ℹ In index: 1.
Caused by error in `cubist.default()`:
! could not find function "cubist.default"
and for Model 2 
! could not find function "train.default"

The error suggests that the functions cubist.default and train.default are not recognized. After reviewing the manual, it's not clear how to set boot_data for these models.

I would appreciate any insights on how to resolve this issue and correctly set up boot_data for the models.

tripartio commented 2 months ago

I would love to try to help you, but as I indicated in https://github.com/tripartio/ale/issues/6#issuecomment-2337559326 and https://github.com/tripartio/ale/issues/6#issuecomment-2345724766, I need a complete reproducible example, starting from library calls and everything. Without this, I cannot figure out what part of your code is not working.

snvv commented 2 months ago

Here is a representative example that produces errors

# Load the necessary library
library(Cubist)
library(caret)
library(ale)
library(dplyr)

# Load tale# Load the Ames housing dataset
data(ames, package = "modeldata")

# Model the data on the log10 scale
ames$Sale_Price <- log10(ames$Sale_Price)

# Set seed for reproducibility
set.seed(11)

# Split the data into training and testing sets
in_train_set <- sample(1:nrow(ames), floor(0.8 * nrow(ames)))

# Define the predictors for the model
predictors <- c(
  "Lot_Area", "Alley", "Lot_Shape", "Neighborhood", "Bldg_Type", 
  "Year_Built", "Total_Bsmt_SF", "Central_Air", "Gr_Liv_Area", 
  "Bsmt_Full_Bath", "Bsmt_Half_Bath", "Full_Bath", "Half_Bath", 
  "TotRms_AbvGrd", "Year_Sold", "Longitude", "Latitude"
)

# Create training and testing datasets
train_pred <- ames[in_train_set, predictors]
test_pred  <- ames[-in_train_set, predictors]

train_resp <- ames$Sale_Price[in_train_set]
test_resp  <- ames$Sale_Price[-in_train_set]

set.seed(123)
library(parallel)
# Calculate the number of cores
no_cores <- detectCores() - 1
library(doParallel)
# create the cluster for caret to use
cl <- makePSOCKcluster(no_cores)
registerDoParallel(cl)

# Define training control
fitControl <- trainControl(
  method = "repeatedcv", 
  number = 10, 
  repeats = 3, 
  search = "grid", 
  allowParallel = TRUE
)

# Define parameter grid for tuning
cubistGrid <- expand.grid(
  committees = c(5, 30, 80, 100), 
  neighbors = c(5, 8, 9)
)

# Set seed for reproducibility
set.seed(123)

data <- cbind(train_pred, train_resp)

# Train the model with `caret`
m.cubist.2 <- caret::train(
  x = select(data, -train_resp),
  y = data$train_resp,
  method = "cubist", 
  trControl = fitControl, 
  tuneGrid = cubistGrid
)

m.cubist.2

pre = predict(m.cubist.2, test_pred, neighbors = m.cubist.2$bestTune$neighbors)

progressr::handlers(global = TRUE)
progressr::handlers('cli')

cubist(data$train_resp~. data, data=data)
mb_cubist <- model_bootstrap(
  test_pred, 
  m.cubist.2,
  boot_it = 10,  # 100 by default but reduced here for a faster demonstration
  parallel = 20
)
colnames(train_pred)

mb_cubist <- model_bootstrap(
  test_pred, 
  m.cubist.2,
  model_call_string = 'caret::train(
    x = select(data, -train_resp),
    y = data$train_resp,
    data = boot_data
)',
  boot_it = 10,  # 100 by default but reduced here for a faster demonstration
  parallel = 14  # CRAN limit (delete this line on your own computer)
)

The error is the following

9% | Creating and analyzing models ETA: 4m  
Progress interrupted by `purrr_error_indexed` condition: ℹ In index: 1.  
Caused by error in `select()`:  
  ! Can't select columns that don't exist.  
  ✖ Column `train_resp` doesn't exist.  

Error in (function (.x, .y, .f, ..., .progress = FALSE): ℹ In index: 1.  
Caused by error in `select()`:  
  ! Can't select columns that don't exist.  
  ✖ Column `train_resp` doesn't exist.

tripartio commented 2 months ago

Thanks for the reproducible example. So, it makes it clear that we really need to back up here. I need to change the direction of the case quite a bit.

The ale package implements two different kinds of ALE bootstrap. The details are explained in my working paper on the package, but I can very simply say this:

For machine learning where you carry out hyperparameter optimization on a training subset and select the best model among many candidates, you bootstrap with the ale() function.
For statistical analysis where you use only one model on the entire dataset, you bootstrap with model_bootstrap().

So, for your use case, you should be working with the ale() function, not model_bootstrap(). This should greatly simplify things for you.

So, coming back to your example, I will first ignore the fact that the line cubist(data$train_resp~. data, data=data) is buggy. I don't know what it's supposed to do and it doesn't seem to be relevant to the example.

First, as I indicated in https://github.com/tripartio/ale/issues/6#issuecomment-2346249943, we need to add the response data (Sale_Price) to the data. Then, after training the model m.cubust.2, here's what I have:

library(ale)

# ale() function requires the y column data included in the dataset
ames_pred <- ames |> 
  dplyr::select(Sale_Price, all_of(predictors))

ale_cubist <- ale(
  ames_pred,  # dataset modified to include the y_col column
  m.cubist.2,
  y_col = 'Sale_Price',  # y_col required for non-standard model
  pred_type = 'raw',  # pred_type for caret::train() regression models is 'raw'
  boot_it = 10,
  parallel = 20
)
#> Warning in checkNumberOfLocalWorkers(workers): Careful, you are setting up 20
#> localhost parallel workers with only 12 CPU cores available for this R process
#> (per 'system'), which could result in a 167% load. The soft limit is set to
#> 100%. Overusing the CPUs has negative impact on the current R process, but also
#> on all other processes of yours and others running on the same machine. See
#> help("parallelly.options", package = "parallelly") for how to override the soft
#> and hard limits
ale_cubist$plots$Year_Built

^{Created on 2024-09-12 with reprex v2.1.1}

(You can ignore the warning; it's just because I only have 12 cores, not 20 like you. But the code still works.)

So, bootstrapping is as simple as specifying the boot_it argument in the ale() function for machine learning models.

tripartio commented 2 months ago

Perhaps I should document the model_bootstrap() function better to clarify its appropriate use cases.

tripartio commented 2 months ago

I forgot to mention: after training the model, please remember to close the parallel clusters before you do anything else (especially before running ale() with its own parallelization:

stopCluster(cl)

Otherwise, there are no cores available for ale() parallelization.

snvv commented 2 months ago

Thank you for your answer. You are correct, some pieces of code was left behind and it was irrelevant to the example. This way, I can get confidence intervals in plot. However we don't have model_stats.

Always I use stopCluster(), however the ale function utilizes only 30-40% of the CPU even the data frame is quite big [8000, 15]

tripartio commented 2 months ago

What exactly were you hoping to get from model_stats? The m.cubist.2 object is a caret::train S3 object, so I'm not sure what you were hoping that model_bootstrap() would do for you that bootstrapped ale() does not do.

tripartio / ale