Open snvv opened 2 months ago
I would love to try to help you, but as I indicated in https://github.com/tripartio/ale/issues/6#issuecomment-2337559326 and https://github.com/tripartio/ale/issues/6#issuecomment-2345724766, I need a complete reproducible example, starting from library calls and everything. Without this, I cannot figure out what part of your code is not working.
Here is a representative example that produces errors
# Load the necessary library
library(Cubist)
library(caret)
library(ale)
library(dplyr)
# Load tale# Load the Ames housing dataset
data(ames, package = "modeldata")
# Model the data on the log10 scale
ames$Sale_Price <- log10(ames$Sale_Price)
# Set seed for reproducibility
set.seed(11)
# Split the data into training and testing sets
in_train_set <- sample(1:nrow(ames), floor(0.8 * nrow(ames)))
# Define the predictors for the model
predictors <- c(
"Lot_Area", "Alley", "Lot_Shape", "Neighborhood", "Bldg_Type",
"Year_Built", "Total_Bsmt_SF", "Central_Air", "Gr_Liv_Area",
"Bsmt_Full_Bath", "Bsmt_Half_Bath", "Full_Bath", "Half_Bath",
"TotRms_AbvGrd", "Year_Sold", "Longitude", "Latitude"
)
# Create training and testing datasets
train_pred <- ames[in_train_set, predictors]
test_pred <- ames[-in_train_set, predictors]
train_resp <- ames$Sale_Price[in_train_set]
test_resp <- ames$Sale_Price[-in_train_set]
set.seed(123)
library(parallel)
# Calculate the number of cores
no_cores <- detectCores() - 1
library(doParallel)
# create the cluster for caret to use
cl <- makePSOCKcluster(no_cores)
registerDoParallel(cl)
# Define training control
fitControl <- trainControl(
method = "repeatedcv",
number = 10,
repeats = 3,
search = "grid",
allowParallel = TRUE
)
# Define parameter grid for tuning
cubistGrid <- expand.grid(
committees = c(5, 30, 80, 100),
neighbors = c(5, 8, 9)
)
# Set seed for reproducibility
set.seed(123)
data <- cbind(train_pred, train_resp)
# Train the model with `caret`
m.cubist.2 <- caret::train(
x = select(data, -train_resp),
y = data$train_resp,
method = "cubist",
trControl = fitControl,
tuneGrid = cubistGrid
)
m.cubist.2
pre = predict(m.cubist.2, test_pred, neighbors = m.cubist.2$bestTune$neighbors)
progressr::handlers(global = TRUE)
progressr::handlers('cli')
cubist(data$train_resp~. data, data=data)
mb_cubist <- model_bootstrap(
test_pred,
m.cubist.2,
boot_it = 10, # 100 by default but reduced here for a faster demonstration
parallel = 20
)
colnames(train_pred)
mb_cubist <- model_bootstrap(
test_pred,
m.cubist.2,
model_call_string = 'caret::train(
x = select(data, -train_resp),
y = data$train_resp,
data = boot_data
)',
boot_it = 10, # 100 by default but reduced here for a faster demonstration
parallel = 14 # CRAN limit (delete this line on your own computer)
)
The error is the following
9% | Creating and analyzing models ETA: 4m
Progress interrupted by `purrr_error_indexed` condition: ℹ In index: 1.
Caused by error in `select()`:
! Can't select columns that don't exist.
✖ Column `train_resp` doesn't exist.
Error in (function (.x, .y, .f, ..., .progress = FALSE): ℹ In index: 1.
Caused by error in `select()`:
! Can't select columns that don't exist.
✖ Column `train_resp` doesn't exist.
Thanks for the reproducible example. So, it makes it clear that we really need to back up here. I need to change the direction of the case quite a bit.
The ale
package implements two different kinds of ALE bootstrap. The details are explained in my working paper on the package, but I can very simply say this:
ale()
function. model_bootstrap()
.So, for your use case, you should be working with the ale()
function, not model_bootstrap()
. This should greatly simplify things for you.
So, coming back to your example, I will first ignore the fact that the line cubist(data$train_resp~. data, data=data)
is buggy. I don't know what it's supposed to do and it doesn't seem to be relevant to the example.
First, as I indicated in https://github.com/tripartio/ale/issues/6#issuecomment-2346249943, we need to add the response data (Sale_Price
) to the data. Then, after training the model m.cubust.2, here's what I have:
library(ale)
# ale() function requires the y column data included in the dataset
ames_pred <- ames |>
dplyr::select(Sale_Price, all_of(predictors))
ale_cubist <- ale(
ames_pred, # dataset modified to include the y_col column
m.cubist.2,
y_col = 'Sale_Price', # y_col required for non-standard model
pred_type = 'raw', # pred_type for caret::train() regression models is 'raw'
boot_it = 10,
parallel = 20
)
#> Warning in checkNumberOfLocalWorkers(workers): Careful, you are setting up 20
#> localhost parallel workers with only 12 CPU cores available for this R process
#> (per 'system'), which could result in a 167% load. The soft limit is set to
#> 100%. Overusing the CPUs has negative impact on the current R process, but also
#> on all other processes of yours and others running on the same machine. See
#> help("parallelly.options", package = "parallelly") for how to override the soft
#> and hard limits
ale_cubist$plots$Year_Built
Created on 2024-09-12 with reprex v2.1.1
(You can ignore the warning; it's just because I only have 12 cores, not 20 like you. But the code still works.)
So, bootstrapping is as simple as specifying the boot_it
argument in the ale()
function for machine learning models.
Perhaps I should document the model_bootstrap()
function better to clarify its appropriate use cases.
I forgot to mention: after training the model, please remember to close the parallel clusters before you do anything else (especially before running ale()
with its own parallelization:
stopCluster(cl)
Otherwise, there are no cores available for ale()
parallelization.
Thank you for your answer. You are correct, some pieces of code was left behind and it was irrelevant to the example. This way, I can get confidence intervals in plot. However we don't have model_stats.
Always I use stopCluster(), however the ale function utilizes only 30-40% of the CPU even the data frame is quite big [8000, 15]
What exactly were you hoping to get from model_stats
? The m.cubist.2
object is a caret::train
S3 object, so I'm not sure what you were hoping that model_bootstrap()
would do for you that bootstrapped ale()
does not do.
Hello I am having some problems with bootstrapping plain Cubist models and tuning them using
caret
. Below are the models and the corresponding errors.Model 2: Tuned Cubist Model with
caret
Error Messages
When running the models, the following errors were encountered:
The error suggests that the functions
cubist.default
andtrain.default
are not recognized. After reviewing the manual, it's not clear how to setboot_data
for these models.I would appreciate any insights on how to resolve this issue and correctly set up
boot_data
for the models.