kogalur / randomForestSRC

DOCUMENTATION:
https://www.randomforestsrc.org/
GNU General Public License v3.0

Partial Function maxes CPU and crashes R on large random forest #443

Open sarahsmithtripp opened 3 weeks ago

sarahsmithtripp commented 3 weeks ago

Hi there,

I have a large and complex random forest that I can fit using rfsrc: 30,000 observations, 90 explanatory variables, and one class-based Y variable. I am trying to get partial dependence plots for some of these variables but cannot, because I believe the call is maxing out my CPU. I was able to run plot.variable(partial = TRUE), but my response is class based and I am interested in how the importance of the X variable changes with respect to each class of Y.

I've included an adjusted version of the iris dataset that reproduces the issue. Note that it sometimes runs for multiple hours before it crashes! Do you have any advice on how to extract this partial dependence for each class?

Thank you so much,
Sarah Smith-Tripp

library(randomForestSRC)

# Load the iris dataset
data(iris)

# Set a seed for reproducibility
set.seed(123)

# Expand the iris dataset to 10,000 points by repeating and adding noise
num_repeats <- ceiling(10000 / nrow(iris))
iris_expanded <- iris[rep(1:nrow(iris), num_repeats), ]

# Add some noise to the numeric variables to introduce slight variations
for (col in 1:4) {
  iris_expanded[, col] <- iris_expanded[, col] + rnorm(nrow(iris_expanded), mean = 0, sd = 0.1)
}

# Truncate the dataset to exactly 10,000 rows
iris_expanded <- iris_expanded[1:10000, ]

# Add 45 numeric variables
for (i in 1:45) {
  iris_expanded[[paste0("Var", i)]] <- rnorm(nrow(iris_expanded))
}

# Add 5 factor variables with 3 levels each
for (i in 46:50) {
  iris_expanded[[paste0("Factor", i-45)]] <- factor(sample(1:3, nrow(iris_expanded), replace = TRUE), 
                                                    labels = c("Level1", "Level2", "Level3"))
}

# View the modified dataset
head(iris_expanded)

# Fit a classification forest on the expanded data
iris.obj <- rfsrc(Species ~ ., data = iris_expanded)

## partial effect for sepal length
partial.obj <- randomForestSRC::partial(iris.obj,
                  partial.xvar = "Sepal.Length",
                  partial.values = iris.obj$xvar$Sepal.Length,
                  get.tree = 1:500)

## extract partial effects for each species outcome
pdta1 <- get.partial.plot.data(partial.obj, target = "setosa")
pdta2 <- get.partial.plot.data(partial.obj, target = "versicolor")
pdta3 <- get.partial.plot.data(partial.obj, target = "virginica")

sessionInfo()
R version 4.4.0 (2024-04-24 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 22631)

Matrix products: default

locale:
[1] LC_COLLATE=English_Canada.utf8  LC_CTYPE=English_Canada.utf8
[3] LC_MONETARY=English_Canada.utf8 LC_NUMERIC=C
[5] LC_TIME=English_Canada.utf8

time zone: America/Vancouver
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] compiler_4.4.0 cli_3.6.3      jsonlite_1.8.8 rlang_1.1.4

kogalur commented 3 weeks ago

Let us take a look at this and get back to you. Thank you for the detailed example.

ishwaran commented 3 weeks ago

The supplied example requests the partial plot at every observed value of the conditioning variable. Each partial value requires a prediction pass over all n observations, so the prediction sample size is O(n^2), which becomes a very large CPU task when n is large.
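
To put rough numbers on that scaling, here is a back-of-the-envelope count in base R. It assumes n = 10,000 as a round figure for the example and counts one prediction per observation per partial value, which is a simplification that ignores the number of trees and any internal optimizations.

```r
# Rough workload count for a partial call (illustrative assumption:
# every partial value triggers one prediction for each of the n observations).
n <- 10000              # observations in the forest (round figure)
n_values_all  <- n      # original call: one partial value per observation
n_values_grid <- 100    # subsampled call: 100 partial values

work_all  <- n * n_values_all    # 1e8 predictions
work_grid <- n * n_values_grid   # 1e6 predictions

work_all / work_grid             # 100-fold reduction
```

Under the same assumption, the 30,000-observation forest from the original post would need 9e8 predictions if every observed value were used.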

The simple fix is to replace:

partial.values = iris.obj$xvar$Sepal.Length,

with something like:

partial.values = iris.obj$xvar$Sepal.Length[sample(1:iris.obj$n, 100, replace = FALSE)],
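
For anyone who wants a complete call to adapt, below is a minimal sketch of the same idea on the plain iris data (kept small so it runs quickly). Instead of a random subsample it uses a grid of 25 quantiles of Sepal.Length. That grid is my own substitution, not the suggestion above, and the grid size and the 5%-95% range are arbitrary assumptions.

```r
library(randomForestSRC)

set.seed(123)

# Small classification forest on the unmodified iris data
iris.obj <- rfsrc(Species ~ ., data = iris)

# Reduced set of partial values: 25 quantiles between the 5th and 95th
# percentiles (an arbitrary choice), with duplicates removed
x <- iris.obj$xvar$Sepal.Length
grid <- unique(quantile(x, probs = seq(0.05, 0.95, length.out = 25),
                        names = FALSE))

# Partial effect of Sepal.Length evaluated only at the grid values
partial.obj <- partial(iris.obj,
                       partial.xvar = "Sepal.Length",
                       partial.values = grid)

# Extract the partial effect for each species outcome
classes <- levels(iris$Species)
pdta <- lapply(classes, function(cl) {
  get.partial.plot.data(partial.obj, target = cl)
})
names(pdta) <- classes
```

Each element of pdta can then be plotted against its x values, for example plot(pdta$setosa$x, pdta$setosa$yhat, type = "b"). A quantile grid spreads the partial values across the range of the variable, whereas a random subsample follows the data density. Either way the cost scales with the number of partial values rather than with n.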