imbs-hl / ranger

A Fast Implementation of Random Forests
http://imbs-hl.github.io/ranger/

Cholmod error 'problem too large' #548

Open jmpanfil opened 3 years ago

jmpanfil commented 3 years ago

I am running into an error using a sparse matrix.

model <- ranger(data = x, dependent.variable.name = "y", keep.inbag = TRUE, splitrule = "extratrees", quantreg = FALSE, verbose = TRUE, importance = 'impurity', probability = TRUE)

I can't share my data directly but x is a dgCMatrix from the Matrix package with dimensions of (6838778, 354) with 305,025,741 non-zero elements.

I get this error: Error in as.vector(.Call(Csparse_to_vector, x), mode) : Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105. I am running R 4.0.3 with ranger_0.12.1. I have used this exact sparse matrix (minus the y column) with XGBoost without issues.

Is there anything I am doing wrong?

mnwright commented 3 years ago

Hm... That looks like the sparse matrix is converted to dense, which shouldn't happen. Could you try to simulate similar data to create a reproducible example?
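
A quick size check fits the dense-conversion explanation. The sketch below assumes that CHOLMOD's dense allocation is limited to a cell count that fits in a 32-bit signed integer. That limit is an assumption here and is not confirmed anywhere in this thread. The variable names are illustrative only.

```r
# Assumption: a dense copy fails once rows * columns exceeds the
# 32-bit signed integer limit (not confirmed in this thread).
int_max <- .Machine$integer.max   # 2147483647

# Dimensions reported in the issue. Plain numeric literals are doubles
# in R, so the product itself does not overflow.
n_rows <- 6838778
n_cols <- 354
n_cells <- n_rows * n_cols

# TRUE means a dense version of the matrix would not fit under the limit.
too_large <- n_cells > int_max
print(n_cells)
print(too_large)
```

Under that assumption, the 305,025,741 stored non-zero values are not the problem; the full rows-by-columns grid is.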

jmpanfil commented 3 years ago

Sure, here's an example. Heads up: building the sparse matrix uses a lot of RAM (the matrix ends up at 17.6 GB, but I think peak usage while it is being built is a lot higher). You can probably get the same result with fewer rows, but I have it at 10 million rows by 400 columns.

library(Matrix)
library(ranger)

nr <- 10e6
nc <- 400

set.seed(23)
sp <- sparseMatrix(i = sample(1:nr, nr*nc / 2, replace = TRUE), 
                   j = sample(1:nc, nr*nc / 2, replace = TRUE),
                   x = ifelse(runif(nr*nc / 2) < .5, 0, 1))
y <- sample(c(0,1), nr, replace = TRUE)

sp <- cbind(sp, y)
colnames(sp) <- c(paste0('x', 1:nc), 'y')

model <- ranger(data = sp, 
                dependent.variable.name = "y", 
                num.trees = 5,
                keep.inbag = TRUE, 
                splitrule = "extratrees", 
                quantreg = FALSE, 
                verbose = TRUE, 
                importance = 'impurity', 
                probability = TRUE)
Error in as.vector(.Call(Csparse_to_vector, x), mode) : 
  Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105
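
For anyone wanting a smaller reproduction, this sketch estimates how few rows might still trigger the error at the same column count. It assumes the error appears as soon as rows times columns exceeds the 32-bit signed integer limit, which is a guess based on the error message rather than something established above. All names are illustrative.

```r
# Assumption: the error is triggered when rows * columns exceeds
# .Machine$integer.max during a dense conversion.
int_max <- .Machine$integer.max

n_rows_sim <- 10e6
n_cols_sim <- 400 + 1             # 400 features plus the appended y column

# Cell count of the simulated matrix if it were stored densely.
n_cells_sim <- n_rows_sim * n_cols_sim

# Rough size of a dense double matrix with that many cells, in GB (8 bytes per cell).
dense_gb <- n_cells_sim * 8 / 1e9

# Smallest row count at which a 401-column matrix crosses the limit.
min_rows <- floor(int_max / n_cols_sim) + 1

print(n_cells_sim > int_max)
print(dense_gb)
print(min_rows)
```

If the assumption holds, lowering nr to a value just above min_rows should still reproduce the error while using roughly half the memory.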

mnwright commented 3 years ago

This seems to only happen if splitrule = "extratrees". I'll try to find the problem.