marjoleinF / pre

an R package for deriving Prediction Rule Ensembles

Feature: Sparse rules matrix. #12

Closed Weekend-Warrior closed 6 years ago

Weekend-Warrior commented 6 years ago

Hey Marjolein,

Great implementation, thank you for your contribution! I was wondering if you had considered a sparse matrix structure for your rules? Perhaps it would save memory and compute time for larger data?

Stewart

boennecd commented 6 years ago

I think it will be tedious in the tree-growing part unless a package is used that can handle sparse design matrices when fitting the trees, or that works from disk instead of RAM. E.g., xgboost can be used for the latter. Though, I assume this is not the bottleneck.

However, you have a good point for around this line https://github.com/marjoleinF/pre/blob/f7fc0d99eeca82b513a1656ad7e0175f8f1cf7d2/R/pre.R#L572. It may create quite a large logical matrix. Instead, Matrix::sparse.model.matrix may be used, which may reduce the memory usage. This will not yield any saving, though, if each rule has an exact (or close to exact) complement, since each non-zero entry requires an additional integer value (if I remember correctly).

Further, the glmnet function does seem to support sparse design matrices, so a few changes around here https://github.com/marjoleinF/pre/blob/f7fc0d99eeca82b513a1656ad7e0175f8f1cf7d2/R/pre.R#L754 may reduce the memory requirement later on.
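
To illustrate the idea (a minimal sketch with simulated 0/1 rule indicators, not pre's internals): a dense rules matrix can be stored as a sparse dgCMatrix, and glmnet accepts such a matrix directly.

library(Matrix)
library(glmnet)

set.seed(1)
n <- 5000
## hypothetical rule indicators: most rules cover only a small fraction of observations
rules_dense  <- matrix(rbinom(n * 200, 1, 0.05), nrow = n,
                       dimnames = list(NULL, paste0("rule", 1:200)))
rules_sparse <- Matrix(rules_dense, sparse = TRUE)

object.size(rules_dense)   # dense storage: one value per cell
object.size(rules_sparse)  # dgCMatrix: only the non-zero entries are stored
## if every rule also has a near-exact complement, roughly half the columns
## are mostly ones and the sparse format saves little

## cv.glmnet accepts a sparse x directly
y   <- rnorm(n)
fit <- cv.glmnet(rules_sparse, y)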

boennecd commented 6 years ago

I have implemented sparse design matrices in this branch. Here is an example

set.seed(32214026)
## simulate p predictors with shifted sine effects on the outcome, plus noise
n <- 1000
p <- 10
X <- matrix(rnorm(n * p), ncol = p, dimnames = list(NULL, paste0("X", 1:p)))
Y <- rowSums(t(sin(t(X) + 2 * pi / (p - 1) * (seq_len(p) - 1) - pi)))
Y <- Y + rnorm(n, sd = sd(Y))

library(pre)
df <- data.frame(Y = Y, X)
## fit the ensemble with dense and sparse design matrices, profiling both runs
set.seed(seed <- 89493792)
p1 <- profvis::profvis(f1 <- pre(Y ~ ., data = df))
set.seed(seed)
p2 <- profvis::profvis(f2 <- pre(Y ~ ., data = df, sparse = TRUE))
p1

p2

Notice the reduction in computation time. The results are almost the same:

f1
#R Final ensemble with cv error within 1se of minimum: 
#R   lambda =  0.05040629
#R   number of terms = 76
#R   mean cv error (se) = 4.116957 (0.1307587)
#R [output abbreviated]
f2
#R Final ensemble with cv error within 1se of minimum: 
#R   lambda =  0.05040629
#R   number of terms = 76
#R   mean cv error (se) = 4.116957 (0.1307587)
#R [output abbreviated]

though sometimes the complement of a rule is used, as it may be sparser (the models should be identical, however, since it is just a matter of dummy coding):

all.equal(f1$glmnet.fit$glmnet.fit$beta, f2$glmnet.fit$glmnet.fit$beta)
#R [1] "Attributes: < Component “x”: Mean relative difference: 0.1409747 >"
do_diff <- f1$rules$description != f2$rules$description
head(cbind(f1$rules$description, f2$rules$description)[do_diff, ], 5)
#R      [,1]                      [,2]                        
#R [1,] "X6 <= 0.152972770478728" "!(X6 <= 0.152972770478728)"
#R [2,] "X6 <= 0.599537416561984" "!(X6 <= 0.599537416561984)"
#R [3,] "X9 <= 1.01063047110884"  "!(X9 <= 1.01063047110884)" 
#R [4,] "X6 <= 0.344295222323621" "!(X6 <= 0.344295222323621)"
#R [5,] "X6 <= 0.16736367293179"  "!(X6 <= 0.16736367293179)"
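
To see why the coding makes no difference for the fit (a toy illustration, not pre code): a rule and its complement are the same indicator up to recoding, so only the intercept and the sign of the coefficient change. The same holds for the penalized fit, since the intercept is unpenalized and the lasso penalty is unchanged by a sign flip.

x <- rnorm(100)
y <- rnorm(100)
r  <- as.numeric(x <= 0.5)      # a rule
rc <- as.numeric(!(x <= 0.5))   # its complement: rc == 1 - r
f_r  <- lm(y ~ r)
f_rc <- lm(y ~ rc)
all.equal(fitted(f_r), fitted(f_rc))  # identical fits; only the coefficient sign flips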

Your issue, though, was mainly about memory usage. The final model matrix is much smaller:

object.size(f1$modmat) / 10^6 # size in MB
#R 14.5 bytes
object.size(f2$modmat) / 10^6 # size in MB
#R 5.3 bytes

I tried to step through the code with a debugger, using gc(reset = TRUE) as described here. The memory usage seems to peak at roughly 3 times the size of modmat in the above example. See also the issue here. I also tried the peakRAM package, which yields

library(peakRAM)
peakRAM(
  pre(Y ~ ., data = df                ), 
  pre(Y ~ ., data = df, sparse = TRUE))
#R                  Function_Call Elapsed_Time_sec Total_RAM_Used_MiB Peak_RAM_Used_MiB
#R 1             pre(Y~.,data=df)            22.24               14.7             105.7
#R 2 pre(Y~.,data=df,sparse=TRUE)            17.56                4.7             106.1

which does not seem to be the correct peak memory usage, from what I gather.
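
For reference, the gc-based check mentioned above looks roughly like this (a sketch, reusing df from the example; the "max used" column of gc() reports the peak since the last reset):

gc(reset = TRUE)                          # reset the "max used" counters
f <- pre(Y ~ ., data = df, sparse = TRUE)
gc()                                      # "max used" now shows the peak during the fit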

If you still have issues with memory usage, consider using fewer and shallower trees. Also consider the gradient boosting option with a higher learning rate, as that may mean you need fewer trees.
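
For instance (a sketch; argument names as in the current pre() interface, so check ?pre for your installed version):

f_small <- pre(Y ~ ., data = df,
               ntrees    = 250,    # fewer trees
               maxdepth  = 2,      # shallower trees, hence shorter rules
               learnrate = 0.05)   # higher learning rate for the boosting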

marjoleinF commented 6 years ago

The sparse branch is now merged into the master branch, so I'm closing the issue.