suiji / Arborist

Scalable decision tree training and inference.

Rborist Error in doTryCatch(return(expr), name, parentenv, handler) : std::bad_alloc #36

gse-cc-git closed this issue 6 years ago

gse-cc-git commented 6 years ago

Hello, running a classification job (~30 classes) on a dataset of ~4.5M rows and 8 predictors (both categorical and quantitative) succeeds with nTree = 50, but returns the following error when the number of trees is raised (e.g., 500 fails).

Error in doTryCatch(return(expr), name, parentenv, handler) :    std::bad_alloc

Any advice would be helpful.

Also: is there any way to combine forests in a manner similar to randomForest::combine?
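For reference, a minimal sketch of the randomForest::combine behaviour being asked about, i.e. merging separately trained forests into one ensemble (toy data used here only for illustration):

library(randomForest)

# Two forests grown independently on the same predictors and response
rf1 <- randomForest(Species ~ ., data = iris, ntree = 50)
rf2 <- randomForest(Species ~ ., data = iris, ntree = 50)

# combine() concatenates the trees into a single 100-tree forest
rf_all <- randomForest::combine(rf1, rf2)
rf_all$ntree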

Thank you very much.

suiji commented 6 years ago

It looks like the session is out of memory, although the dimensions of your data do not seem particularly large. Does reducing the number of rows, for example, raise the threshold on the number of trees that can be successfully trained? Can you provide a minimal example? Otherwise it's down to guesswork.

There is no feature analogous to "combine", but this could easily be supported. It is being added to the TODO file. Thank you for the suggestion.

gse-cc-git commented 6 years ago

You are right; my description of the issue was not very helpful. I was puzzled by the meaning of the error message. I successfully ran Rborist on a similar amount of data in the same environment (RStudio Server, allowing 64 GB of RAM). The difference was that the predictors were exclusively categorical and declared as factors. A run of ~20 minutes delivered a final object taking 21 GB in memory.

I ran the dataset with numerical predictors outside the RStudio environment, i.e. through "bare R", and could access the server's full memory. While monitoring the process with htop I saw a peak of ~130 GB of virtual (VIRT) memory. After hours, the final model saved as RDS is almost 10 GB for 1000 trees.

I was obviously wrong in thinking that the memory footprint would be similar in the fully categorical and fully numerical cases.

Thank you for your great contribution.

suiji commented 6 years ago

Both cases exhibit what seems like excessive memory consumption for such a modestly-sized data set.

Have you tried setting 'thinLeaves=TRUE'? This bypasses creation and saving of auxiliary information used by quantile regression and feature contribution. Perhaps this setting should be the default.
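A minimal sketch of the call, with a toy stand-in for the real data (the only point here is the thinLeaves argument):

library(Rborist)

# Toy data: 8 numeric predictors, a many-class categorical response
x <- data.frame(matrix(rnorm(1000 * 8), ncol = 8))
y <- factor(sample(letters, 1000, replace = TRUE))

# thinLeaves = TRUE skips the per-leaf auxiliary data used by
# quantile regression and feature contribution
rb <- Rborist(x, y, nTree = 500, thinLeaves = TRUE)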

suiji commented 6 years ago

The slow training time for your 1000-tree case is probably due to swapping. Once again, turning on 'thinLeaves' should cut way down on memory consumption and thus prevent swapping.

Of course, this does not solve the problem of excessive memory usage when auxiliary information actually is desired. We might want the auxiliary information to go straight to disk.

gse-cc-git commented 6 years ago

In each case I used thinLeaves=TRUE without any success. Do you think using the data.table package, which makes use of several threads when available, could mess up the process?

suiji commented 6 years ago

In each case I used thinLeaves=TRUE without any success.

Is there any reduction in memory usage or execution time in either case?

We have done a lot of timings with the flight-delay benchmark. It employs a categorical response and has characteristics similar to your data, with 8 predictors mixed between categorical and numeric. Training 10 million rows on 100 trees has a high-water mark of roughly 15 GB when 'thinLeaves' is set to TRUE. So, using the benchmark as a guide, your case ought to be using on the order of 10 GB. Can you confirm this estimate with your own data when restricting to 100 trees? This would give me some confidence that we are on the same page.

It seems unlikely that data.table is not freeing its threads. One way to rule this out, though, would be to break up training into two steps. Assuming that the data frame resides in variable 'x' and that the response resides in 'y', proceed as follows:
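A minimal sketch of the two steps, mirroring the PreFormat and Rborist calls that appear in the script shared later in this thread:

library(Rborist)

# Step 1: preformat the predictors; any work data.table does on 'x'
# is finished here, before training starts
pf <- PreFormat(x)

# Step 2: train from the preformatted block
rb <- Rborist(pf, y, nTree = 100, thinLeaves = TRUE)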

By preformatting 'x', all of data.table's work should be concluded before actual training begins. You might also want to check memory and thread usage before, between and after executing these commands.
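One way to spot-check this between steps (a sketch; getDTthreads() assumes a data.table version that exposes it):

gc()                        # R's own memory bookkeeping after each step
data.table::getDTthreads()  # how many threads data.table is configured to use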

We have also observed egregious slowdowns of this sort when a zombie process is still running. In particular, we have seen amazing degradations when R is restarted after premature termination of an earlier session.

suiji commented 6 years ago

As already mentioned, with neither a test case nor a collaborative effort, it will be down to guesswork. Nonetheless, some observations are in order:

The Arborist currently does not have out-of-core support. In particular, the entire forest resides in memory during either training or prediction. Out-of-core support has been on the TODO list for some time and we hope to provide it in future.

There are several options available to the user that can help reduce memory footprint. In addition to 'thinLeaves', which prunes summary information, there are also ways to control tree size, albeit possibly affecting predictive quality. These are outlined in the vignette.
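For example (a sketch, reusing 'pf' and 'y' from the preformatting sketch above; minNode and nLevel are training arguments governing node size and tree depth):

# Smaller, shallower trees shrink the forest, possibly at some cost
# in predictive quality
rb_small <- Rborist(pf, y, nTree = 1000, thinLeaves = TRUE,
                    minNode = 20,  # require larger nodes before splitting
                    nLevel = 10)   # cap the number of tree levels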

Using the flight-delay data set as a rough guide, a 50 GB footprint might be expected for similar data sets with 4.5 million rows run with 1000 trees. The reported high-water mark of 120 GB seems gargantuan, though. This might be due to a spike in memory consumption during the process of copying the crescent forest into its final form. In any case, the slow training time for the 1000-tree case is one hallmark of performance degradation due to swapping.

Please feel free to reopen this issue as needed.

suiji commented 6 years ago

Moving this Issue to a new thread, "Memory spike on write".

gse-cc-git commented 6 years ago

Some more details for a collaborative effort and reproducibility, if of any help.

My script:

library(data.table)
library(Rborist)
dt <- readRDS("dt.rds")
ypred <- dt$series

pf1 <- PreFormat(dt[,.SD,.SDcols=-c("x","y","series")])

rm(dt);gc()

started.at=proc.time()
rb_xy <- Rborist(pf1,ypred,nTree=100,thinLeaves=TRUE)
saveRDS(rb_xy,"rb_pf1_nTree100.rds")
sink("duree_rborist.txt")
cat('Finished in',data.table::timetaken(started.at))

Sometimes htop shows 56 R processes with a ~25 GB VIRT memory footprint; other times it fails right away. Each time, the following error message appears:

 *** caught segfault ***
address 0x68bc0704, cause 'memory not mapped'

Traceback:
 1: .Call("RcppTrainCtg", predBlock, preFormat$rowRank, y, nTree,     nSamp, rowWeight, withRepl, treeBlock, minNode, minInfo,     nLevel, maxLeaf, predFixed, splitQuant, probVec, autoCompress,     thinLeaves, FALSE, classWeight)
 2: doTryCatch(return(expr), name, parentenv, handler)
 3: tryCatchOne(expr, names, parentenv, handlers[[1L]])
 4: tryCatchList(expr, classes, parentenv, handlers)
 5: tryCatch(.Call("RcppTrainCtg", predBlock, preFormat$rowRank,     y, nTree, nSamp, rowWeight, withRepl, treeBlock, minNode,     minInfo, nLevel, maxLeaf, predFixed, splitQuant, probVec,     autoCompress, thinLeaves, FALSE, classWeight), error = function(e) {    stop(e)})
 6: Rborist.default(pf1, ypred, nTree = 100, thinLeaves = TRUE)
 7: Rborist(pf1, ypred, nTree = 100, thinLeaves = TRUE)

The environment:

sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS: /opt/R/3.4.3/lib64/R/lib/libRblas.so
LAPACK: /opt/R/3.4.3/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=fr_FR.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=fr_FR.UTF-8        LC_COLLATE=fr_FR.UTF-8    
 [5] LC_MONETARY=fr_FR.UTF-8    LC_MESSAGES=fr_FR.UTF-8   
 [7] LC_PAPER=fr_FR.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] Rborist_0.1-9       Rcpp_0.12.14        data.table_1.10.4-3

loaded via a namespace (and not attached):
[1] compiler_3.4.3

@suiji, I can share the dataset through a large-file sharing system (~300 MB zipped RDS).

I suspect this issue may be related to the other applications run on the server by other users, but I don't know how to approach the admin on that subject. However, the job scheduler should manage competition between processes, shouldn't it?

gse-cc-git commented 6 years ago

OK. The problem seems to be solved by the autoCompress = 1.0 trick, which turns off auto-compression by setting an unattainable threshold, as suggested here. I also had to run the command outside of RStudio, i.e. by invoking R in a terminal. It took around 25 minutes with 100 trees. I will try with 1000 trees.
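For reference, the workaround boils down to adding that one argument to the training call from the script above:

rb_xy <- Rborist(pf1, ypred, nTree = 100, thinLeaves = TRUE,
                 autoCompress = 1.0)  # threshold never reached, so auto-compression stays off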

suiji commented 6 years ago

This particular autocompression problem appears to be fixed in the latest version, in case you would like to clone it from GitHub. It's a relief to learn that hammering the threshold gives you a workaround in the meantime.

25 minutes seems slow, based on your description of the problem set, but if you are running on a loaded system, then all bets are off. Memory footprint also looks obscene, even given the fact that the version you are using has a spike.