suiji / Arborist

Scalable decision tree training and inference.

Rborist segfaults on Mac when run rapidly in succession #37

Closed: stevenbagley closed this issue 6 years ago

stevenbagley commented 6 years ago

On macOS 10.12.6 with R 3.4.3 and Rborist 0.1-8, run the following code chunk:

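library(Rborist)
data(iris)

run <- function(n) {
    for (i in seq_len(n)) {
        cat(i, " ")
        # Sample 30 row indices. Note that these 30 rows become the
        # training set and the remaining 120 rows the validation set.
        validation_index <- sample(seq_len(nrow(iris)), 30)
        validation_data <- iris[-validation_index, ]
        training_data <- iris[validation_index, ]
        rb <- Rborist(training_data[, -5], training_data[, 5])
    }
}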

Then call run(100).

It completes 10 or 20 iterations, then segfaults:

> run(100)
1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  
 *** caught segfault ***
address 0x7fc7561143c0, cause 'memory not mapped'

Traceback:
 1: doTryCatch(return(expr), name, parentenv, handler)
 2: tryCatchOne(expr, names, parentenv, handlers[[1L]])
 3: tryCatchList(expr, classes, parentenv, handlers)
 4: tryCatch(.Call("RcppTrainCtg", predBlock, preFormat$rowRank,     y, nTree, nSamp, rowWeight, withRepl, treeBlock, minNode,     minInfo, nLevel, maxLeaf, predFixed, splitQuant, probVec,     autoCompress, thinLeaves, FALSE, classWeight), error = function(e) {    stop(e)})
 5: Rborist.default(training_data[, -5], training_data[, 5], classWeight = 1:3)
 6: Rborist(training_data[, -5], training_data[, 5], classWeight = 1:3)
 7: run(100)

suiji commented 6 years ago

Thank you for providing a test case. Will try to reproduce.

suiji commented 6 years ago

Reproduces with 0.1-9 under Fedora 26.

Investigating.

suiji commented 6 years ago

This is an error in the splitting method for compressed predictor values. Until a fix is released, it should be possible to work around the error by turning off autocompression. To do so, pass autoCompress = 1.0 to Rborist().
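A minimal sketch of the suggested workaround, assuming an affected Rborist release (0.1-8 or 0.1-9 per this thread). It reruns the reproducer with autoCompress = 1.0 so that predictor values are not compressed. The function name run_workaround is hypothetical, and this sketch has not been confirmed by the maintainer to avoid the crash in every case.

```r
library(Rborist)
data(iris)

# Hypothetical helper: same loop as the reproducer, but with
# autocompression turned off via autoCompress = 1.0 (the workaround
# suggested above for the faulty compressed-value splitting path).
run_workaround <- function(n) {
  rb <- NULL
  for (i in seq_len(n)) {
    train_index <- sample(seq_len(nrow(iris)), 30)
    training_data <- iris[train_index, ]
    rb <- Rborist(training_data[, -5], training_data[, 5],
                  autoCompress = 1.0)
  }
  invisible(rb)  # return the last trained model for inspection
}

rb <- run_workaround(100)
```

Only the autoCompress argument differs from the original reproducer, so any change in behavior can be attributed to disabling compression.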

suiji commented 6 years ago

The error appears to be repaired. One of the sparse splitting methods used a bad initialization. These methods are somewhat fragile and should be reimplemented more cleanly.

run(10000) now completes without incident.

Please feel free to reopen this issue, as needed. Again, thank you for isolating the problem.