suiji / Arborist

Scalable decision tree training and inference.

A performance issue of Rborist #52

Closed · flippercy closed this 3 years ago

flippercy commented 4 years ago

Dear Mark:

I have used Rborist for a while and found that it often underperforms compared with other R packages for random forests, such as randomForest and ranger.

For example, I built three random forests on the Breast Cancer Coimbra data set from UCI (https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Coimbra) with Rborist, randomForest and ranger. The code is below:

```r
library(ranger)
library(randomForest)
library(Rborist)

BreastCancerData <- read.csv(file = "P:/..../Data/BreastCancerCoimbra.csv")  # I renamed the last column to 'Diagnosis' in the csv file

train.idx <- sample(nrow(BreastCancerData), 2/3 * nrow(BreastCancerData))

targetVariable <- 'Diagnosis'

BreastCancer.train <- BreastCancerData[train.idx, ]
BreastCancer.test  <- BreastCancerData[-train.idx, ]

ranger.BreastCancer <- ranger(Diagnosis ~ ., data = BreastCancer.train)
pred.BreastCancer.ranger <- predict(ranger.BreastCancer, data = BreastCancer.test)

randomForest.BreastCancer <- randomForest(Diagnosis ~ ., data = BreastCancer.train)
pred.BreastCancer.randomForest <- as.vector(predict(randomForest.BreastCancer, newdata = BreastCancer.test))

rborist.BreastCancer <- Rborist(x = BreastCancer.train[, 1:9], y = BreastCancer.train[, 10])
pred.BreastCancer.rborist <- predict(rborist.BreastCancer, newdata = BreastCancer.test)
```

The performance of the three models on the test dataset is:

ranger.BreastCancer: KS = 0.571, AUC = 0.854
randomForest.BreastCancer: KS = 0.563, AUC = 0.857
rborist.BreastCancer: KS = 0.516, AUC = 0.831

As you can see above, with default settings the model built by Rborist was about 2 points worse in AUC and about 5 points worse in KS than the other two models. When I built models on my own dataset, the gap was even wider.

Could you tell me what caused this gap? Any idea on how to tune the Rborist function to improve its performance?

Appreciate your help!

suiji commented 4 years ago

Thank you for taking the time to frame the problem and submit a report, which helps to narrow down the problem considerably. This looks like a good one. Please stand by.

suiji commented 4 years ago

Response column "Diagnosis" takes integer values, which are treated as numeric. Hence the package builds a regression model. The randomForest package issues a warning to this effect. Here are a couple of comments based on the approach:

i) If the response is cast to categorical using as.factor(), then the misprediction rates are actually lower using Rborist than using randomForest, for example.

ii) If you really do want a regression model, though, accuracy can be notched up a bit using the predFixed option, known as mtry in other packages. Similar symptoms were also identified in a recent survey comparing a variety of open-source regression packages: https://doi.org/10.1109/ACCESS.2019.2933261

The default method of sampling predictors for regression splitting candidates employs Bernoulli sampling with a prespecified probability. The approach seems to work well as the number of predictors grows, but you and others have observed problems with narrower data sets. This is likely due to the high variance in the number of candidates yielded. Our plan is to mimic randomForest at lower predictor count with a default predFixed computation.
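As a base-R sketch of that variance (the probability and counts here are purely illustrative, not the package's internal code): under Bernoulli sampling the number of candidates drawn at a node fluctuates, and on a narrow data set it can fall to zero or one, whereas a fixed draw always yields the same count.

```r
# Spread of the per-node candidate count under Bernoulli sampling versus a
# fixed draw, for a narrow data set (m = 9 predictors).
set.seed(1)
m <- 9
p <- 1 / 3                                             # illustrative per-predictor probability
bernoulli.count <- rbinom(10000, size = m, prob = p)   # varies from node to node
table(bernoulli.count)                                 # includes nodes with 0 or 1 candidates
fixed.count <- rep(floor(m / 3), 10000)                # predFixed-style: always exactly 3
```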

You should be able to obtain comparable results to randomForest using predFixed = floor(m / 3), where m is the number of predictors. I have been able to verify this locally using the value 4 for your data.
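For concreteness, a minimal sketch of both suggestions, assuming the data frame layout from the original post (predictors in columns 1 through 9, response in column 10):

```r
library(Rborist)

# (i) Cast the integer response to a factor so Rborist trains a classification forest.
rborist.cls <- Rborist(x = BreastCancer.train[, 1:9],
                       y = as.factor(BreastCancer.train[, 10]))

# (ii) For a regression forest, fix the per-node candidate count as randomForest does.
m <- 9                                            # number of predictors
rborist.reg <- Rborist(x = BreastCancer.train[, 1:9],
                       y = BreastCancer.train[, 10],
                       predFixed = floor(m / 3))   # the reply above also reports good results with 4
```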

Please give this a try.

flippercy commented 4 years ago

Thank you for the quick response! I will test it.

In addition, when working with a weighted dataset, should I set rowWeight = weightcolumn?

suiji commented 4 years ago

Depending on what "weightcolumn" is, the following might apply:

rowWeight is used to weight samples, for bagging.
predWeight is used to weight predictors, for candidate selection.
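For concreteness, a minimal sketch using the training frame from the original post; the weight vectors here are purely illustrative stand-ins for the question's weightcolumn:

```r
row.w  <- runif(nrow(BreastCancer.train))   # one weight per training row (stand-in for weightcolumn)
pred.w <- rep(1, 9)                         # one weight per predictor (uniform here)

fit <- Rborist(x = BreastCancer.train[, 1:9],
               y = as.factor(BreastCancer.train[, 10]),
               rowWeight  = row.w,    # biases which rows are drawn into each bag
               predWeight = pred.w)   # biases which predictors become split candidates
```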

flippercy commented 4 years ago

The model performs better after I converted the binary target variable to a factor; however, the monotonicity constraint did not work with a categorical target variable. Is there a way to build a monotonic classification random forest using Rborist?

Appreciate your help!

suiji commented 4 years ago

We do not support ordinal regression. No one has requested it, until now.

If you continue with your initial approach, treating factor levels as numeric, you may have success with monotonic regression by employing round-to-nearest on the predictions.
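A minimal sketch of that approach, assuming the integer-coded Diagnosis column from the original post; the yPred component used below to extract the numeric predictions is an assumption, not something stated in this thread:

```r
# Keep the response numeric so a regression forest is trained, then recover the
# class label by rounding each numeric prediction to the nearest integer level.
fit.reg  <- Rborist(x = BreastCancer.train[, 1:9], y = BreastCancer.train[, 10])
pred.reg <- predict(fit.reg, newdata = BreastCancer.test[, 1:9])
pred.lab <- round(pred.reg$yPred)   # 'yPred' component name is an assumption
```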

suiji commented 4 years ago

Just an update.

We have been able to improve the accuracy of classification with the last bolus of changes. Accuracy under autocompression still appears to lag, although only slightly. Although this was not the subject of the original posting, the thread will remain open until the accuracy issue is resolved.

flippercy commented 4 years ago

Thank you for the update!

suiji commented 4 years ago

FWIW, this is the only issue gating release of 0-2.4.

Accuracy does seem to improve when the 'minInfo' default is made 0.0. This change is therefore slated, bringing the default behaviour even closer to the Breiman-Cutler approach of splitting to purity.
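Until that default changes, the same behaviour can be requested explicitly; a one-line sketch using the minInfo argument named above:

```r
# Remove the information-gain threshold so nodes split to purity,
# matching the Breiman-Cutler behaviour described above.
fit <- Rborist(x = BreastCancer.train[, 1:9],
               y = as.factor(BreastCancer.train[, 10]),
               minInfo = 0.0)
```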

There is also a slight bias in the accuracy of classification under autocompression. This needs to be remedied before releasing.

suiji commented 3 years ago

We did find the cause of the autocompression bias. The implicit bit position was always being identified with the "proxy" bit used for unseen categories, rather than floating freely. The fix will be uploaded to 0-3.0, under development.

flippercy commented 3 years ago

Thank you for the work!

suiji commented 3 years ago

It turns out that the bias exhibited by autocompression was only one symptom of a deeper problem. The way we encode factor-valued splits introduces a bias under prediction with "new" values - that is, factor levels not encountered during training. This is due to a spurious correlation between the encoding and the branch sense, i.e., whether a left or right branch is taken.

There are several ways in which "new" or non-training values can be introduced during prediction and there are several ways to treat them. For a continuous predictor, an obvious example would be an "NA" in the new data set. For a factor-valued predictor, there might be a new factor level not present in the training set. There might also, though, be a factor level which was present in the training set but was not present in a given node during training: nodes are trained on recursively-partitioned subsets of the training set.

Several options come to mind for treating new values on prediction: randomization, criterion-negation and bailing. With randomization, taking the left or the right branch is equally likely. With criterion-negation, the branch taken is whichever one is implied by failing the splitting criterion. With bailing, prediction halts and a score associated with the splitting nonterminal is returned.

Bailing is intuitively appealing because it does not resort to guessing. Prediction simply gives up and returns the best estimate obtained so far, albeit one which is less precise than if terminating at a leaf. This is how we will handle NA's, as doing so only requires an inexpensive modification to prediction. Bailing is not practical for unseen factor levels, however, as training would be required to record, for each node, all levels visible when splitting that node.

Randomization also has its intuitive appeal and yields precise guesses. In the ensemble limit, the guesses should approximate the distribution of values obtained by bailing. Randomization incurs a performance penalty due to calling a random-variate generator, either multiple times during prediction or sporadically, in preallocated blocks. It also demands the same additional bookkeeping during training as does bailing.

We use criterion-negation to treat unseen factor levels. Training enumerates the levels taking, say, the left branch at prediction. All other levels, regardless of whether they were present when the node was split (i.e., trained), take the complementary branch at prediction. This affords a compact representation for the decision tree, as only the levels passing the criterion need to be encoded. In particular, the only query posed during prediction at a node is whether the factor level of the new data lies in the specified set. Note that either a given set of factor levels or its complement may be chosen to encode a branch.
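As a rough sketch of that membership test (plain R, not the package's internal bit-set representation):

```r
# Hypothetical node: the split stores only the levels that take the left branch.
node.leftLevels <- c("a", "c")

branchFor <- function(level, leftLevels) {
  # Criterion-negation: membership sends a level left; any other level,
  # including one never seen while training this node, takes the complementary branch.
  if (level %in% leftLevels) "left" else "right"
}

branchFor("a", node.leftLevels)  # "left"  : an encoded level
branchFor("z", node.leftLevels)  # "right" : an unseen level, negated to the other branch
```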

This scheme, while parsimonious and fast, can introduce spurious correlations of the sort cautioned against in basic classes on stochastic modeling. This can happen as a result of biases unwittingly introduced during training as, for example, occurred in the case of autocompression. It also appears likely to happen due to the way we order factor levels during splitting. We believe we can break these correlations by randomizing level encoding during training. In particular, if we randomly encode either a subset of splitting levels or its complement, then it should be possible to eliminate potential biases at prediction.
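A sketch of the proposed randomization, assuming the choice between a subset and its complement is a fair coin flip at training time (all names here are illustrative):

```r
# At training, randomly decide whether the stored set is the passing levels or
# their complement, recording the branch sense alongside it. Training-set levels
# still route correctly, while unseen levels are no longer tied to one fixed branch.
encodeSplit <- function(passingLevels, allLevels) {
  storeComplement <- runif(1) < 0.5
  if (storeComplement) {
    list(levels = setdiff(allLevels, passingLevels), senseFlipped = TRUE)
  } else {
    list(levels = passingLevels, senseFlipped = FALSE)
  }
}

encodeSplit(c("a", "c"), c("a", "b", "c", "d"))
```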

suiji commented 3 years ago

This appears to be resolved with the latest push of 0.3.

Please feel free to reopen if additional related problems are encountered.