imbs-hl / ranger

A Fast Implementation of Random Forests
http://imbs-hl.github.io/ranger/

Ranger returning NaN model predictions in some situations #201

Closed JohnMount closed 5 years ago

JohnMount commented 7 years ago

I see Ranger returning NaN model predictions in some situations.

This reproduces reliably with some confidential client data, but I don't have a sharable example. In some cases ranger returns NaN predictions in probability classification mode when scoring training data that is entirely numeric and contains no NAs/NaNs. I assume this has something to do with the following from ranger/src/ForestProbability.cpp (and similar code in ForestClassification.cpp and ForestRegression.cpp):

```cpp
// ranger/src/ForestProbability.cpp, lines 179-181
for (size_t j = 0; j < predictions[0][i].size(); ++j) {
  predictions[0][i][j] = NAN;
}
```

This code looks, very roughly, like it handles some sort of "tie breaking" or "no trees applied" situation.

I am wondering if the fix is to set the vote to zero or the training prevalence probability instead of NaN in this case.

In some cases only one class probability goes to NaN. In other cases both class probabilities go to NaN, and then the reported in-sample performance and confusion matrix are NaN-ed out as well.

This is all from calling ranger through R with the current CRAN version of ranger (0.7.0); we saw the same behavior in the development version 0.7.2.

mnwright commented 7 years ago

For out-of-bag predictions this is expected behaviour: There are no OOB predictions possible if an observation is in-bag in all trees. The only way to avoid this is to increase the number of trees.
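
To see which observations are affected, here is a minimal sketch (iris as stand-in data) that counts, for each observation, the number of trees in which it is out-of-bag; it assumes the model was fit with keep.inbag = TRUE:

```r
library(ranger)

# Fit with keep.inbag = TRUE so the per-tree in-bag counts are stored
rf <- ranger(Species ~ ., data = iris, num.trees = 50,
             probability = TRUE, keep.inbag = TRUE)

# inbag.counts is a list with one integer vector (length n) per tree
inbag <- simplify2array(rf$inbag.counts)  # n x num.trees matrix
oob_trees <- rowSums(inbag == 0)          # number of trees where each obs is OOB

which(oob_trees == 0)  # these rows get NaN out-of-bag predictions
```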

If only one class probability is NaN, it seems to be a different problem. Could you provide a reproducible example for this?

JohnMount commented 7 years ago

I'll see if I can cut down an example that the client will approve for release. Thanks for the help. Is there a control that determines if in-bag or out-of-bag predictions are made?

cole-brokamp commented 7 years ago

I've also seen this problem with larger datasets, but had problems reproducing it with a dummy dataset. After drilling down a little further using predict.all=TRUE it turns out that only one tree is predicting NaN. I'm not sure why an individual tree would predict NaN, but this in turn causes the predictions averaged across all trees to be NaN.

This is not the expected behavior if the problem were indeed related to the number of trees: when I increase the number of trees, the number of NaN predictions increases too.

Also, if one of the predictions is NaN, then the variable importance measures, as well as the OOB R-squared and MSE, are NaN. My workaround has been to use predict.all=TRUE and then take the rowMeans with na.rm=TRUE to calculate the ensemble prediction, but this requires significant extra memory.
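
Roughly, the workaround looks like this for a regression forest (rf and newdata are placeholders):

```r
# Average the per-tree predictions ourselves, skipping any NaN trees
pred_all <- predict(rf, data = newdata, predict.all = TRUE)

# For regression, $predictions is an n x num.trees matrix of per-tree predictions
ensemble <- rowMeans(pred_all$predictions, na.rm = TRUE)
```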

cole-brokamp commented 7 years ago

More updates: I can verify that for predictions returning NaN, there are plenty of OOB trees to use for the prediction.

In the documentation, it says that "nodes with size smaller than min.node.size can occur". Is it possible to have a terminal node with size zero? Increasing min.node.size reduces the number of NaN predictions, and vice versa.

I have also found that all of the NaN predictions correspond to trees that contain NaN split values. Is this expected behavior?

mnwright commented 7 years ago

Thanks. It seems to boil down to NaN split values. Is it happening for probability prediction only? Could you try to simulate data similar to your actual data to reproduce?

> Is there a control that determines if in-bag or out-of-bag predictions are made?

It's always out-of-bag or a new dataset with predict(). No in-bag predictions are used.
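
For illustration, a minimal sketch of the distinction (iris as stand-in data):

```r
library(ranger)

rf <- ranger(Sepal.Length ~ ., data = iris, num.trees = 500)

rf$predictions                         # out-of-bag predictions on the training data
predict(rf, data = iris)$predictions   # all-tree predictions on a supplied dataset
```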

cole-brokamp commented 7 years ago

I've only tested ranger with regression random forests and the problem occurs when predicting continuous outcomes. I will try to simulate some data to see if I can create a reproducible problem. Thanks again.

apratap commented 7 years ago

FYI - I ran into a similar issue with regression random forests.

markroepke commented 7 years ago

I've also run into this issue with the regression method of random forests.

mnwright commented 5 years ago

Please reopen if problem occurs again.

cjvanlissa commented 5 years ago

I am experiencing this issue. Is there a fix on the way?

mnwright commented 5 years ago

I'd like to fix it but still cannot reproduce. Do you have a reproducible example?

Titan100 commented 5 years ago

I have the same problem, i.e. it returns NaN, for Penalized Discriminant Analysis with this: predict(pda, test.set[,-1], type = "prob")

mnwright commented 5 years ago

There is no type = "prob" in ranger. Maybe you are confusing packages?

bunnell commented 5 years ago

Looks like we're seeing this issue as well. It surfaces as a substantial number of NAs reported for one class. Our data set is highly imbalanced, with 2377 cases in one class, 297 in the other, and 16,571 features per case. Submitting the whole dataset as-is runs well and does not create any NAs. Of course, it produces useless results (it misclassifies 72% of the smaller class's cases).

I tried to improve the sensitivity by setting sample.fraction = c(0.1, 0.9). This is when it begins to generate NAs in the output.

mnwright commented 5 years ago

@bunnell Is that for the out-of-bag predictions or with new data? For out-of-bag and modified sample.fraction some NAs are expected because the observations with high sample.fraction might be in-bag in all trees and thus cannot have an out-of-bag prediction. Again, a reproducible example would help.

kmishra9 commented 5 years ago

Same happening for me -- running in regression mode. I've also fit with the randomForest package and had no issues there, so that's interesting. I also looked directly at the rows in the dataset for which there was a NaN prediction, and nothing jumped out at me as unusual. For context, I was referencing the predictions like so: model_6$predictions, which I assume is the training data predictions.

This was disappointing because I was using ranger for its ability to do weighted regression, which is regrettably absent from other packages.

mnwright commented 5 years ago

> For context, I was referencing the predictions like so: model_6$predictions, which I assume is the training data predictions.

No, these are the out-of-bag predictions.

Still, we need a reproducible example.

portolan75 commented 3 years ago

Hi @mnwright, I think I have a reproducible example.

Took me a while, because it happens when dealing with imbalanced data, once the imbalance ratio is below a certain threshold. For me it's a pity it's not working: I'm very happy with the speed of ranger, but unfortunately I can't use it because of this problem. The general idea is to balance the sample within the chosen sample size.

Here's the reproducible example:

```r
library(ranger)

# Data for reproducible example
data(mtcars)
table(mtcars$am)
class(mtcars)

# Create a heavy class imbalance situation (can be even heavier in reality);
# make am the variable to predict, as a factor (classification)
mtcars$am[1:30] <- 0
mtcars$am <- as.factor(mtcars$am)
table(mtcars$am)

# Calculate weights to assign to each obs. to re-balance the sample to 50% per class
# (think of it when it's really class-imbalanced and the weights are more extreme)
w_am <- 1 / table(mtcars$am)
w_am <- w_am / sum(w_am)
weights_am <- ifelse(mtcars$am %in% 0, w_am[1], w_am[2])

# Replicate 20 times
OOB_RMSE_W <- vector(mode = "numeric", length = 20)
for (i in seq_along(OOB_RMSE_W)) {
  # train model
  optimal_ranger_w <- ranger(
    formula         = am ~ .,
    data            = mtcars,
    num.trees       = 500,
    mtry            = 7,
    min.node.size   = 3,
    sample.fraction = .8,
    splitrule       = "gini",
    classification  = TRUE,
    probability     = TRUE,
    importance      = "impurity",
    replace         = TRUE,
    case.weights    = weights_am,
    keep.inbag      = TRUE
  )

  # add OOB error to grid
  OOB_RMSE_W[i] <- sqrt(optimal_ranger_w$prediction.error)
}

which(is.na(optimal_ranger_w$predictions))
```

mnwright commented 2 years ago

That is expected behavior: You up-weight the observations from the minority class so much that they are selected in the bootstrap samples of all trees. Because of that you cannot calculate OOB predictions for these observations (they are never OOB).

Prediction will work as usual; you just cannot use OOB estimation if you have quite extreme case.weights.
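
You can verify this with the in-bag information already stored in the model from the example above (it was fit with keep.inbag = TRUE); a quick sketch:

```r
# The NaN rows should be exactly those that are in-bag in every tree
inbag <- simplify2array(optimal_ranger_w$inbag.counts)  # n x num.trees matrix
never_oob <- rowSums(inbag == 0) == 0                   # TRUE if never out-of-bag

which(never_oob)
which(is.na(optimal_ranger_w$predictions[, 1]))  # should match the line above
```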

portolan75 commented 2 years ago

Hi @mnwright, thanks for your answer, but I think this is a scientifically valid case called Balanced Random Forest. The idea is that one can sample from the entire minority class in the case of imbalanced datasets. I have already found myself in a couple of situations where this happens (fraud detection datasets). In my last case I had nearly 250,000 cases, of which circa 500 were frauds. So if I wanted to balance every bootstrap sample by selecting 500 cases from the minority class with replacement (so that not all 500 end up in the sample), how else should I do it? (In my case the weights were even more extreme than the ones in the reproducible example.) And if I am forced to sample even fewer than 500, how can I have representativity (considering the non-frauds number 250,000)?

I found inspiration in this article from Berkeley (section 2.2), where they describe the technique: https://statistics.berkeley.edu/sites/default/files/tech-reports/666.pdf. I can imagine this might create some issues when it comes to the OOB calculations, but maybe add a sort of trade-off option: if one wants to balance the sample, one loses the OOB validation? One can always do that on a separate validation set.

Please also notice that in my example if you use min.node.size = 2 (instead of 3), it works.

mnwright commented 2 years ago

I didn't want to imply that this is an unlikely case or that it doesn't make sense. It's just expected not to get OOB estimates when doing extreme oversampling (as explained above). Other options are undersampling and/or cost-sensitive learning. For undersampling, you could play around with case.weights and sample.fraction, or use a vector for sample.fraction as explained in #480 (that's the easier way). For cost-sensitive learning, use class.weights.
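
A rough sketch of both alternatives, assuming a data frame df with a binary factor outcome y whose second level is the rare class (df, y, and the numeric values are placeholders; see #480 for the exact semantics of a vector-valued sample.fraction):

```r
library(ranger)

# Undersampling via class-specific sampling: sample.fraction takes one
# fraction per factor level of y (see #480 for details)
rf_under <- ranger(y ~ ., data = df, probability = TRUE,
                   replace = FALSE, sample.fraction = c(0.02, 0.5))

# Cost-sensitive learning: weight the classes in the splitting rule,
# in the order of the factor levels of y
rf_cost <- ranger(y ~ ., data = df, probability = TRUE,
                  class.weights = c(1, 50))
```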