imbs-hl / ranger

A Fast Implementation of Random Forests
http://imbs-hl.github.io/ranger/

Ranger returning NaN model predictions in some situations #201

Closed JohnMount closed 5 years ago

JohnMount commented 7 years ago

I see Ranger returning NaN model predictions in some situations.

This reproduces reliably with some confidential client data, but I don't have a sharable example. In some cases ranger returns NaN predictions in probability classification mode when scoring training data that is entirely numeric and contains no NAs/NaNs. I assume this has something to do with the following from ranger/src/ForestProbability.cpp (and similar code in ForestClassification.cpp and ForestRegression.cpp):

```cpp
// ranger/src/ForestProbability.cpp, lines 179-181
for (size_t j = 0; j < predictions[0][i].size(); ++j) {
  predictions[0][i][j] = NAN;
}
```

This code looks, very roughly, like it handles some sort of "tie breaking" or "no trees applied" situation.

I am wondering if the fix is to set the vote to zero or the training prevalence probability instead of NaN in this case.

In some cases only one class probability goes to NaN. In other cases both class probabilities go to NaN, and then the reported in-sample performance and confusion matrix are NaN-ed out as well.

This is all from calling ranger through R with the current CRAN version of ranger (0.7.0); we saw the same behavior in the development version 0.7.2.

mnwright commented 7 years ago

For out-of-bag predictions this is expected behaviour: There are no OOB predictions possible if an observation is in-bag in all trees. The only way to avoid this is to increase the number of trees.
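
To see which observations are affected, here is a minimal sketch (iris as stand-in data) that counts, for each observation, the number of trees in which it is out-of-bag; it assumes the model was fit with keep.inbag = TRUE:

```r
library(ranger)

# Fit with keep.inbag = TRUE so the per-tree in-bag counts are stored
rf <- ranger(Species ~ ., data = iris, num.trees = 50,
             probability = TRUE, keep.inbag = TRUE)

# inbag.counts is a list with one integer vector (length n) per tree
inbag <- simplify2array(rf$inbag.counts)  # n x num.trees matrix
oob_trees <- rowSums(inbag == 0)          # number of trees where each obs is OOB

which(oob_trees == 0)  # these rows get NaN out-of-bag predictions
```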

If only one class probability is NaN, it seems to be a different problem. Could you provide a reproducible example for this?

JohnMount commented 7 years ago

I'll see if I can cut down an example that the client will approve for release. Thanks for the help. Is there a control that determines if in-bag or out-of-bag predictions are made?

cole-brokamp commented 7 years ago

I've also seen this problem with larger datasets, but had problems reproducing it with a dummy dataset. After drilling down a little further using predict.all=TRUE it turns out that only one tree is predicting NaN. I'm not sure why an individual tree would predict NaN, but this in turn causes the predictions averaged across all trees to be NaN.

This is not the expected behavior if the problem were indeed related to the number of trees: when I increase the number of trees, the number of NaN predictions increases too.

Also, if one of the predictions is NaN, then the variable importance measures, as well as the OOB R-squared and MSE, are NaN. My workaround has been to use predict.all=TRUE and then take the rowMeans with na.rm=TRUE to calculate the ensemble prediction, but this requires significant extra memory.
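
Roughly, the workaround looks like this for a regression forest (rf and newdata are placeholders):

```r
# Average the per-tree predictions ourselves, skipping any NaN trees
pred_all <- predict(rf, data = newdata, predict.all = TRUE)

# For regression, $predictions is an n x num.trees matrix of per-tree predictions
ensemble <- rowMeans(pred_all$predictions, na.rm = TRUE)
```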

cole-brokamp commented 7 years ago

More updates: I can verify that for predictions returning NaN, there are plenty of OOB trees to use for the prediction.

In the documentation, it says that "nodes with size smaller than min.node.size can occur". Is it possible to have a terminal node with size zero? Increasing min.node.size reduces the number of NaN predictions, and vice versa.

I have also found that all of the NaN predictions correspond to trees that contain NaN split values. Is this expected behavior?

mnwright commented 7 years ago

Thanks. It seems to boil down to NaN split values. Is it happening for probability prediction only? Could you try to simulate data similar to your actual data to reproduce?

> Is there a control that determines if in-bag or out-of-bag predictions are made?

It's always out-of-bag or a new dataset with predict(). No in-bag predictions are used.
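
For illustration, a minimal sketch of the distinction (iris as stand-in data):

```r
library(ranger)

rf <- ranger(Sepal.Length ~ ., data = iris, num.trees = 500)

rf$predictions                         # out-of-bag predictions on the training data
predict(rf, data = iris)$predictions   # all-tree predictions on a supplied dataset
```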

cole-brokamp commented 7 years ago

I've only tested ranger with regression random forests and the problem occurs when predicting continuous outcomes. I will try to simulate some data to see if I can create a reproducible problem. Thanks again.

apratap commented 7 years ago

FYI - I ran into a similar issue with regression random forests.

markroepke commented 7 years ago

I've also run into this issue with the regression method of random forests.

mnwright commented 5 years ago

Please reopen if problem occurs again.

cjvanlissa commented 5 years ago

I am experiencing this issue. Is there a fix on the way?

mnwright commented 5 years ago

I'd like to fix it but still cannot reproduce. Do you have a reproducible example?

Titan100 commented 5 years ago

I have the same problem, i.e. it returns NaN, for Penalized Discriminant Analysis with this: predict(pda, test.set[,-1], type = "prob")

mnwright commented 5 years ago

There is no type = "prob" in ranger. Maybe you are confusing packages?

bunnell commented 5 years ago

Looks like we're seeing this issue as well. It surfaces as a substantial number of NAs reported for one class. Our data set is highly imbalanced, with 2377 cases in one class, 297 in the other, and 16,571 features per case. Submitting the whole dataset as-is runs well and does not create any NAs. Of course, it produces useless results (it misclassifies 72% of the smaller class's cases).

I tried to improve the sensitivity by setting sample.fraction = c(0.1, 0.9). This is when it begins to generate NAs in the output.

mnwright commented 5 years ago

@bunnell Is that for the out-of-bag predictions or with new data? For out-of-bag and modified sample.fraction some NAs are expected because the observations with high sample.fraction might be in-bag in all trees and thus cannot have an out-of-bag prediction. Again, a reproducible example would help.

kmishra9 commented 5 years ago

Same happening for me -- running in regression mode. I've also fit with the randomForest package and had no issues there, so that's interesting. I also looked directly at the rows in the dataset for which there was a NaN prediction, and nothing jumped out at me as unusual. For context, I was referencing the predictions like so: model_6$predictions, which I assume is the training data predictions.

This was disappointing because I was using ranger for its ability to do weighted regression, which is regrettably absent from other packages.

mnwright commented 5 years ago

> For context, I was referencing the predictions like so: model_6$predictions, which I assume is the training data predictions.

No, these are the out-of-bag predictions.

Still, we need a reproducible example.

portolan75 commented 3 years ago

Hi @mnwright, I think I have a reproducible example.

Took me a while, because it happens when dealing with imbalanced data, once the imbalance ratio is below a certain threshold. For me it's a pity it's not working: I'm very happy with the speed of ranger, but unfortunately I can't use it because of this problem. The general idea is to balance the sample within the chosen sample size.

Here's the reproducible example:

```r
library(ranger)

# Data for reproducible example
data(mtcars)
table(mtcars$am)
class(mtcars)

# Create a heavy class imbalance situation (can be even heavier in reality);
# make am the variable to predict, as a factor (classification)
mtcars$am[1:30] <- 0
mtcars$am <- as.factor(mtcars$am)
table(mtcars$am)

# Calculate weights to assign to each obs. to re-balance the sample to 50% per class
# (think of it when it's really class-imbalanced and the weights are more extreme)
w_am <- 1 / table(mtcars$am)
w_am <- w_am / sum(w_am)
weights_am <- ifelse(mtcars$am %in% 0, w_am[1], w_am[2])

# Replicate 20 times
OOB_RMSE_W <- vector(mode = "numeric", length = 20)
for (i in seq_along(OOB_RMSE_W)) {
  # train model
  optimal_ranger_w <- ranger(
    formula         = am ~ .,
    data            = mtcars,
    num.trees       = 500,
    mtry            = 7,
    min.node.size   = 3,
    sample.fraction = .8,
    splitrule       = "gini",
    classification  = TRUE,
    probability     = TRUE,
    importance      = "impurity",
    replace         = TRUE,
    case.weights    = weights_am,
    keep.inbag      = TRUE
  )

  # add OOB error to grid
  OOB_RMSE_W[i] <- sqrt(optimal_ranger_w$prediction.error)
}

which(is.na(optimal_ranger_w$predictions))
```

mnwright commented 2 years ago

That is expected behavior: You up-weight the observations from the minority class so much that they are selected in the bootstrap samples of all trees. Because of that you cannot calculate OOB predictions for these observations (they are never OOB).

Prediction will work as usual; you just cannot use OOB estimation if you have quite extreme case.weights.
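
You can verify this with the in-bag information already stored in the model from the example above (it was fit with keep.inbag = TRUE); a quick sketch:

```r
# The NaN rows should be exactly those that are in-bag in every tree
inbag <- simplify2array(optimal_ranger_w$inbag.counts)  # n x num.trees matrix
never_oob <- rowSums(inbag == 0) == 0                   # TRUE if never out-of-bag

which(never_oob)
which(is.na(optimal_ranger_w$predictions[, 1]))  # should match the line above
```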

portolan75 commented 2 years ago

Hi @mnwright, thanks for your answer, but I think this is a scientifically valid case called Balanced Random Forest. The idea is that one can sample from the entire minority class in the case of imbalanced datasets. I have already found myself in a couple of situations where this happens (fraud detection datasets). In my last case I had nearly 250,000 cases, of which circa 500 were frauds. So if I wanted to balance every bootstrap sample by selecting 500 cases from the minority class with replacement (so that not all 500 end up in the sample), how else should I do it? (In my case the weights were even more extreme than the ones in the reproducible example.) And if I am forced to sample even fewer than 500, how can I have representativity (considering the non-frauds number 250,000)?

I found inspiration in this article from Berkeley (section 2.2), where they describe the technique: https://statistics.berkeley.edu/sites/default/files/tech-reports/666.pdf. I can imagine this might create some issues when it comes to the OOB calculations, but maybe add a sort of trade-off option: if one wants to balance the sample, one loses the OOB validation? One can always do that on a separate validation set.

Please also notice that in my example if you use min.node.size = 2 (instead of 3), it works.

mnwright commented 2 years ago

I didn't want to imply that this is an unlikely case or that it doesn't make sense. It's just expected not to get OOB estimates when doing extreme oversampling (as explained above). Other options are undersampling and/or cost-sensitive learning. For undersampling, you could play around with case.weights and sample.fraction, or use a vector for sample.fraction as explained in #480 (that's the easier way). For cost-sensitive learning, use class.weights.
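
A rough sketch of both alternatives, assuming a data frame df with a binary factor outcome y whose second level is the rare class (df, y, and the numeric values are placeholders; see #480 for the exact semantics of a vector-valued sample.fraction):

```r
library(ranger)

# Undersampling via class-specific sampling: sample.fraction takes one
# fraction per factor level of y (see #480 for details)
rf_under <- ranger(y ~ ., data = df, probability = TRUE,
                   replace = FALSE, sample.fraction = c(0.02, 0.5))

# Cost-sensitive learning: weight the classes in the splitting rule,
# in the order of the factor levels of y
rf_cost <- ranger(y ~ ., data = df, probability = TRUE,
                  class.weights = c(1, 50))
```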