imbs-hl / ranger

A Fast Implementation of Random Forests
http://imbs-hl.github.io/ranger/

Cannot specify the classes of a prediction outcome #654

Open HaloCollider opened 1 year ago

HaloCollider commented 1 year ago

I'm tackling a binary classification task where the dependent variable y is numeric (0 and 1) rather than a factor, which is convenient for the numeric calculations that follow. My problem is this:

The prediction returned by the model is an n by 2 data frame (or a similar data type), with each column holding the probability of one class, but the columns have no names. Importantly, the order of the columns does not necessarily match the "0 and 1" order, so I cannot simply take the second column as the probability of y = 1 in this binary classification case. I haven't figured out the logic behind this, so the order seems to be produced more or less at random.

Therefore, I want to ask whether there is a way to specify which class (0 or 1) each column of the prediction outcome refers to in a classification scenario. It would be great if we didn't have to convert y into a factor, because we do a lot of numeric calculations after predicting. Thanks.

HaloCollider commented 1 year ago

I think this can be a serious problem for classification. Luckily, our sample is very unbalanced, so we could easily see that the column order changed between models: some of them produced predictions exactly opposite to what we would expect if the order had stayed the same. It still took me a long time to find out, though.

mnwright commented 1 year ago

Could you please give a reproducible example of the problem?

stephematician commented 1 year ago

If the data are not a factor (assuming you are using the R interface), then the columns are ordered in the same order that the values first appear in the data (by row).

Using the R interface, the columns should have the correct names; however, this won't be obvious when using the C++ interface. I also don't believe this is documented.
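
To make the ordering rule above concrete, here is a minimal base-R sketch. It does not call ranger at all; it only assumes, as described above, that a non-factor response is ordered by first appearance. The vectors `y_a` and `y_b` are made-up illustrations.

```r
# Two numeric 0/1 responses that differ only in which value comes first.
y_a <- c(0, 0, 0, 1, 1)
y_b <- c(1, 0, 0, 0, 0)

# unique() keeps values in order of first appearance, which is the
# ordering described above for a non-factor response.
order_a <- unique(y_a)  # 0 1
order_b <- unique(y_b)  # 1 0

# Converting to a factor with explicit levels fixes the order up front,
# independent of which value happens to appear in the first row.
f_b <- factor(y_b, levels = c(0, 1))
levels(f_b)             # "0" "1"
```

If converting the response to a factor is acceptable, setting `levels` explicitly is one way to avoid depending on row order.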

krzyzinskim commented 1 year ago

I encountered the same problem. @HaloCollider, it's probably out of date by now but the order of the classes in the matrix of predicted probabilities can be found in your.model$forest$class.values (I think it's always in the right order).

And @mnwright, here is a small reproducible example:

library(ranger)

## 0 is first 
set.seed(123)
p <- 4
n <- 1000
X <- data.frame(matrix(rnorm(n*p), nrow = n))
y <- as.numeric(rowSums(X) > 0)

y[1:5] # [1] 0 0 0 1 1

model <- ranger(x=X,
                y=y,
                probability=TRUE)

prediction_probs <- predict(model, X)$predictions
prediction_probs[1:5, ]
#           [,1]        [,2]
# [1,] 0.9956444 0.004355556
# [2,] 0.9906111 0.009388889
# [3,] 0.8179349 0.182065079
# [4,] 0.0780381 0.921961905
# [5,] 0.3289381 0.671061905

model$forest$class.values # [1] 0 1

#### 

## 1 is first 
set.seed(42)
X <- data.frame(matrix(rnorm(n*p), nrow = n))
y <- as.numeric(rowSums(X) > 0)

y[1:5] # [1] 1 0 0 0 0

model <- ranger(x=X,
                y=y, 
                probability=TRUE)

prediction_probs <- predict(model, X)$predictions
prediction_probs[1:5, ]
#            [,1]       [,2]
# [1,] 0.96184603 0.03815397
# [2,] 0.04116032 0.95883968
# [3,] 0.12405714 0.87594286
# [4,] 0.03781984 0.96218016
# [5,] 0.18086905 0.81913095

model$forest$class.values # [1] 1 0

I've found here that the matrix is only given column names when forest$levels is not NULL (and it is NULL for a non-factor response, see the related resolved issue). Perhaps it's worth naming the columns based on forest$class.values, which is always non-empty?
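
Until something like that is in place, a user-side workaround along the lines of the suggestion above might look like the sketch below. The helper `name_prob_columns` is hypothetical (it is not part of ranger), and it is meant for the non-factor response case discussed here. The demo matrix is stand-in data copied from the first two rows of the second example, so the snippet runs without fitting a model.

```r
# Hypothetical helper (not part of ranger): label the columns of a
# probability matrix with the class values stored in the fitted forest.
name_prob_columns <- function(probs, class_values) {
  stopifnot(ncol(probs) == length(class_values))
  colnames(probs) <- as.character(class_values)
  probs
}

# Stand-in data copied from the second example above, where
# model$forest$class.values was 1 0.
probs <- matrix(c(0.96184603, 0.03815397,
                  0.04116032, 0.95883968),
                nrow = 2, byrow = TRUE)
named <- name_prob_columns(probs, c(1, 0))

# Select P(y = 1) by name rather than by position.
p_one <- named[, "1"]   # 0.96184603 0.04116032

# With a fitted model this would be:
# named <- name_prob_columns(predict(model, X)$predictions,
#                            model$forest$class.values)
```

Selecting by name keeps downstream numeric code correct regardless of which class happened to appear first in the training data.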