imbs-hl / ranger

A Fast Implementation of Random Forests
http://imbs-hl.github.io/ranger/
776 stars 194 forks source link

Poisson regression #433

Open mayer79 opened 5 years ago

mayer79 commented 5 years ago

I remember darkly some issues about new objectives/gains choices for ranger. One very relevant one is Poisson regression for (pseudo-)counts. Why? It is one of the standard approaches to model number of claims in insurance and typically strongly reacts on a bad choice of loss/objective function due to low signal to noise ratio. Typically, one minimizes Poisson deviance, a quantity derived from Maximum-Likelihood estimation. It is implemented in XGBoost and LightGBM (I would refer to their specific implentations, especially how they deal with log-link), but I am not aware of any proper random forest implementation. So it would be extremely cool if ranger would offer this type of split rule.

As motivation, I have written an example with a publicly available data set with 50k+ rows. Here, choosing squared error loss leads to a negative performance with respect to the Poisson deviance. On the other hand, a very simple GLM with log-link maximizing the Poisson deviance has positive performance. The models are typically fitted by modelling the log expectation of the claim count divided by exposure with exposure as case weight.

In (my) package MetricsWeighted, the average Poisson deviance is implemented like this:

MetricsWeighted::deviance_poisson <- function(actual, predicted, w = NULL, ...)  {
    stopifnot(length(actual) == length(predicted), all(actual >= 
        0), all(predicted > 0))
    pos <- actual > 0
    predicted[pos] <- (actual * log(actual/predicted) - (actual - 
        predicted))[pos]
    weighted_mean(x = 2 * predicted, w = w, ...)
}

So, it is basically a weighted logarithmic difference.

Here the example with negative performance of ranger and positive performance of GLM on both the training and validation data.

library(insuranceData)
library(ranger)
library(MetricsWeighted)

data(dataCar)
head(dataCar)

# Calculate response (number of claims / exposure)
dataCar <- transform(dataCar, y = numclaims / exposure) 

# Data split
set.seed(324245)
n <- nrow(dataCar)
.in <- sample(nrow(dataCar), n * 0.8)
train <- dataCar[.in, ]
valid <- dataCar[-.in, ]

# Train models
form <- y ~ veh_value + veh_body + veh_age + gender + area + agecat
fit_rf <- ranger(form, data = train, case.weights = train$exposure)
fit_glm <- glm(form, data = train, weights = exposure, family = quasipoisson())

#=================================
# Performance on training data
#=================================

# Average Poisson deviance for intercept only / Null model
mu <- mean(train$y)
deviance_poisson(train$y, rep(mu, nrow(train)), train$exposure) # 0.823585

# Deviance worse than null model
deviance_poisson(train$y, fit_rf$predictions, train$exposure) # 0.8372331 
fit_rf # negative

# Deviance better than null model
deviance_poisson(train$y, fit_glm$fitted, train$exposure) # 0.7979

#=================================
# Same pic on validation data
#=================================

# Null model
deviance_poisson(valid$y, rep(mu, nrow(valid)), valid$exposure) # 0.8170347

# RF worse than null model
deviance_poisson(valid$y, predict(fit_rf, valid)$predictions, 
                 valid$exposure) # 0.8323863

# GLM better than null model
deviance_poisson(valid$y, predict(fit_glm, valid, type = "response"), 
                 valid$exposure) # 0.7946925
lorentzenchr commented 4 years ago

This would partly address/resolve #368.

lorentzenchr commented 4 years ago

@mnwright Can you point to the files or functions/classes where this feature would be implemented? In case someone wanted to try...

mnwright commented 4 years ago

The split rules for regression forests are here: https://github.com/imbs-hl/ranger/blob/578e2df497003f70b5ac144c02c341bfee8c021f/src/TreeRegression.cpp#L89

If you look e.g. at the "beta" split rule, you call findBestSplitBeta(), which in turn calls findBestSplitValueBeta() for every possible split variable.