topepo / caret

caret (Classification And Regression Training) is an R package containing miscellaneous functions for training and plotting classification and regression models.
http://topepo.github.io/caret/index.html

offset in xgbtree poisson regression #861

Closed SimonCoulombe closed 6 years ago

SimonCoulombe commented 6 years ago

Hi!

I am piggybacking on issue #507 (https://github.com/topepo/caret/issues/507), which asked about both offsets and early stopping; the answer was that one would need to create a custom method.

I am re-opening this question because it appears that offsets ("base_margin") are used by caret when we feed caret::train an "xgb.DMatrix" object. At least, the predictions are strongly affected by the offsets.

However, it appears that the offsets are not used when saving the out-of-fold predictions generated by k-fold cross-validation.

If that is correct, is there a way to recover them? Maybe by multiplying the result by offset / mean(offset)?

Here is some example code, largely inspired by the code in issue #507.


library(xgboost)
library(dplyr)
library(caret)
library(insuranceData) # example dataset https://cran.r-project.org/web/packages/insuranceData/insuranceData.pdf
library(pdp) # partial dependency plots
library(lime) # Local Interpretable Model-Agnostic Explanations
library(gridExtra)
set.seed(123)
data(dataCar)
mydb <- dataCar %>% select(clm, exposure, veh_value, veh_body,
                           veh_age, gender, area, agecat)

label_var <- "clm"  
offset_var <- "exposure"
feature_vars <- mydb %>% 
  select(-one_of(c(label_var, offset_var))) %>% 
  colnames()

# preparing data for xgboost (one-hot encoding of categorical (factor) data)
myformula <- paste0( "~", paste0( feature_vars, collapse = " + ") ) %>% as.formula()
dummyFier <- caret::dummyVars(myformula, data=mydb, fullRank = TRUE)
dummyVars.df <- predict(dummyFier,newdata = mydb)
mydb_dummy <- cbind(mydb %>% select(one_of(c(label_var, offset_var))), 
                    dummyVars.df)
rm(myformula, dummyFier, dummyVars.df)

feature_vars_dummy <-  mydb_dummy  %>% select(-one_of(c(label_var, offset_var))) %>% colnames()

mydb_xgbmatrix <- xgb.DMatrix(
  data = mydb_dummy %>% select(one_of(feature_vars_dummy)) %>% as.matrix(), 
  label = mydb_dummy %>% pull(label_var),
  missing = NA)

#first issue with caret: specifying a base margin to the matrices
#base_margin: base margin is the base prediction Xgboost will boost from 
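# (for count:poisson, which uses a log link, log(exposure) as base_margin acts
#  as a classic offset: predicted counts scale proportionally with exposure)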
setinfo(mydb_xgbmatrix,"base_margin", mydb %>% pull(offset_var) %>% log() )

myConstraint <- tibble(Variable = feature_vars_dummy) %>%
  mutate(direction = ifelse(Variable == "veh_value", 1, 0))

myParam <- list(
  booster = "gbtree",
  eta = 0.01,
  gamma = 0.001,
  max_depth = 3,
  min_child_weight = 1,
  subsample = 0.8,
  colsample_bytree = 0.8,
  monotone_constraints = myConstraint$direction,
  objective = 'count:poisson',
  eval_metric = "poisson-nloglik")

xgb.train.booster <- xgb.train(
  params = myParam, 
  data = mydb_xgbmatrix, 
  nrounds = 300)

#k-fold cross validation:
trControl = caret::trainControl(
  method = 'cv',
  number = 3,
  verboseIter = TRUE,
  allowParallel = TRUE, 
  savePredictions=TRUE) # save out of fold predictions.
# WARNING: the out-of-fold predictions are not in order; use rowIndex to sort them.

# create the tuning grid.
tuneGridXGB <- expand.grid(
  nrounds=c(300),
  eta = c(0.01),
  max_depth = c(3),
  gamma = c(0.001),
  colsample_bytree = c(0.8),
  min_child_weight = c(1),
  subsample = c(0.8))

# train the xgboost learner
caret.booster <- caret::train(
  x = mydb_xgbmatrix, 
  y = mydb_dummy %>% pull(label_var),
  method = 'xgbTree',
  objective = 'count:poisson',
  eval_metric = "poisson-nloglik",
  monotone_constraints= myConstraint$direction,
  trControl = trControl,
  tuneGrid = tuneGridXGB)

mydb$pred <- predict(xgb.train.booster, newdata = mydb_xgbmatrix) #xgb.train preds
mydb$pred_caret <- predict(caret.booster, newdata = mydb_xgbmatrix) #caret preds
mydb$oof_pred_caret <- caret.booster$pred %>% arrange(rowIndex) %>% pull(pred) #out of fold preds

p1 <- ggplot(mydb) +
  geom_point(aes(x=exposure, y= pred), alpha = 1/150)

p2 <- ggplot(mydb) +
  geom_point(aes(x=exposure, y= pred_caret), alpha = 1/150)

p3 <- ggplot(mydb) +
  geom_point(aes(x=exposure, y= oof_pred_caret), alpha = 1/150)

p4 <- ggplot(mydb) +
  geom_point(aes(x=exposure, y= oof_pred_caret * exposure/ mean(exposure)), alpha = 1/150)

grid.arrange(p1 , p2 , p3, p4, ncol=2)
o1iv3r commented 6 years ago

One can look at the caret implementation of xgboost via

caret::getModelInfo("xgbTree", FALSE)[[1]]

I do not see any offset being used explicitly. But one idea might be to define your own xgboost method that takes an additional parameter:

my_xgbTree <- caret::getModelInfo("xgbTree", FALSE)[[1]]
my_xgbTree$fit <- function(x, y, wts, param, lev, last, classProbs, ...) {
  # same code as caret::getModelInfo("xgbTree", FALSE)[[1]]$fit, but with these additional lines:
  theDots <- list(...)
  offset <- theDots$offset
  if (!is.null(offset))
    xgboost::setinfo(x, 'base_margin', offset)
}

offset is a parameter that you provide to train(), so it will be included in theDots:

train(..., method = my_xgbTree, offset = ..., ...)
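
For instance, here is a minimal, untested sketch of that idea: rather than copying the whole fit body, wrap the stock fit function, attach the offset as a base_margin, and drop it from the dots so it is not forwarded to xgb.train (this assumes x arrives as an xgb.DMatrix whose rows line up with the supplied offset):

# sketch only: wrap caret's stock xgbTree fit and set the base margin before delegating
stock_fit <- caret::getModelInfo("xgbTree", FALSE)[[1]]$fit

my_xgbTree$fit <- function(x, y, wts, param, lev, last, classProbs, ...) {
  theDots <- list(...)
  offset <- theDots$offset
  theDots$offset <- NULL                      # do not forward an unknown argument to xgb.train
  if (!is.null(offset) && inherits(x, "xgb.DMatrix"))
    xgboost::setinfo(x, "base_margin", offset)
  do.call(stock_fit,
          c(list(x = x, y = y, wts = wts, param = param,
                 lev = lev, last = last, classProbs = classProbs),
            theDots))
}

# usage, e.g.: caret::train(..., method = my_xgbTree, offset = log(mydb$exposure), ...)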

I'd suggest trying the methods on simulated data so that you know whether it is working or not.

SimonCoulombe commented 6 years ago

Thanks, it took me longer than I care to admit, but I got it :)

o1iv3r commented 6 years ago

After trying this out myself, I realized that this is not right because the offset is not sorted correctly (rows are permuted by the resampling). I did not find any solution other than using weights ("wts") to handle the offset. Unfortunately there seems to be no better option in caret :(
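
For what it's worth, a rough sketch of that weights route, reusing the objects from the example above and assuming caret's xgbTree module passes case weights through to xgboost (for a log-link Poisson model, a count with offset log(E) is equivalent to the rate count/E fitted with case weight E):

# sketch only: reformulate the offset as case weights on the claim rate
exposure_wt <- mydb_dummy %>% pull(offset_var)
claim_rate  <- mydb_dummy %>% pull(label_var) / exposure_wt

caret_wts <- caret::train(
  x = mydb_dummy %>% select(one_of(feature_vars_dummy)),
  y = claim_rate,                # model the rate instead of the count
  weights = exposure_wt,         # exposure enters as a case weight
  method = "xgbTree",
  objective = "count:poisson",
  eval_metric = "poisson-nloglik",
  trControl = trControl,
  tuneGrid = tuneGridXGB)

# predicted counts are then predict(caret_wts, newdata = ...) * exposure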

I suggest checking your results against a simple Poisson GLM, which needs no tuning. If train() and glm() do indeed give the same parameters, I'd be happy to see your solution :)
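
Something like this, as a sketch of that check (glm() will warn about non-integer responses for the rate version, which is expected; the two coefficient vectors should match):

# sketch only: offset formulation vs. the equivalent weighted-rate formulation
glm_offset  <- glm(clm ~ veh_value + agecat, data = dataCar,
                   family = poisson(link = "log"),
                   offset = log(exposure))
glm_weights <- glm(I(clm / exposure) ~ veh_value + agecat, data = dataCar,
                   family = poisson(link = "log"),
                   weights = exposure)
cbind(coef(glm_offset), coef(glm_weights))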

topepo commented 6 years ago

The newer modeling system that we're developing would be able to handle this case. recipes would allow you to carry the offset variable throughout the resampling process. This would make a good case study for that.

I'm at a conference right now, but I'll try to use your example to demonstrate how this works in a vignette. The new system doesn't have parameter tuning implemented yet though...
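
As a very rough sketch of how that could look with recipes (illustrative only; the "offset" role here is just a label, and step/selector names may differ across versions):

# sketch only: tag exposure with a non-predictor role so it is carried
# through resampling without being used as a feature
library(recipes)
rec <- recipe(clm ~ ., data = mydb) %>%
  update_role(exposure, new_role = "offset") %>%
  step_dummy(all_nominal(), -all_outcomes())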

topepo commented 6 years ago

Out of curiosity, why use a Poisson assumption when the data are binary? Did you mean to use numclaims?

Also, can you explain what the purpose of myConstraint is?

SimonCoulombe commented 6 years ago

Hi Max, two good questions :)

Indeed, I should have used numclaims; my bad. I made that mistake because of the way another dataset I use is structured.

The monotone_constraints does not help in this example because there is no reason for the frequency of crashes to increase with vehicle value. It would be very useful if there were a "miles driven per week" variable instead, or if we were modelling the severity (dollar value) of the claims.

Best, Simon

SimonCoulombe commented 6 years ago

The newer modeling system that we're developing would be able to handle this case. recipes would allow you to carry the offset variable throughout the resampling process. This would make a good case study for that.

I'm at a conference right now, but I'll try to use your example to demonstrate how this works in a vignette. The new system doesn't have parameter tuning implemented yet though...

Hi Max, I'm wondering if you ended up writing that vignette. Cheers

topepo commented 6 years ago

Nope. Not yet. Just created an issue though. tidymodels/recipes#210