marjoleinF / pre

an R package for deriving Prediction Rule Ensembles
58 stars 17 forks source link

Can offsets be used in the natural way for poisson models? #34

Open samkodes opened 3 weeks ago

samkodes commented 3 weeks ago

Hi - I'm trying to understand whether offsets can be used for Poisson models in the natural way.

By this I mean that my observations have heterogenous "observation length", which is naturally modelled in Poisson regression by using an offset corresponding to the log of the observation length.

I would like to do the same thing in pre, but I can't figure out if this is possible. Ideally this would mean that both the lasso stage and rule generation stage would be able to use offsets. Possibly this would be more natural for the glmtree rule generation approach since no pseudo-observations are needed, and the "natural" offset can simply be added to the offset for each iteration?

Thanks for any advice you can give.

marjoleinF commented 3 weeks ago

Thanks for your message!

Passing offsets is now possible, with the current development version, provided that you:

1) Pass the offset argument to function pre as one would pass it to function cv.glmnet. That is, specify a vector of offset values, it should not refer to a column of data.

2) Specify use.grad = FALSE in the call to pre. This uses function partykit::glmtree for tree generation. Which has a natural way to deal with offsets. (The name of the use.grad argument is not well chosen, though; gradient boosting will be used with learnrate values $>0$, just in a slightly different manner (using the offset argument instead of explicitly computing pseudo-residuals / gradients.))

Note that the offset argument is not documented, it is passed through the ellipsis ...., both for rule induction and estimation of the final model. The reason for not documenting this (yet) is that the offsets will only affect tree induction when use.grad = FALSE. Otherwise, the offset will only be passed to cv.glmnet, which was previously also the case.

Here's an example:

> library("devtools")
> install_github("marjoleinF/pre") ## latest devel version
> library("pre")

> ## Create a dataset with Poisson outcome and offset that has an effect
> set.seed(42)
> x1 <- 10*round(rnorm(90))+20
> counts <- rep(as.integer(c(18, 17, 15, 20, 10, 20, 25, 13, 12)), times = 10) + x1
> outcome <- rep(gl(3, 1, 9), times = 10)
> treatment <- rep(gl(3, 3), times = 10)
> noise1 <- 1:90
> set.seed(1)
> noise2 <- rnorm(90)
> countdata <- data.frame(treatment, outcome, counts, noise1, noise2)
> 
> ## Fit PRE with offset
> set.seed(42)
> count.ens.offs <- pre(counts ~ ., data = countdata, offset = log(x1+20), 
+                       use.grad = FALSE, family = "poisson")
There were 50 or more warnings (use warnings() to see the first 50)
> ## warnings can be safely ignored, occur because of odd data generation and offset
> summary(count.ens.offs)

Final ensemble with cv error within 1se of minimum: 

  lambda =  0.2440189
  number of terms = 9
  mean cv error (se) = 0.4818379 (0.2008218)

  cv error type : Poisson Deviance

> length(count.ens.offs$rules$description) ## Count total number of generated rules
[1] 26
> predict(count.ens.offs, newdata = countdata[1:5, ], newoffset = x1[1:5])
        1         2         3         4         5 
29.972884  9.876221 19.905560 30.031489 19.829377 
> try(predict(count.ens.offs, newdata = countdata[1:5, ])) ## yields error, due to lack of offset
Error : No newoffset provided for prediction, yet offset used in fit of glmnet
> 
> ## Fit PRE without offset and compare
> set.seed(42)
> count.ens <- pre(counts ~ ., data = countdata, use.grad = FALSE, family = "poisson")
> summary(count.ens) ## Different set of rules and coefficients than with use of offset

Final ensemble with cv error within 1se of minimum: 

  lambda =  1.296296
  number of terms = 0
  mean cv error (se) = 4.570162 (0.9577355)

  cv error type : Poisson Deviance

> length(count.ens$rules$description) ## different set of rules generated than with offset
[1] 8
> predict(count.ens, newdata = countdata[1:5, ]) ## different predictions than with offset
       1        2        3        4        5 
3.616906 3.616906 3.616906 3.616906 3.616906 
samkodes commented 3 weeks ago

Thanks so much for the quick and detailed response. I will install the dev version and try it out!