This was observed by one of our users, and Erin LeDell alerted me to this problem. Here are the details:
I am using:
h2o 3.16.0.2
Mac OS Sierra 10.12.3
R 3.4.1
I posted this question on Cross Validated (https://stats.stackexchange.com/q/329405/155462) as I wasn't sure it was a programming question. To avoid duplicating it on Stack Overflow, I'm posting here. (Responses there also indicated the problem may be specific to my h2o code.)
I am fitting a Poisson GLM to model claim rates. Since I have 1.5m+ records, I have aggregated my data (to improve efficiency). My understanding is that using aggregated data with a Poisson GLM will not affect the coefficient estimates. Indeed, if I fit a GLM with no regularization (and use the log of exposure as an offset), I can exactly recreate the same coefficient estimates using either the full dataset or the aggregated data.
However, as soon as I introduce a lambda penalty, the coefficients change depending on whether I use the full individual data or the aggregated data. The larger the penalty, the larger the difference in coefficients.
My question is: should I be able to get the same coefficient estimates using aggregated data with a lambda penalty as I do with individual data? And if not, is it 'ok' to use regularization with aggregated data?
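As a sanity check of the no-penalty claim outside H2O, here is a minimal numpy sketch (my own IRLS fitter, not H2O's solver; all names are mine): the Poisson score equations X'(y - mu) = 0 are identical for individual rows and for rows aggregated by covariate pattern with a log(exposure) offset, so the unpenalized MLEs coincide exactly.

```python
import numpy as np

rng = np.random.default_rng(123)

# individual-level data: one binary covariate, 1 exposure period per row
male = rng.binomial(1, 0.4, size=5000)
y = rng.binomial(1, np.where(male == 1, 0.3, 0.7))
X_ind = np.column_stack([np.ones(male.size), male.astype(float)])
off_ind = np.zeros(male.size)          # each row is 1 exposure period

# aggregated data: total claims and total exposure per covariate pattern
claims = np.array([y[male == 0].sum(), y[male == 1].sum()], float)
expo = np.array([(male == 0).sum(), (male == 1).sum()], float)
X_agg = np.array([[1.0, 0.0], [1.0, 1.0]])
off_agg = np.log(expo)                 # log-exposure offset

def fit_poisson(X, y, offset, n_iter=50):
    """Plain Newton/IRLS for an unpenalized Poisson GLM with an offset."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = np.exp(X @ beta + offset)
        grad = X.T @ (y - mu)                 # score vector
        hess = (X * mu[:, None]).T @ X        # Fisher information
        beta += np.linalg.solve(hess, grad)
    return beta

b_ind = fit_poisson(X_ind, y, off_ind)
b_agg = fit_poisson(X_agg, claims, off_agg)
assert np.allclose(b_ind, b_agg, atol=1e-6)   # identical score equations
```

Within each covariate pattern the individual score contributions sum to exactly the aggregated row's score term, which is why the two fits agree to machine precision.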
Here is my code:
```r
library(simstudy)
set.seed(123)

# simulate a dataset with 2 variables - 'Male' and 'Target'
def = defData(varname = "Male", dist = "binary", formula = 0.4)
def = defData(def, varname = "Target", dist = "binary",
              formula = "ifelse(Male == 1, 0.3, 0.7)")

# generate the dataset
dt = genData(50000, def)

# variable to indicate that each row represents 1 exposure period
dt$n = 1
dt$Male = as.factor(dt$Male)
dt = as.data.frame(dt)

# create an aggregated version of the data
dt2 = aggregate(x = dt[c('Target', 'n')], by = list(Male = dt$Male), FUN = sum)

# add an offset column to the aggregated data - the log of exposure
dt2$offset = log(dt2$n)

# initialise h2o
library(h2o)
h2o.init(nthreads = -1)
dt = as.h2o(dt)
dt2 = as.h2o(dt2)
```
```r
# fit GLM to full data, no regularization
mod1 = h2o.glm(x = 'Male', y = 'Target', training_frame = dt, family = 'poisson',
               lambda = 0, alpha = 0, seed = 123)
round(mod1@model$coefficients, 5)
#  Intercept     Male.1
#   -0.35798   -0.83271

# mod2 - use aggregated data with an offset
mod2 = h2o.glm(x = 'Male', y = 'Target', training_frame = dt2, family = 'poisson',
               lambda = 0, alpha = 0, offset_column = 'offset', seed = 123)
round(mod2@model$coefficients, 5)
#  Intercept     Male.1
#   -0.35798   -0.83271
```
```r
# now repeat, with regularization. First, full data
mod3 = h2o.glm(x = 'Male', y = 'Target', training_frame = dt, family = 'poisson',
               lambda = 0.01, alpha = 0, seed = 123)
round(mod3@model$coefficients, 5)
#  Intercept     Male.0     Male.1
#   -0.76286    0.39547   -0.39547

# now on aggregated data
mod4 = h2o.glm(x = 'Male', y = 'Target', training_frame = dt2, family = 'poisson',
               lambda = 0.01, alpha = 0, offset_column = 'offset', seed = 123)
round(mod4@model$coefficients, 5)
#  Intercept     Male.0     Male.1
#   -0.77433    0.41635   -0.41635
```
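One plausible mechanism for the mod3/mod4 discrepancy (an assumption about the objective scaling, not something confirmed from the H2O source): if the penalized objective averages the negative log-likelihood over rows before adding lambda times the penalty, then the same lambda buys very different amounts of shrinkage on 50,000 individual rows versus 2 aggregated rows. A minimal numpy sketch of that scaled-objective effect (my own ridge-Poisson fitter, not H2O's):

```python
import numpy as np

rng = np.random.default_rng(123)
male = rng.binomial(1, 0.4, size=5000)
y = rng.binomial(1, np.where(male == 1, 0.3, 0.7))

X_ind = np.column_stack([np.ones(male.size), male.astype(float)])
off_ind = np.zeros(male.size)

claims = np.array([y[male == 0].sum(), y[male == 1].sum()], float)
expo = np.array([(male == 0).sum(), (male == 1).sum()], float)
X_agg = np.array([[1.0, 0.0], [1.0, 1.0]])
off_agg = np.log(expo)

def fit_ridge_poisson(X, y, offset, lam, n_iter=50):
    """Newton steps on: -(1/n_rows)*log-likelihood + (lam/2)*||slopes||^2.
    The per-row averaging of the likelihood is the key assumption here."""
    n, p = X.shape
    pen = lam * np.append(0.0, np.ones(p - 1))   # intercept unpenalized
    beta = np.zeros(p)
    for _ in range(n_iter):
        mu = np.exp(X @ beta + offset)
        grad = X.T @ (y - mu) / n - pen * beta
        hess = (X * mu[:, None]).T @ X / n + np.diag(pen)
        beta += np.linalg.solve(hess, grad)
    return beta

b_ind = fit_ridge_poisson(X_ind, y, off_ind, lam=0.01)
b_agg = fit_ridge_poisson(X_agg, claims, off_agg, lam=0.01)
print(np.round(b_ind, 5), np.round(b_agg, 5))    # slopes differ noticeably
```

With only 2 aggregated rows, the averaged likelihood term dwarfs the penalty and the fit stays close to the unpenalized MLE, while the 5,000-row fit is visibly shrunk, which qualitatively matches the mod3-vs-mod4 gap growing with lambda.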
Thanks in advance
Josh