h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

GLM poisson family returns different coefficients when regularization is enabled #12263

Open exalate-issue-sync[bot] opened 1 year ago

exalate-issue-sync[bot] commented 1 year ago

This was observed by one of our users, and Erin LeDell alerted me to the problem. Here are the details:

I am using:

I posted this question on Cross Validated (https://stats.stackexchange.com/q/329405/155462) as I wasn't sure it was a programming question, and I'm posting it here rather than on Stack Overflow to avoid duplication. (Responses there also indicated the problem may be specific to my H2O code.)

I am fitting a Poisson GLM to model claim rates. Since I have 1.5m+ records, I have aggregated my data to improve efficiency. My understanding is that using aggregated data with a Poisson GLM will not affect the coefficient estimates. Indeed, if I fit a GLM with no regularization (using the log of exposure as an offset), I can exactly recreate the same coefficient estimates from either the full dataset or the aggregated data.
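This equivalence is easy to check with plain stats::glm as well. Here is a minimal sketch using stand-in simulated data (an assumption for illustration, not the reporter's dataset) that mirrors the Male / Target / n variables used in the code below:

```r
# Minimal illustration with stats::glm (assumed stand-in data):
# individual binary rows vs. aggregated counts with a log-exposure offset.
set.seed(1)
ind <- data.frame(Male = factor(rbinom(10000, 1, 0.4)))
ind$Target <- rbinom(nrow(ind), 1, ifelse(ind$Male == "1", 0.3, 0.7))
ind$n <- 1  # each row is one exposure period

# aggregate events and exposure by the single predictor
agg <- aggregate(cbind(Target, n) ~ Male, data = ind, FUN = sum)

# unpenalized Poisson fits: the coefficients should agree up to the
# fitting tolerance
coef(glm(Target ~ Male, family = poisson(), data = ind))
coef(glm(Target ~ Male, family = poisson(), offset = log(n), data = agg))
```

Both calls print the same intercept and Male coefficient, up to convergence tolerance.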

However, as soon as I introduce a lambda penalty, the coefficients change depending on whether I use full individual or aggregated data. The larger the penalty, the larger the difference in coefficients.

My question is: should I be able to get the same coefficient estimates using aggregated data with a lambda penalty as I do with individual data? And if not, is it still acceptable to use regularization with aggregated data?
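For intuition, here is a sketch of why the answer need not be yes, under the assumption that H2O's elastic-net objective averages the likelihood over the rows of the training frame (the developers would need to confirm whether this is exactly what happens here). The unpenalized fits agree because the aggregated Poisson log-likelihood with a log-exposure offset differs from the individual-row log-likelihood only by a term that does not depend on the coefficients. The penalized problem, however, minimizes something of the form

$$
\min_{\beta_0,\,\beta}\;
\frac{1}{N}\sum_{i=1}^{N}\left(e^{\eta_i} - y_i\,\eta_i\right)
\;+\;
\lambda\left(\alpha\lVert\beta\rVert_1 + \frac{1-\alpha}{2}\lVert\beta\rVert_2^2\right),
\qquad
\eta_i = \beta_0 + x_i^\top\beta + \mathrm{offset}_i,
$$

and N is 50,000 for the individual frame but only 2 for the aggregated one, so the same lambda weighs the penalty differently against the data term in the two problems, and the penalized optima need not coincide.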

Here is my code:

```r
library(simstudy)
set.seed(123)

# simulate a dataset with 2 variables - 'Male' and 'Target'
def = defData(varname = "Male", dist = "binary", formula = 0.4)
def = defData(def, varname = "Target", dist = "binary", formula = "ifelse(Male == 1, 0.3, 0.7)")

# generate the dataset
dt = genData(50000, def)

# variable to indicate that each row represents 1 exposure period
dt$n = 1

dt$Male = as.factor(dt$Male)
dt = as.data.frame(dt)

# create an aggregated version of the data
dt2 = aggregate(x = dt[c('Target', 'n')], by = list(Male = dt$Male), FUN = sum)

# add an offset column to the aggregated data - log of exposure
dt2$offset = log(dt2$n)

# initialise h2o
library(h2o)
h2o.init(nthreads = -1)
dt = as.h2o(dt)
dt2 = as.h2o(dt2)

# fit GLM to full data, no regularization
mod1 = h2o.glm(x = 'Male', y = 'Target', training_frame = dt, family = 'poisson',
               lambda = 0, alpha = 0, seed = 123)
round(mod1@model$coefficients, 5)
# Intercept    Male.1
#  -0.35798  -0.83271

# mod2 - use aggregated data with an offset
mod2 = h2o.glm(x = 'Male', y = 'Target', training_frame = dt2, family = 'poisson',
               lambda = 0, alpha = 0, offset_column = 'offset', seed = 123)
round(mod2@model$coefficients, 5)
# Intercept    Male.1
#  -0.35798  -0.83271

# now repeat, with regularization. First, full data
mod3 = h2o.glm(x = 'Male', y = 'Target', training_frame = dt, family = 'poisson',
               lambda = 0.01, alpha = 0, seed = 123)
round(mod3@model$coefficients, 5)
# Intercept    Male.0    Male.1
#  -0.76286   0.39547  -0.39547

# now on aggregated data
mod4 = h2o.glm(x = 'Male', y = 'Target', training_frame = dt2, family = 'poisson',
               lambda = 0.01, alpha = 0, offset_column = 'offset', seed = 123)
round(mod4@model$coefficients, 5)
# Intercept    Male.0    Male.1
#  -0.77433   0.41635  -0.41635
```
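The widening gap at larger penalties can be seen by rerunning the two regularized fits with a bigger lambda. This sketch reuses the dt and dt2 frames defined above; the value 0.1 is an arbitrary illustrative choice and the output is not reproduced here:

```r
# same fits as mod3 / mod4, but with a stronger ridge penalty
mod5 = h2o.glm(x = 'Male', y = 'Target', training_frame = dt, family = 'poisson',
               lambda = 0.1, alpha = 0, seed = 123)
mod6 = h2o.glm(x = 'Male', y = 'Target', training_frame = dt2, family = 'poisson',
               lambda = 0.1, alpha = 0, offset_column = 'offset', seed = 123)

# per the report above, the gap between the two coefficient vectors
# should be larger than at lambda = 0.01
round(mod5@model$coefficients, 5)
round(mod6@model$coefficients, 5)
```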

Thanks in advance

Josh

hasithjp commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-5396
Assignee: New H2O Bugs
Reporter: Wendy
State: Open
Fix Version: N/A
Attachments: N/A
Development PRs: N/A