h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

H2O should standardize categorical predictors (dummy/one-hot) for regularized GLM models (lasso, ridge, elastic net) #8523

Open exalate-issue-sync[bot] opened 1 year ago

exalate-issue-sync[bot] commented 1 year ago

The h2o.glm function includes a `standardize` parameter that is true by default and standardizes continuous predictors. However, if predictors are stored as factors within the input H2OFrame, H2O does not standardize the automatically encoded factor variables (i.e., the resulting dummy or one-hot vectors). I've confirmed this experimentally, but references to this decision also show up in the source code:
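For reference, here is a minimal sketch (Python h2o client; the data and column names are made up for illustration) of the kind of experiment that exposes the behavior. If the description above holds, the dummy coefficients reported by `coef()` (original scale) and `coef_norm()` (standardized scale) will match, while the numeric ones differ:

```python
import h2o
from h2o.estimators import H2OGeneralizedLinearEstimator

h2o.init()

# Hypothetical frame with one numeric and one categorical predictor.
df = h2o.H2OFrame({
    "x_num": [0.1, 2.3, 1.7, 4.2, 3.3, 0.9],
    "x_cat": ["a", "b", "a", "c", "b", "a"],
    "y":     [0, 1, 0, 1, 1, 0],
})
df["x_cat"] = df["x_cat"].asfactor()
df["y"] = df["y"].asfactor()

model = H2OGeneralizedLinearEstimator(
    family="binomial",
    alpha=1.0,          # lasso
    lambda_=0.01,
    standardize=True,   # standardizes numeric predictors only
)
model.train(x=["x_num", "x_cat"], y="y", training_frame=df)

# Compare original-scale and standardized coefficients; for the dummy
# columns the two should agree, which is the symptom described above.
print(model.coef())
print(model.coef_norm())
```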

For instance, the method denormalizeBeta (https://github.com/h2oai/h2o-3/blob/553321ad5c061f4831c2c603c828a45303e63d2e/h2o-algos/src/main/java/hex/DataInfo.java#L359) includes the comment "denormalize only the numeric coefs (categoricals are not normalized)." Likewise, the setTransform method (https://github.com/h2oai/h2o-3/blob/553321ad5c061f4831c2c603c828a45303e63d2e/h2o-algos/src/main/java/hex/DataInfo.java#L599) appears to calculate means (the _normSub variable) and standard deviations (the inverse of the _normMul variable) only for the numeric variables, not the categorical ones.
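To make that step concrete, here is a rough Python rendering of the arithmetic those fields imply; the function name, signature, and coefficient layout are assumptions for illustration, not H2O's actual API:

```python
def denormalize_beta(beta_std, norm_sub, norm_mul, num_numeric):
    """Map coefficients fitted on standardized data back to the raw scale.

    beta_std:    standardized coefficients, assumed laid out as categorical
                 dummies first, then numeric columns, then the intercept.
    norm_sub:    per-numeric-column means (H2O's _normSub).
    norm_mul:    per-numeric-column 1/SD values (H2O's _normMul).
    num_numeric: number of numeric columns.
    """
    beta = list(beta_std)
    intercept = beta[-1]
    n_cat = len(beta) - 1 - num_numeric
    for j in range(num_numeric):
        k = n_cat + j
        beta[k] = beta_std[k] * norm_mul[j]   # beta_raw = beta_std / SD
        intercept -= beta[k] * norm_sub[j]    # absorb the mean shift
    # Categorical (dummy) coefficients are returned untouched -- this is
    # the asymmetry the issue is about.
    beta[-1] = intercept
    return beta
```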

However, per Tibshirani (1997), dummy or one-hot variables should also be standardized to enable fair regularization penalties:

{quote}"The lasso method requires initial standardization of the regressors, so that the penalization scheme is fair to all regressors. For categorical regressors, one codes the regressor with dummy variables and then standardizes the dummy variables" (p. 394).

Tibshirani, R. (1997). The lasso method for variable selection in the Cox model. Statistics in medicine, 16(4), 385-395. http://statweb.stanford.edu/~tibs/lasso/fulltext.pdf{quote}

Statistical Considerations:

For a dummy variable or one-hot vector with mean p (the proportion of TRUE values), the SD is σ = √(p(1 − p)), a function of the mean. The SD reaches its maximum of σ = 0.5 when the proportions of TRUE and FALSE values are equal (p = 0.5), and the sample SD (s) converges to this population value as n → ∞. Thus, if continuous predictors are standardized to have SD = 1 but dummy variables are left unstandardized, the continuous predictors will have at least twice the SD of the dummy predictors, and more than twice the SD of imbalanced dummy variables.
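A quick numeric sanity check of that arithmetic (plain numpy, not H2O code):

```python
import numpy as np

# The SD of a dummy with mean p is sqrt(p * (1 - p)), peaking at 0.5
# when p = 0.5 and shrinking as the dummy becomes imbalanced.
for p in [0.5, 0.3, 0.1, 0.01]:
    x = np.random.binomial(1, p, size=1_000_000)
    print(f"p={p:4}: sample SD={x.std():.4f}, sqrt(p(1-p))={np.sqrt(p * (1 - p)):.4f}")
```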

It seems like this could be a problem for regularization (lasso, ridge, elastic net), because these methods assume predictors are on a common scale, so that the regularization penalty (λ) applies evenly across predictors. If two predictors A and B have the same standardized effect size but A has a smaller SD than B, A will necessarily have a larger unstandardized coefficient than B (since β_raw = β_std / SD). This means that, if left unstandardized, the regularization penalty will erroneously be more severe to A than to B. In a regularized regression with a mixture of standardized continuous predictors and unstandardized categorical predictors, this could lead to systematic over-penalization of the categorical predictors.
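As a hedged illustration of the over-penalization argument (numpy + scikit-learn rather than H2O, with an arbitrary penalty), consider two predictors carrying identical standardized effects, one at SD = 1 and one an imbalanced dummy at SD ≈ 0.3:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n = 10_000
x_cont = rng.standard_normal(n)                   # SD = 1
x_dummy = rng.binomial(1, 0.1, n).astype(float)   # SD = sqrt(0.1 * 0.9) ~ 0.3

# Equal standardized effect sizes: one SD of either predictor moves y by 1.
y = x_cont + (x_dummy - x_dummy.mean()) / x_dummy.std() + rng.standard_normal(n)

X = np.column_stack([x_cont, x_dummy])
fit = Lasso(alpha=0.5).fit(X, y)
print(fit.coef_)
```

At this penalty the dummy's coefficient is driven all the way to zero while the continuous predictor's survives, even though both carry the same standardized signal; standardizing the dummy column first removes the asymmetry.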

h2o-ops commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-7117
Assignee: New H2O Bugs
Reporter: Elliot
State: Open
Fix Version: N/A
Attachments: N/A
Development PRs: N/A