h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

Confusion on Warning when Scoring with Unimportant Features Excluded with GLM model #16363

Open ANNIKADAHLMANN-8451 opened 3 months ago

ANNIKADAHLMANN-8451 commented 3 months ago

H2O version, Operating System and Environment
I am running H2O on Databricks with the following cluster settings:

[cluster settings screenshot not captured]

and the following version:

[version screenshot not captured]

Description
We are training an H2O GeneralizedLinearEstimator model on a dataframe that has 100 columns, only 4 of which are actually used to compute y; the remaining features are independent of y (i.e. unimportant). The data is generated with the following code:

from sklearn.datasets import make_friedman1
import pandas as pd

X, y = make_friedman1(n_samples=1000, n_features=100, random_state=8451)
df = pd.DataFrame(X)
df['y'] = y

df.head()

Since this is a model that does internal variable selection, we explored which features were actually deemed important by the model using model.varimp() and were curious what would happen when we scored using a subset of the data with only those relevant columns. When scoring only using the 4 columns that were relevant, we received the following warning message for all other columns that were deemed irrelevant:
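As background, model.varimp() returns rows in (variable, relative_importance, scaled_importance, percentage) form, and the relevant-column subset falls out of filtering those rows. A minimal sketch of that filtering, using made-up importance values rather than real model output:

```python
# Illustrative only: made-up rows in h2o's varimp() tuple format
# (variable, relative_importance, scaled_importance, percentage) -- not real model output.
varimp = [
    ("3", 2.10, 1.00, 0.55),
    ("1", 1.20, 0.57, 0.31),
    ("0", 0.40, 0.19, 0.10),
    ("4", 0.15, 0.07, 0.04),
    ("7", 0.00, 0.00, 0.00),  # importance of zero -> unimportant column
]

# keep every column whose relative importance is non-zero
relevant = [name for name, rel_imp, _, _ in varimp if rel_imp > 0]
print(sorted(relevant))  # -> ['0', '1', '3', '4']
```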

/local_disk0/.ephemeral_nfs/envs/pythonEnv-4602915f-61f7-4bcd-8c98-f8e3a654a43b/lib/python3.10/site-packages/h2o/job.py:81: UserWarning: Test/Validation dataset is missing column '5': substituting in a column of NaN
  warnings.warn(w)

Expected Behavior
We would expect H2O not to need the unimportant columns appended back, but rather to score on the subset of data (i.e. 4 columns vs. 100) for speed and cost efficiency.

Steps to reproduce
Here is the code used to reproduce this warning message:

%pip install --quiet h2o

import h2o
import pandas as pd
from sklearn.datasets import make_friedman1
from h2o.estimators import H2OGeneralizedLinearEstimator

h2o.init()

X, y = make_friedman1(n_samples=1000, n_features=100, random_state=8451)
df = pd.DataFrame(X)
df['y'] = y
hdf = h2o.H2OFrame(df)

predictors = hdf.columns
response = "y"
predictors.remove(response)

model = H2OGeneralizedLinearEstimator()
model.train(x=predictors, y=response, training_frame=hdf)

X_tst, y_tst = make_friedman1(n_samples=1000, n_features=100, random_state=8452) # intentionally using a different random state to generate a different sample
tst = pd.DataFrame(X_tst)
tst['y'] = y_tst

X_tst_subset = tst[[0, 1, 3, 4]] # relevant features as revealed by model.varimp()
X_tst_subset_hf = h2o.H2OFrame(X_tst_subset)
subset_preds = model.predict(X_tst_subset_hf) # THIS IS WHAT TRIGGERS THE WARNING MESSAGE FOR EACH UNIMPORTANT COLUMN

Upload logs
Output of h2o.download_all_logs(): h2ologs_20240813_061208.zip

wendycwong commented 2 months ago

@ANNIKADAHLMANN-8451

Good point. When the GLM coefficients for unimportant columns are zero, those columns should not need to be present in the scoring dataset.

To help you get around this problem for the time being, just add those columns back with arbitrary values (random numbers work) so that the code will not complain.
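A minimal sketch of that padding workaround using pandas only (the column names assume the integer-named columns from the example above, and the filler value is arbitrary since those columns' GLM coefficients are zero):

```python
import pandas as pd

# Toy stand-in for X_tst_subset: only the relevant columns are present.
subset = pd.DataFrame({0: [0.1, 0.2], 1: [0.3, 0.4], 3: [0.5, 0.6], 4: [0.7, 0.8]})

# Full predictor list the model was trained on (100 columns in the example).
train_columns = list(range(100))

# Pad the frame back out to the full training schema; the filler value is
# arbitrary because the coefficients for those columns are zero.
padded = subset.reindex(columns=train_columns, fill_value=0.0)

print(padded.shape)  # -> (2, 100)
# padded can then be wrapped in h2o.H2OFrame(padded) and passed to predict()
# without triggering the missing-column warning.
```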

As an alternative, you can use makeGLMModel to take the four useful predictors and their coefficients and build a new GLM model with just those 4 coefficients, and then save that model as a mojo. I will provide code later showing how to do this.

This is not an easy issue to fix. The reason is that we have a base object that deals with all kinds of models (GBM, GLM, GAM, DL, etc.). The other models have no concept of GLM coefficients, so they always take the user's full set of predictors and response and build a model from them.

Similarly, shared base objects handle the mojo, reading every model's predictor names and response. So when you include only the useful GLM predictors, it will complain that the other predictors are missing and throw an error.

Thanks, Wendy

wendycwong commented 2 months ago

@ANNIKADAHLMANN-8451

Here is the complete code showing how to build your model with many predictors and then generate a mojo with only the important predictors. Here are my steps:

  1. Generate data (copied from your code)
  2. Generate an H2O model with 10 features (yours has 100 columns) as model1
  3. Grab the coefficients of only 5 predictors (I pretend the other 5 are useless and their coefficients are zero)
  4. Generate a new H2O model with only the 5 predictors grabbed from model1. I first build a new GLM model with only those 5 predictors (model2) and then call makeGLMModel to produce a GLM model with the correct coefficients (model_with_good_predictors)
  5. Save model_with_good_predictors as a mojo
  6. Generate a new test dataset with only the 5 predictors
  7. Load the mojo as a generic model and generate predictions on the new test dataset. You could use mojo predict directly; I use a generic model to make it easy to compare prediction results
  8. Generate predictions with model_with_good_predictors
  9. Compare the predictions from steps 7 and 8; they should be the same

Here is the complete code.

import sys
import tempfile
import pandas as pd
from sklearn.datasets import make_friedman1
import h2o
from h2o.estimators.glm import H2OGeneralizedLinearEstimator as glm
from h2o.estimators import H2OGenericEstimator

X, y = make_friedman1(n_samples=10000, n_features=10, random_state=8451)
df = pd.DataFrame(X)
df['y'] = y
hdf = h2o.H2OFrame(df)
predictors = hdf.columns
response = "y"
predictors.remove(response)

model1 = glm() # model that uses all coefficients
model1.train(x=predictors, y=response, training_frame=hdf)

model2 = glm() # model that only uses 5 predictors because I pretend the other predictors are useless and have coeff = 0
model2.train(x=["0", "1", "2", "3", "4"], y=response, training_frame=hdf)

coef_model1 = model1.coef() # grab all coefficients from model1
coeff_dict = {"0": coef_model1["0"], "1": coef_model1["1"], "2": coef_model1["2"],
              "3": coef_model1["3"], "4": coef_model1["4"],
              "Intercept": coef_model1["Intercept"]} # grab only the coefficients we care about

# generate a model with only 5 predictors whose coefficient values come from the full model
model_with_good_predictors = glm.makeGLMModel(model=model2, coefs=coeff_dict)
tmpdir = tempfile.mkdtemp()
glm_mojo_model = model_with_good_predictors.download_mojo(tmpdir) # save to mojo

X, y = make_friedman1(n_samples=100, n_features=5, random_state=8452)
df = pd.DataFrame(X)
df['y'] = y
hdf_test = h2o.H2OFrame(df) # generate test dataset with only 5 predictors

generic_mojo_glm_from_file = H2OGenericEstimator.from_file(glm_mojo_model) # load mojo as generic model
predict_mojo = generic_mojo_glm_from_file.predict(hdf_test)
predict_model = model_with_good_predictors.predict(hdf_test)

# if you check the contents of the two prediction frames, they should be the same
for ind in range(hdf_test.nrows):
    assert abs(predict_mojo[ind, 0] - predict_model[ind, 0]) < 1e-10

wendycwong commented 2 months ago

@ANNIKADAHLMANN-8451

Can you work with the code I sent you?