ANNIKADAHLMANN-8451 opened this issue 3 months ago
@ANNIKADAHLMANN-8451
That's a good point. When the GLM coefficients for the unimportant columns are zero, scoring should not require those columns to be present in the scoring dataset.
To get around this problem for the time being, you can simply add those columns back, filled with random (or any arbitrary) values, so that the code will not complain.
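For example, here is a minimal sketch of that workaround on toy data (the 10-column Friedman frame, the dropped column names, and the filler value of 0 are just stand-ins for your own setup):

import h2o
import pandas as pd
from sklearn.datasets import make_friedman1
from h2o.estimators.glm import H2OGeneralizedLinearEstimator

h2o.init()

# toy frame with 10 columns; train a GLM on all of them
X, y = make_friedman1(n_samples=1000, n_features=10, random_state=8451)
df = pd.DataFrame(X)
df["y"] = y
hdf = h2o.H2OFrame(df)
model = H2OGeneralizedLinearEstimator()
model.train(x=[c for c in hdf.columns if c != "y"], y="y", training_frame=hdf)

# pretend only the first five columns were kept for scoring
hdf_subset = hdf[:, ["0", "1", "2", "3", "4"]]

# workaround: add the dropped columns back as constant filler so the scorer
# finds every training-time column; the filler leaves predictions unchanged
# only for columns whose coefficients are exactly zero
for col in ["5", "6", "7", "8", "9"]:
    hdf_subset[col] = 0
predictions = model.predict(hdf_subset)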
As an alternative, you can use makeGLMModel to build a new GLM model that includes only the four useful predictors and their coefficients, and then save that model as a MOJO. I will provide code later showing how to do this.
This is not an easy issue to fix. The reason is that we have a base object that deals with all kinds of models (GBM, GLM, GAM, DL, etc.). The other models have no concept of GLM coefficients, so they always take the user's predictors and response as given and build a model from them.
Likewise, shared base objects are used to generate the MOJO, and they read every model's predictor names and response. So when you include only the useful GLM predictors, the MOJO reader cannot find the other predictors and throws an error.
Thanks, Wendy
@ANNIKADAHLMANN-8451
Here is the complete code for building your model with many predictors and then generating a MOJO that uses only the important predictors. Here are my steps:
import sys
import h2o
from h2o.estimators.glm import H2OGeneralizedLinearEstimator as glm
from sklearn.datasets import make_friedman1
import pandas as pd
import tempfile
from h2o.estimators import H2OGenericEstimator

h2o.init()

X, y = make_friedman1(n_samples=10000, n_features=10, random_state=8451)
df = pd.DataFrame(X)
df['y'] = y
hdf = h2o.H2OFrame(df)
predictors = hdf.columns
response = "y"
predictors.remove(response)

model1 = glm()  # model that uses all coefficients
model1.train(x=predictors, y=response, training_frame=hdf)

model2 = glm()  # model that only uses 5 predictors because I pretend the other predictors are useless and have coeff = 0
model2.train(x=["0", "1", "2", "3", "4"], y=response, training_frame=hdf)

coef_model1 = model1.coef()  # grab all coefficients from model1
coeff_dict = {"0": coef_model1["0"], "1": coef_model1["1"], "2": coef_model1["2"],
              "3": coef_model1["3"], "4": coef_model1["4"],
              "Intercept": coef_model1["Intercept"]}  # grab the coefficients we care about

model_with_good_predictors = glm.makeGLMModel(model=model2, coefs=coeff_dict)  # generate model with only 5 predictors; the coefficient values come from the full model
tmpdir = tempfile.mkdtemp()
glm_mojo_model = model_with_good_predictors.download_mojo(tmpdir)  # save to mojo

X, y = make_friedman1(n_samples=100, n_features=5, random_state=8452)
df = pd.DataFrame(X)
df['y'] = y
hdf_test = h2o.H2OFrame(df)  # generate test dataset with only 5 predictors

generic_mojo_glm_from_file = H2OGenericEstimator.from_file(glm_mojo_model)  # load mojo as generic model
predict_mojo = generic_mojo_glm_from_file.predict(hdf_test)
predict_model = model_with_good_predictors.predict(hdf_test)

# if you check the contents of the two prediction frames, they should be the same
for ind in range(hdf_test.nrows):
    assert abs(predict_mojo[ind, 0] - predict_model[ind, 0]) < 1e-10
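If you later want to reuse the saved MOJO in a fresh session without the original model object, here is a short sketch (this assumes the glm_mojo_model path and hdf_test frame from the steps above are still available):

# load the MOJO back into H2O as a generic model and score the 5-column test frame;
# only the five predictors used by the reduced GLM need to be present
reloaded_mojo = h2o.import_mojo(glm_mojo_model)
predict_reloaded = reloaded_mojo.predict(hdf_test)
print(predict_reloaded.head())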
@ANNIKADAHLMANN-8451
Can you work with the code I sent you?
H2O version, Operating System and Environment
I am running H2O on Databricks with the following cluster settings:
and the following version:
Description
We are training an H2O GeneralizedLinearEstimator model on a dataframe that has 100 columns, only 4 of which are actually used to compute y; the remaining features are independent of y (aka unimportant). The data are generated with the following code:

Since this is a model that does internal variable selection, we explored which features were actually deemed important by the model using model.varimp() and were curious what would happen when we scored on a subset of the data containing only those relevant columns. When scoring with only the 4 relevant columns, we received the following warning message for every other column that was deemed irrelevant:

Expected Behavior
We would expect H2O not to need the unimportant columns appended back, but instead to score on the subset of the data (i.e. 4 vs. 100 columns) for speed and cost efficiency.
Steps to reproduce
Here is the code used to reproduce this warning message:
%pip install --quiet h2o
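A hypothetical sketch of such a setup (100 columns, only the first 4 driving y, a GLM with lambda_search so useless coefficients shrink toward zero, then scoring on the 4-column subset); the column names, coefficients, and noise level here are assumptions rather than the original code:

import h2o
import numpy as np
import pandas as pd
from h2o.estimators.glm import H2OGeneralizedLinearEstimator

h2o.init()

# 100 columns, only the first 4 actually drive y; the rest are independent noise
rng = np.random.default_rng(8451)
X = rng.normal(size=(10000, 100))
y = 3 * X[:, 0] - 2 * X[:, 1] + X[:, 2] + 0.5 * X[:, 3] + rng.normal(scale=0.1, size=10000)
df = pd.DataFrame(X, columns=[f"x{i}" for i in range(100)])
df["y"] = y
hdf = h2o.H2OFrame(df)

predictors = [c for c in hdf.columns if c != "y"]
model = H2OGeneralizedLinearEstimator(lambda_search=True)  # lets GLM shrink useless coefficients
model.train(x=predictors, y="y", training_frame=hdf)
print(model.varimp())  # inspect which features the model deems important

# scoring on a frame that holds only the 4 relevant columns is what triggers
# the per-column warnings for the 96 missing predictors
hdf_subset = hdf[:, ["x0", "x1", "x2", "x3"]]
preds = model.predict(hdf_subset)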
Upload logs
Output of h2o.download_all_logs():
h2ologs_20240813_061208.zip