h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

Ignored columns not used during gridsearch for randomforest and possibly other algos as well #6502

Open exalate-issue-sync[bot] opened 1 year ago

exalate-issue-sync[bot] commented 1 year ago

Wendy Wong commented: Here is the link to SO:

https://stackoverflow.com/questions/75473825/ignored-column-is-not-working-when-using-grid-search-in-h2o-library-python

exalate-issue-sync[bot] commented 1 year ago

Wendy Wong commented: Basically here is the problem:

The {{ignored_columns}} parameter (see [+link+|https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/ignored_columns.html]) lets the user name features that should be ignored when building a model.

When I build a simple ML model and inspect the feature importance, I can see that {{h2o}} ignores the column I specified during training. As shown below, column {{c}} is not used during training.

{noformat}
import pandas as pd
import h2o
from h2o.estimators import H2ODeepLearningEstimator
from h2o.grid.grid_search import H2OGridSearch
from h2o.estimators.random_forest import H2ORandomForestEstimator

h2o.init()

x = pd.DataFrame(
    [[0, 1, 4], [5, 1, 6], [15, 2, 0], [25, 5, 32],
     [35, 11, 89], [45, 15, 1], [55, 34, 3], [60, 35, 4]],
    columns=['a', 'b', 'c'])
y = pd.DataFrame([4, 5, 20, 14, 32, 22, 38, 43], columns=['label'])
hf = h2o.H2OFrame(pd.concat([x, y], axis="columns"))

X = hf.col_names[:-1]
y = hf.col_names[-1]

model = H2ORandomForestEstimator(ignored_columns=['c'])

model.train(y=y, training_frame=hf)
model.varimp(use_pandas=True)

  variable  relative_importance  scaled_importance  percentage
0        b         33876.328125           1.000000    0.540893
1        a         28753.998047           0.848793    0.459107
{noformat}
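The manual check described above (reading the variable importance table to see whether an ignored column was used) can be automated. The following is a minimal sketch; the helper name {{leaked_ignored_columns}} is hypothetical and not part of the H2O API. With a live model, the variable names would presumably come from {{model.varimp(use_pandas=True)["variable"]}}.

```python
def leaked_ignored_columns(varimp_variables, ignored_columns):
    """Return the ignored columns that still appear among the model's variables.

    Hypothetical helper for illustration; it is not part of the H2O API.
    """
    used = set(varimp_variables)
    return [col for col in ignored_columns if col in used]


# Variable names copied from the varimp table of the single random forest above.
single_model_variables = ["b", "a"]
print(leaked_ignored_columns(single_model_variables, ["c"]))
```

An empty result means every ignored column was left out of training, which is what the single-model run shows.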

However, when I use grid search for hyperparameter tuning, {{ignored_columns}} does not seem to work: column {{c}} shows up in the variable importance of the best model.

{noformat}
params = {
    'max_depth': list(range(7, 16)),
    'sample_rate': [0.8],
}
criteria = {'strategy': 'RandomDiscrete', 'max_models': 4}
grid = H2OGridSearch(
    model=H2ORandomForestEstimator(ignored_columns=['c']),
    search_criteria=criteria,
    hyper_params=params)

grid.train(y=y, training_frame=hf)
best_model = grid.get_grid(sort_by='rmse', decreasing=False)[0]
best_model.varimp(use_pandas=True)

  variable  relative_importance  scaled_importance  percentage
0        a         33525.109375           1.000000    0.516545
1        b         23314.916016           0.695446    0.359230
2        c          8062.515137           0.240492    0.124225
{noformat}
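Until this is fixed, one possible workaround is to pass the predictor list explicitly through the {{x}} argument of {{grid.train}} instead of relying on {{ignored_columns}}. The sketch below has not been verified against this bug: the helper names {{predictor_columns}} and {{train_grid_with_explicit_x}} are hypothetical, and it assumes the grid honors {{x}} for every model it builds.

```python
def predictor_columns(all_columns, target, ignored):
    """Return the columns to pass as x: everything except the target and ignored ones."""
    excluded = set(ignored) | {target}
    return [col for col in all_columns if col not in excluded]


def train_grid_with_explicit_x(hf, target, ignored, hyper_params, search_criteria):
    """Untested workaround sketch: select predictors via x rather than ignored_columns."""
    # Imported lazily so predictor_columns can be used without an H2O installation.
    from h2o.estimators.random_forest import H2ORandomForestEstimator
    from h2o.grid.grid_search import H2OGridSearch

    grid = H2OGridSearch(
        model=H2ORandomForestEstimator(),
        hyper_params=hyper_params,
        search_criteria=search_criteria,
    )
    grid.train(
        x=predictor_columns(hf.col_names, target, ignored),
        y=target,
        training_frame=hf,
    )
    return grid


# Column names match the example frame above; no H2O cluster is needed for this line.
print(predictor_columns(["a", "b", "c", "label"], "label", ["c"]))
```

With the frame from the report, the call would look like {{train_grid_with_explicit_x(hf, 'label', ['c'], params, criteria)}}. Whether {{c}} then stays out of the variable importance still needs to be checked on a running cluster.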

h2o-ops commented 1 year ago

JIRA Issue Details

Jira Issue: PUBDEV-9006
Assignee: Erin LeDell
Reporter: Wendy Wong
State: Open
Fix Version: N/A
Attachments: N/A
Development PRs: N/A