Open exalate-issue-sync[bot] opened 1 year ago
Wendy Wong commented: Basically here is the problem:
The parameter called {{ignored_columns}} (see [+link+|https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/ignored_columns.html]) lets the user specify columns that should be ignored when building a model.
When I build a simple ML model and look at the feature importance, I can see that {{h2o}} ignores the column that I specified: the ignored column does not appear in the feature importance. As shown below, column {{c}} is not used during training.
{noformat}
import pandas as pd
import h2o
from h2o.estimators import H2ODeepLearningEstimator
from h2o.grid.grid_search import H2OGridSearch
from h2o.estimators.random_forest import H2ORandomForestEstimator

h2o.init()

x = pd.DataFrame(
    [[0, 1, 4], [5, 1, 6], [15, 2, 0], [25, 5, 32],
     [35, 11, 89], [45, 15, 1], [55, 34, 3], [60, 35, 4]],
    columns=['a', 'b', 'c'])
y = pd.DataFrame([4, 5, 20, 14, 32, 22, 38, 43], columns=['label'])
hf = h2o.H2OFrame(pd.concat([x, y], axis="columns"))

X = hf.col_names[:-1]
y = hf.col_names[-1]

model = H2ORandomForestEstimator(ignored_columns=['c'])
model.train(y=y, training_frame=hf)
model.varimp(use_pandas=True)

  variable  relative_importance  scaled_importance  percentage
0        b         33876.328125           1.000000    0.540893
1        a         28753.998047           0.848793    0.459107
{noformat}
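The check on the feature importance can also be scripted instead of read off by eye. This is a minimal sketch, assuming {{model.varimp(use_pandas=False)}} returns a list of {{(variable, relative_importance, scaled_importance, percentage)}} tuples; {{leaked_columns}} is a hypothetical helper name, not part of H2O.

```python
def leaked_columns(varimp_rows, ignored):
    """Return the ignored columns that still show up in a variable-importance table.

    varimp_rows: iterable of tuples whose first element is the variable name
                 (assumed shape of model.varimp(use_pandas=False)).
    ignored:     column names that were passed as ignored_columns.
    """
    used = {row[0] for row in varimp_rows}
    # Preserve the order of `ignored` so the result is deterministic.
    return [col for col in ignored if col in used]


# Values copied from the single-model output in this report.
single_model_rows = [
    ("b", 33876.328125, 1.000000, 0.540893),
    ("a", 28753.998047, 0.848793, 0.459107),
]
single_model_leaks = leaked_columns(single_model_rows, ["c"])
```

An empty result means every ignored column was left out of training; against a live model the call would be {{leaked_columns(model.varimp(use_pandas=False), ['c'])}}.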
However, when I use grid search for hyperparameter tuning, {{ignored_columns}} does not seem to take effect: column {{c}} shows up in the feature importance of the best model.
{noformat}
params = {
    'max_depth': list(range(7, 16)),
    'sample_rate': [0.8],
}
criteria = {'strategy': 'RandomDiscrete', 'max_models': 4}

grid = H2OGridSearch(
    model=H2ORandomForestEstimator(ignored_columns=['c']),
    search_criteria=criteria,
    hyper_params=params)
grid.train(y=y, training_frame=hf)

best_model = grid.get_grid(sort_by='rmse', decreasing=False)[0]
best_model.varimp(use_pandas=True)

  variable  relative_importance  scaled_importance  percentage
0        a         33525.109375           1.000000    0.516545
1        b         23314.916016           0.695446    0.359230
2        c          8062.515137           0.240492    0.124225
{noformat}
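A possible workaround, not confirmed anywhere in this report, is to skip {{ignored_columns}} and pass the predictor list explicitly through the {{x}} argument of {{grid.train}}. The sketch below assumes that doing so keeps column {{c}} out of every grid model; {{select_predictors}} and {{train_grid_without}} are hypothetical helper names, not part of H2O.

```python
def select_predictors(col_names, response, ignored):
    """Return the columns to pass as x: all columns except the response and the ignored ones."""
    excluded = set(ignored) | {response}
    return [col for col in col_names if col not in excluded]


def train_grid_without(hf, response, ignored, hyper_params, search_criteria):
    """Train a random forest grid, excluding columns via x rather than ignored_columns.

    Assumption: an explicit x list restricts the predictors of every grid model.
    """
    # Imported lazily so select_predictors can be used without H2O installed or running.
    from h2o.estimators.random_forest import H2ORandomForestEstimator
    from h2o.grid.grid_search import H2OGridSearch

    predictors = select_predictors(hf.col_names, response, ignored)
    grid = H2OGridSearch(
        model=H2ORandomForestEstimator,
        hyper_params=hyper_params,
        search_criteria=search_criteria,
    )
    grid.train(x=predictors, y=response, training_frame=hf)
    # Same sorting as in the report: lowest RMSE first.
    return grid.get_grid(sort_by="rmse", decreasing=False)
```

With the frame and settings from the report, this would be called as {{train_grid_without(hf, 'label', ['c'], params, criteria)[0]}}, and the variable importance of the returned model can then be inspected for {{c}}.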
JIRA Issue Details
Jira Issue: PUBDEV-9006
Assignee: Erin LeDell
Reporter: Wendy Wong
State: Open
Fix Version: N/A
Attachments: N/A
Development PRs: N/A
Wendy Wong commented: Here is the link to SO:
[https://stackoverflow.com/questions/75473825/ignored-column-is-not-working-when-using-grid-search-in-h2o-library-python]