h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

Gridable parameters may not actually be gridable and generate Exception: java.lang.NullPointerException if set #9692

Open exalate-issue-sync[bot] opened 1 year ago

exalate-issue-sync[bot] commented 1 year ago

I am trying to test GridSearch for a GLM model (to start with). However, I run into inconsistencies in parameter names (depending on which method you use to get them) and into gridable parameters that are not really gridable. If you set one of those in your hyper_params dict, you get a java.lang.NullPointerException.
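For reference, here is a minimal sketch of the kind of grid search that blows up; the inline toy frame and the choice of fold_column as the offending hyperparameter are illustrative assumptions, not the exact setup:

```python
import h2o
from h2o.estimators.glm import H2OGeneralizedLinearEstimator
from h2o.grid.grid_search import H2OGridSearch

h2o.init()

# tiny inline frame standing in for the real data (hypothetical values)
train = h2o.H2OFrame({
    "x1": [1.0, 2.5, 3.1, 0.7, 4.2, 2.2],
    "x2": [0.3, 1.1, 2.0, 0.9, 3.3, 1.8],
    "y":  [1.2, 2.9, 4.1, 1.0, 5.5, 3.0],
})

# 'alpha' grids fine; adding a parameter that is reported as gridable but
# that the backend cannot actually grid over (e.g. 'fold_column') is the
# kind of setting that triggers the NPE
hyper_params = {
    "alpha": [0.0, 0.5, 1.0],
    "fold_column": ["x1"],   # reported gridable, values purely illustrative
}

grid = H2OGridSearch(
    model=H2OGeneralizedLinearEstimator(family="gaussian"),
    hyper_params=hyper_params,
)
grid.train(x=["x1", "x2"], y="y", training_frame=train)
# -> Exception: java.lang.NullPointerException
```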

A GLM model, for example, reports several different parameter lists depending on how you query it:

  1. model._parms.keys() returns: ['objective_epsilon', 'family', 'missing_values_handling', 'tweedie_link_power', 'nfolds', 'beta_constraints', 'checkpoint', 'compute_p_values', 'lambda_search', 'model_id', 'standardize', 'intercept', 'lambda_min_ratio', 'gradient_epsilon', 'alpha', 'non_negative', 'max_active_predictors', 'solver', 'beta_epsilon', 'nlambdas', 'fold_assignment', 'tweedie_variance_power', 'keep_cross_validation_predictions', 'prior', 'link', 'max_iterations', 'remove_collinear_columns', 'lambda']

  2. model._model_json["parameters"] lets you extract the gridable parameter names, among other things: ['validation_frame', 'fold_assignment', 'fold_column', 'response_column', 'offset_column', 'weights_column', 'tweedie_variance_power', 'tweedie_link_power', 'alpha', 'lambda', 'missing_values_handling', 'max_runtime_secs']

Note that the two sets are different. Following a suggestion from Ludi, I took the intersection of the two and grabbed the parameters that are in both sets and are gridable. I ended up with the following list:

['fold_column', 'missing_values_handling', 'tweedie_link_power', 'fold_assignment', 'tweedie_variance_power', 'weights_column', 'alpha', 'lambda', 'offset_column']

However, out of the above list, the ones that are truly gridable are:

['missing_values_handling', 'tweedie_link_power', 'fold_assignment', 'tweedie_variance_power', 'alpha', 'lambda']
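Roughly, the intersection above can be computed like this, assuming a trained GLM model object and a gridable flag exposed per parameter in the model JSON (a sketch, not the exact code I ran):

```python
# parameter names the Python client object knows about
client_params = set(model._parms.keys())

# parameters reported by the backend, keeping only those flagged gridable
backend_gridable = {
    p["name"] for p in model._model_json["parameters"] if p.get("gridable")
}

# candidates: known to the client AND flagged gridable by the backend;
# fold_column / weights_column / offset_column still NPE when gridded over
candidates = sorted(client_params & backend_gridable)
print(candidates)
```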

I talked to Michal and he agreed that the grid search and algo implementers need to sit down and hash out which parameters are truly gridable, and set them accordingly for each algo.

  3. model.params.keys() gives yet another list that is different from the above: [u'objective_epsilon', u'fold_column', u'family', u'model_id', u'missing_values_handling', u'tweedie_link_power', u'keep_cross_validation_predictions', u'nfolds', u'max_after_balance_size', u'beta_constraints', u'training_frame', u'prior', u'max_runtime_secs', u'balance_classes', u'validation_frame', u'offset_column', u'lambda_search', u'response_column', u'standardize', u'ignored_columns', u'score_each_iteration', u'intercept', u'link', u'gradient_epsilon', u'alpha', u'max_confusion_matrix_size', u'non_negative', u'max_active_predictors', u'compute_p_values', u'solver', u'beta_epsilon', u'nlambdas', u'fold_assignment', u'tweedie_variance_power', u'weights_column', u'ignore_const_cols', u'max_hit_ratio_k', u'lambda_min_ratio', u'max_iterations', u'class_sampling_factors', u'remove_collinear_columns', u'lambda']
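A quick, hypothetical way to see how far apart the three lists are (plain set arithmetic on a trained model object, using only the attributes already mentioned above):

```python
parms_keys  = set(model._parms.keys())
json_names  = {p["name"] for p in model._model_json["parameters"]}
params_keys = set(model.params.keys())

print("in model.params but not _parms:  ", sorted(params_keys - parms_keys))
print("in _parms but not the model JSON:", sorted(parms_keys - json_names))
print("in the model JSON but not _parms:", sorted(json_names - parms_keys))
```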

Do we want to consolidate all of these parameter lists somehow? Why are there so many?

h2o-ops commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-2755
Assignee: Michal Malohlava
Reporter: Wendy
State: In Progress
Fix Version: N/A
Attachments: N/A
Development PRs: N/A