I am trying to test GridSearch for a GLM model (to start with). However, I ran into inconsistency problems: the parameter names you get depend on which method you use to list them, and some parameters reported as gridable are not really gridable. If you set one of those in your hyper_params dict, you get a java.lang.NullPointerException.

A GLM model (for example) exposes its parameters in several ways. model._parms.keys() gives:

['objective_epsilon', 'family', 'missing_values_handling', 'tweedie_link_power', 'nfolds', 'beta_constraints', 'checkpoint', 'compute_p_values', 'lambda_search', 'model_id', 'standardize', 'intercept', 'lambda_min_ratio', 'gradient_epsilon', 'alpha', 'non_negative', 'max_active_predictors', 'solver', 'beta_epsilon', 'nlambdas', 'fold_assignment', 'tweedie_variance_power', 'keep_cross_validation_predictions', 'prior', 'link', 'max_iterations', 'remove_collinear_columns', 'lambda']

From model._model_json["parameters"] you can extract, among other things, the parameter names marked as gridable:

['validation_frame', 'fold_assignment', 'fold_column', 'response_column', 'offset_column', 'weights_column', 'tweedie_variance_power', 'tweedie_link_power', 'alpha', 'lambda', 'missing_values_handling', 'max_runtime_secs']

And model.params.keys() gives yet another list, different from both of the above:

[u'objective_epsilon', u'fold_column', u'family', u'model_id', u'missing_values_handling', u'tweedie_link_power', u'keep_cross_validation_predictions', u'nfolds', u'max_after_balance_size', u'beta_constraints', u'training_frame', u'prior', u'max_runtime_secs', u'balance_classes', u'validation_frame', u'offset_column', u'lambda_search', u'response_column', u'standardize', u'ignored_columns', u'score_each_iteration', u'intercept', u'link', u'gradient_epsilon', u'alpha', u'max_confusion_matrix_size', u'non_negative', u'max_active_predictors', u'compute_p_values', u'solver', u'beta_epsilon', u'nlambdas', u'fold_assignment', u'tweedie_variance_power', u'weights_column', u'ignore_const_cols', u'max_hit_ratio_k', u'lambda_min_ratio', u'max_iterations', u'class_sampling_factors', u'remove_collinear_columns', u'lambda']

Note that these sets are all different. Following a suggestion from Ludi, I took the intersection and kept only the parameters that appear in both sets and are marked gridable. I ended up with the following list:

['fold_column', 'missing_values_handling', 'tweedie_link_power', 'fold_assignment', 'tweedie_variance_power', 'weights_column', 'alpha', 'lambda', 'offset_column']

However, out of the above list, the ones that are truly gridable are:

['missing_values_handling', 'tweedie_link_power', 'fold_assignment', 'tweedie_variance_power', 'alpha', 'lambda']

I talked to Michal and he agreed that the grid search and algorithm implementers need to sit down and hash out which parameters are truly gridable, and set them accordingly for each algorithm.

Given all these parameter lists, do we want to consolidate them somehow? Why are there so many?
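For reference, the filtering step described above can be sketched in plain Python. The structure of model._model_json["parameters"] is mocked here with a handful of entries (a list of dicts, each carrying a "name" and a "gridable" flag); those field names are an assumption based on this report, and reproducing the real structure would require a live H2O model.

```python
# Mocked stand-in for model._model_json["parameters"]: a few entries with
# the assumed "name"/"gridable" fields (not the full GLM parameter set).
model_json_parameters = [
    {"name": "alpha", "gridable": True},
    {"name": "lambda", "gridable": True},
    {"name": "fold_assignment", "gridable": True},
    {"name": "response_column", "gridable": True},
    {"name": "solver", "gridable": False},
]

# Mocked stand-in for model._parms.keys().
parms_keys = {"alpha", "lambda", "fold_assignment", "solver", "family"}

# Names flagged as gridable in the model JSON.
gridable_names = {p["name"] for p in model_json_parameters if p["gridable"]}

# Candidate hyperparameters: flagged gridable AND present in _parms.
candidates = sorted(gridable_names & parms_keys)
print(candidates)  # ['alpha', 'fold_assignment', 'lambda']
```

As the report shows, even this intersection is only a starting point: some of the surviving names (e.g. column-role parameters) still fail when placed in hyper_params, so the candidate list would need a further manual pass.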