H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
A list of cleanup/improvement tasks for the Py client to improve code quality, modularity and maintainability.
method {{convert_H2OXGBoostParams_2_XGBoostParams}} should not be in {{estimator_base.py}} but most probably in {{xgboost.py}}. There it could advantageously be renamed (keeping current name as alias for backwards compatibility) to {{as_native_xgboost_params}}.
Following the same logic, {{H2OFrame}} method {{convert_H2OFrame_2_DMatrix}} could be advantageously renamed to {{as_dmatrix}}… repeating {{H2OFrame}} in the method name is just redundant.
As a general rule, we can wonder if it’s the best choice to implement those converters as methods: these are mainly integration features that require external dependencies. It would probably be better to group all those integrations into utility packages instead. Those 2 methods explicitly aim at native xgboost integration, and should therefore be in a package named for example {{h2o.support.xgboost}}.
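A minimal sketch of the rename-with-alias idea, assuming the converter becomes a plain function in a utility module. The {{deprecated_alias}} helper and the parameter mapping are hypothetical and only illustrate the pattern, they are not the actual H2O code:

```python
import functools
import warnings


def deprecated_alias(new_func, old_name):
    """Return a wrapper that emits a DeprecationWarning, then calls new_func."""
    @functools.wraps(new_func)
    def wrapper(*args, **kwargs):
        warnings.warn(
            "{} is deprecated, use {} instead".format(old_name, new_func.__name__),
            DeprecationWarning,
            stacklevel=2,
        )
        return new_func(*args, **kwargs)
    return wrapper


def as_native_xgboost_params(h2o_params):
    """Hypothetical converter from H2O-style to native xgboost parameter names."""
    # Illustrative mapping only (assumption), not the real conversion table.
    renames = {"learn_rate": "eta"}
    return {renames.get(name, name): value for name, value in h2o_params.items()}


# The old name stays available, so existing user code keeps working.
convert_H2OXGBoostParams_2_XGBoostParams = deprecated_alias(
    as_native_xgboost_params, "convert_H2OXGBoostParams_2_XGBoostParams"
)
```

Callers of the old name get the same result plus a warning, which leaves room to remove the alias in a later release.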
there are still some references to specific algos in {{estimator_base.py}}: those should preferably be replaced by a declarative approach (class attributes or class methods, for example {{_algofeatures = ['verbosity']}}, {{_algocategory = 'supervised'}}).
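The declarative approach could look roughly like the sketch below. The class names, the {{supports}} helper and {{train_options}} are stand-ins invented for illustration; only {{_algofeatures}}, {{_algocategory}} and {{algo}} come from the proposal above:

```python
class EstimatorBase(object):
    """Stand-in for the real base class; method names here are hypothetical."""
    algo = None
    _algofeatures = ()     # optional capabilities declared by each algo
    _algocategory = None   # e.g. 'supervised' or 'unsupervised'

    @classmethod
    def supports(cls, feature):
        return feature in cls._algofeatures

    def train_options(self, verbose=False):
        opts = {}
        # Generic capability check instead of `if self.algo == "xgboost": ...`
        if verbose and self.supports("verbosity"):
            opts["verbosity"] = "info"
        return opts


class XGBoostEstimator(EstimatorBase):
    algo = "xgboost"
    _algofeatures = ("verbosity",)
    _algocategory = "supervised"


class KMeansEstimator(EstimatorBase):
    algo = "kmeans"
    _algocategory = "unsupervised"
```

With this shape the base class never names a concrete algo: adding a capability to another estimator only means editing that estimator's class attributes.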
{{_compute_algo}} method in {{estimator_base.py}} is apparently unused: verify and delete. If used, replace with usage of class attribute {{algo}}.
plots/visualizations are scattered all over the place ({{model_base}}, {{metrics_base}}, …), not always configurable (colors, sizes, font style), with hardcoded logic for specific algos: would rather have a dedicated {{h2o.plotting}} package encapsulating all this logic. For backwards compatibility, existing plotting functions could delegate plots to this new package.
most of our {{__repr__}} implementations call {{self.show()}}, which itself prints to stdout. This is pure evil!
{{repr()}} should RETURN a parsable/formal string representation of the current object, not print anything.
In a similar way, {{__str__}} should not print anything either, as {{str()}} should return a pretty/informal representation of the object.
We MUST reserve calls to {{print()}} to our {{summary()}} and {{show()}} methods.
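The intended contract can be sketched as follows. {{ModelSummary}} is a hypothetical class used only to illustrate the split between returning strings and printing:

```python
class ModelSummary(object):
    """Hypothetical object, not an actual H2O class."""

    def __init__(self, algo, ntrees):
        self.algo = algo
        self.ntrees = ntrees

    def __repr__(self):
        # Formal representation: returned, never printed.
        return "ModelSummary(algo={!r}, ntrees={!r})".format(self.algo, self.ntrees)

    def __str__(self):
        # Informal representation: also returned, never printed.
        return "{} model with {} trees".format(self.algo, self.ntrees)

    def show(self):
        # Printing is reserved for explicit display methods.
        print(str(self))
```

Here {{repr(ModelSummary("gbm", 50))}} returns a string that can be passed back to {{eval()}}, and only {{show()}} writes to stdout.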
modularity of the client is getting worse, we should be careful there: monolithic root module {{h2o}} where anything goes, meaningless packages with one function ({{persist}}), {{h2o.utils}} package that contains custom logic ({{distributions}}), modules like {{targetencoder}}, {{cross_validation}} at root level… it’s almost impossible to identify any structure in our Python package layout.
split {{h2o}} into meaningful submodules and just import functions there.
instead of {{persist}}, couldn’t we have a featured {{admin}} or {{remote}} package, dedicated to users using the Py client as a “pure” client (backend running remotely)?
custom metrics, custom distributions…, which are Py extensions intended to run on the backend as Jython scripts, should be in a dedicated package, not in {{h2o.utils}}…
package {{h2o.utils.csv}} should more accurately be named {{h2o.fixes.csv}} or {{h2o.compat.csv}}.
{{h2o.utils.shared_utils}} is evil… contains a bunch of totally unrelated stuff, from time utility functions to mojo helpers, wtf?
Verify possible existence of tasks for those concerns, and create one if none available yet.
Assign all as sub-tasks.