H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
Hello H2O community!
There are many new changes in the H2O ecosystem, and we are working furiously to publish and share these with the community.
In this context, we are preparing a new H2O release 3.12 with amazing features (e.g., AutoML, XGBoost support). We are also planning some changes that can affect existing code bases. This email is meant to inform you and start discussions about them.
The changes include:
* Migration from Java 6 to Java 7
* Modularization of the code base with the help of the Java Service Provider Interface (SPI) instead of the reflections library
* Improvement of the Stacked Ensemble API
* New feature: Automatic Machine Learning (AutoML)
h2. Migration from Java 6 to Java 7
h3. Motivation
Keeping compatibility with Java 6 forces us to use old libraries, which raises security concerns.
h3. Actions
* We will remove Java 6 support from the H2O build chain, including:
** removal of artifact bytecode rewriting from Java 7 to Java 6
** upgrading the Animal Sniffer signature to Java 7
* We are going to publish only Java 7 compatible binary artifacts to Maven Central.
* We are going to use only Java 7 compatible syntax in our source code base. The only exception is the {{h2o-genmodel}} module, which we will try to keep close to Java 6 syntax.
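To give a sense of what "Java 7 compatible syntax" can mean in practice, here is a small illustrative sketch. It is not taken from the H2O code base, and the class and method names are made up for this example. It uses constructs that a Java 6 compiler rejects (diamond operator, try-with-resources, switch on strings, underscores in numeric literals, multi-catch), so code meant to stay close to Java 6 syntax would have to avoid them.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class Java7Syntax {
    /** Parses one size per line; "small" and "large" are hypothetical aliases. */
    public static List<Integer> parseSizes(String text) {
        List<Integer> sizes = new ArrayList<>();              // diamond operator
        try (BufferedReader in =
                 new BufferedReader(new StringReader(text))) { // try-with-resources
            String line;
            while ((line = in.readLine()) != null) {
                switch (line.trim()) {                        // switch on String
                    case "small":
                        sizes.add(1_000);                     // underscores in literals
                        break;
                    case "large":
                        sizes.add(1_000_000);
                        break;
                    default:
                        sizes.add(Integer.parseInt(line.trim()));
                }
            }
        } catch (IOException | NumberFormatException e) {     // multi-catch
            throw new IllegalArgumentException("bad input: " + text, e);
        }
        return sizes;
    }

    public static void main(String[] args) {
        System.out.println(parseSizes("small\n42\nlarge"));
    }
}
```

Compiling this file with a Java 6 source level fails on each of the commented constructs, which is the kind of breakage a Java 6 user of the main code base should expect after the migration.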
h3. Impact of change
If your stack is running on top of a Java 6 JVM (e.g., an old Hadoop distribution or proprietary tools), then H2O will stop working.
If this affects you, please let us know!
h2. Modularization of code base
h3. Motivation
We would like to provide a more flexible system to extend H2O and plug new tools into the H2O platform (e.g., XGBoost, TensorFlow, Sparkling Water).
The current code base uses the [reflections library|https://github.com/ronmamo/reflections] to look up optional components; however, this brings several issues, including:
* extensions are limited in the package names they can use (only {{water}} and {{hex}} are allowed)
* it forces traversal of the full classpath, which causes problems in systems with dynamic classloaders (e.g., Spark executors)
h3. Actions
We will remove usage of the reflections library to find instances of {{water.AbstractH2OExtension}}, {{water.api.AbstractRegister}}, and {{water.api.Schema}}.
The extensions (meaning the classes listed in the previous point) will be registered using the [Java Service Provider Interface|https://docs.oracle.com/javase/tutorial/sound/SPI-intro.html]. In short, the concept relies on service files located in the {{META-INF/services}} directory. Each service file is named after the class it extends (e.g., {{water.AbstractH2OExtension}}) and contains a list of classes that extend the service class. For example, for the core H2O REST API we have a single file, {{h2o-core/src/main/resources/META-INF/services/water.api.RestApiExtension}}, which lists 3 REST API extensions implementing the {{water.api.RestApiExtension}} interface:
{code}
water.api.RegisterResourceRoots
water.api.RegisterV3Api
water.api.RegisterV4Api
{code}
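The lookup side of this mechanism can be sketched with {{java.util.ServiceLoader}}. The following is a minimal, self-contained illustration and not H2O's actual code: the interface and the two provider classes are hypothetical stand-ins nested in a demo class, and the service file is written to a temporary directory at run time only so that the example runs from a single source file. In a real module the service file would ship under {{META-INF/services}} inside the jar.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.ServiceLoader;

public class SpiSketch {
    /** Hypothetical stand-in for a service class such as water.api.RestApiExtension. */
    public interface RestApiExtension {
        String name();
    }

    /** Hypothetical providers; each needs a public no-argument constructor. */
    public static class RegisterV3Api implements RestApiExtension {
        @Override public String name() { return "V3"; }
    }

    public static class RegisterV4Api implements RestApiExtension {
        @Override public String name() { return "V4"; }
    }

    /** Writes a service file to a temp directory and returns the names of the providers found. */
    public static List<String> discover() {
        List<String> names = new ArrayList<>();
        try {
            Path root = Files.createTempDirectory("spi-sketch");
            Path services = Files.createDirectories(root.resolve("META-INF").resolve("services"));
            // The service file is named after the service class and lists one provider per line.
            Files.write(services.resolve(RestApiExtension.class.getName()),
                    Arrays.asList(RegisterV3Api.class.getName(), RegisterV4Api.class.getName()),
                    StandardCharsets.UTF_8);
            // The extra class loader only makes the temp directory visible to the lookup;
            // the provider classes themselves are loaded through the parent loader.
            try (URLClassLoader loader = new URLClassLoader(
                    new URL[] { root.toUri().toURL() }, SpiSketch.class.getClassLoader())) {
                for (RestApiExtension ext : ServiceLoader.load(RestApiExtension.class, loader)) {
                    names.add(ext.name());
                }
            }
        } catch (IOException e) {
            throw new IllegalStateException("could not set up the demo service file", e);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(discover());
    }
}
```

No package scanning is involved: the loader reads only the named service file, so providers can live in any package, and the full classpath does not have to be traversed.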
We provide a capabilities REST endpoint that lists the registered core extensions, REST API extensions, and parsers (WIP).
In the scope of the H2O source code, we provide an optional {{@AutoService}} annotation to register extensions (see [documentation|https://github.com/google/auto/tree/master/service]).
We do not modularize the R/Python/Flow clients. Each client is responsible for configuring itself based on information provided by the backend, and fails gracefully if the user invokes an operation that the backend does not provide.
Note: the same concept is already used in H2O to register parsers and Rapids extensions.
h3. Impact of change
Code that registers new REST API calls by extending the {{water.api.AbstractRegister}} class will need to be updated by adding a service file as described above.
Each class extending {{water.api.Schema}} needs to be registered as well, in the {{water.api.Schema}} service file.
h2. Improvement of Stacked Ensemble API
h3. Motivation
We have some unnecessary arguments in the Stacked Ensembles API that should be deprecated. In particular, the current API requires the training set, even though the algorithm doesn't actually need it. It only needs a single-column response frame, so we are requiring the user to load more data into memory than is needed to train the ensemble.
h3. Actions
Overview of the current arguments and what should be done about each one:
* {{x}}: Not needed and not used at all (should be removed).
* {{y}}: Only required if we keep {{training_frame}}. We need the response column data to train the metalearner, and if we have the whole training frame, {{y}} is the only way to identify the response column.
* {{training_frame}}: Having the whole training frame is not required; we only need the response column. This would be replaced with {{response_column}}.
* {{model_id}}: This is the ID for the "Stacked Ensemble" model.
* {{validation_frame}}: Keep as is.
* {{base_models}}: Keep as is, but we need to relax the restrictions on these models.
* {{selection_strategy}}: This doesn't do anything, so we should remove it from the R/Python API until it does something other than use all the models each time.
Proposed API (R example):
{code}
h2o.stackedEnsemble(base_models, response_frame, validation_frame = NULL, model_id = NULL, ...)
{code}
h3. Impact of change
Old code will still work. We will add {{...}} in R and {{kwargs}} in Python to handle (and properly map) the extra arguments automatically.
h3. Preview of changes
The JIRA is [PUBDEV-4240]
h2. New feature: Automatic Machine Learning (AutoML)
h3. Motivation
We have designed an easy-to-use interface which automates the process of training a large selection of candidate models, and also creating ensembles of these models.
H2O’s AutoML provides a simple wrapper function that performs a large number of modeling-related tasks (which would typically require many lines of code).
h3. Actions
Added new functions to R and Python to enable AutoML in H2O.
The AutoML object includes a leaderboard: models are ranked by a specific model performance metric.
h3. Impact of change
This is a new feature, so it doesn't impact your code.
Hopefully, this saves users a lot of time in their modeling and ensembling efforts.