Announce changes in API to H2O Community

Hello H2O community!

There are many new changes in the H2O ecosystem, and we are working furiously to publish and share these with the community.

In this context, we are preparing a new H2O release 3.12 with amazing features (e.g., AutoML, XGBoost support). We are also planning some changes that can affect existing code bases. This email is meant to inform you and start discussions about them.

The changes include:

- Migration from Java 6 to Java 7
- Modularization of the code base using the Java Service Provider Interface (SPI) instead of the reflections library
- Improvement of the Stacked Ensemble API
- New feature: Automatic Machine Learning (AutoML)

h2. Migration from Java 6 to Java 7

h3. Motivation

Public support for Java 6 ended in [February 2013|http://www.oracle.com/technetwork/java/eol-135779.html]
Lack of Java 6 compatible libraries (e.g., Jetty)
Security concerns with using old libraries to keep compatibility with Java 6

h3. Actions

We will remove Java 6 support from the H2O build chain including:
- removal of artifact byte code rewriting from Java 7 to Java 6
- upgrading Animal Sniffer signature to Java 7
We are going to publish only Java 7 compatible binary artifacts to Maven Central.
We are going to use only Java 7 compatible syntax in our source code base. The only exception is the h2o-genmodel module, which we will try to keep close to Java 6 syntax.

h3. Impact of change

If your stack runs on a Java 6 JVM (e.g., an old Hadoop distribution or proprietary tools), H2O will stop working. Please let us know!

h3. Preview of changes

The change is implemented in [PR-835|https://github.com/h2oai/h2o-3/pull/835]
The JIRA epic number is [PUBDEV-4049|https://0xdata.atlassian.net/browse/PUBDEV-4049]

h2. Modularization of code base

h3. Motivation

We would like to provide a more flexible system for extending H2O and plugging new tools into the H2O platform (e.g., XGBoost, TensorFlow, Sparkling Water).
The current code base uses the [reflections library|https://github.com/ronmamo/reflections] to look up optional components. However, this brings several issues, including:
- a restriction on the package names an extension may use (only water and hex are allowed)
- a forced traversal of the full classpath, which causes problems in systems with dynamic classloaders (e.g., Spark executors).

h3. Actions

We will stop using the reflections library to find instances of water.AbstractH2OExtension, water.api.AbstractRegister, and water.api.Schema.
The extensions (that is, the classes listed in the previous point) will be registered using the [Java Service Provider Interface|https://docs.oracle.com/javase/tutorial/sound/SPI-intro.html]. In short, the concept relies on service files located in the META-INF/services directory. Each service file is named after the service class it serves (e.g., water.AbstractH2OExtension) and contains a list of the classes that extend that service class. For example, for the core H2O REST API we have a single file, h2o-core/src/main/resources/META-INF/services/water.api.RestApiExtension, which lists the 3 REST API extensions implementing the interface water.api.RestApiExtension:
{code}
water.api.RegisterResourceRoots
water.api.RegisterV3Api
water.api.RegisterV4Api
{code}
We provide a capabilities REST endpoint that lists the registered core extensions, REST API extensions, and parsers (WIP).
In the scope of the H2O source code, we provide the optional @AutoService annotation for registering extensions (see [documentation|https://github.com/google/auto/tree/master/service]).
We do not modularize the R/Python/Flow clients. Each client is responsible for configuring itself based on information provided by the backend, and must fail gracefully if the user invokes an operation that the backend does not provide.
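
The client-side behaviour described above could look roughly like the following Python sketch. The Client class, the UnsupportedOperationError exception, and the shape of the capabilities payload are all assumptions made for illustration; the actual endpoint response and the real R/Python/Flow client code may differ.

```python
class UnsupportedOperationError(RuntimeError):
    """Raised when the backend does not advertise a requested feature."""


class Client:
    """Hypothetical client that configures itself from backend capabilities."""

    def __init__(self, capabilities):
        # `capabilities` stands in for the parsed response of the
        # capabilities endpoint; its real structure is an assumption here.
        self._available = {entry["name"] for entry in capabilities}

    def supports(self, name):
        return name in self._available

    def require(self, name):
        # Fail gracefully with a readable message instead of sending
        # a request the backend cannot serve.
        if not self.supports(name):
            raise UnsupportedOperationError(
                "Backend does not provide '%s'; available: %s"
                % (name, ", ".join(sorted(self._available)) or "none")
            )


# Made-up payload: a backend that ships XGBoost and AutoML only.
capabilities = [{"name": "XGBoost"}, {"name": "AutoML"}]
client = Client(capabilities)

client.require("XGBoost")  # passes silently
try:
    client.require("TensorFlow")
except UnsupportedOperationError as err:
    message = str(err)
```

The point of the sketch is that the check happens on the client before any operation is attempted, so a missing extension produces a clear error rather than an obscure backend failure.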

Note: the same concept is already used in H2O to register parsers and Rapids extensions.

h3. Impact of change

Code that registers new REST API calls by extending the water.api.AbstractRegister class will need to be updated by adding a service file as described above.
Each class extending water.api.Schema needs to be registered as well, in the water.api.Schema service file.
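
As a rough illustration of what "adding a service file" involves, the Python sketch below writes a provider entry into META-INF/services/water.api.Schema under a resources directory. The helper register_providers and the class names com.example.MySchemaV3 and com.example.OtherSchemaV3 are hypothetical; in a real project the file would normally be created by hand or generated at build time.

```python
import tempfile
from pathlib import Path


def register_providers(resources_dir, service, providers):
    """Add provider class names to META-INF/services/<service>.

    A service file is plain UTF-8 text with one fully qualified
    class name per line.
    """
    service_file = Path(resources_dir) / "META-INF" / "services" / service
    service_file.parent.mkdir(parents=True, exist_ok=True)

    existing = []
    if service_file.exists():
        text = service_file.read_text(encoding="utf-8")
        existing = [line.strip() for line in text.splitlines() if line.strip()]

    # Keep existing entries and append only the new ones.
    merged = existing + [p for p in providers if p not in existing]
    service_file.write_text("\n".join(merged) + "\n", encoding="utf-8")
    return service_file


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical schema classes from a third-party extension.
    path = register_providers(tmp, "water.api.Schema", ["com.example.MySchemaV3"])
    register_providers(
        tmp, "water.api.Schema",
        ["com.example.MySchemaV3", "com.example.OtherSchemaV3"],
    )
    relative_path = path.relative_to(tmp).as_posix()
    contents = path.read_text(encoding="utf-8")
```

The resulting file sits at META-INF/services/water.api.Schema relative to the resources root and has the same one-class-per-line layout as the core REST API service file.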

h3. Preview of changes

The change is implemented in [PR-915|https://github.com/h2oai/h2o-3/pull/915]
The JIRA epic number is PUBDEV-4271

h2. Improvement of Stacked Ensemble API

h3. Motivation

The Stacked Ensembles API has some unnecessary arguments that should be deprecated. In particular, the current API requires the training set even though the algorithm doesn't actually need it. It needs only a single-column response frame, so we are requiring the user to load more data into memory than is necessary to train the ensemble.

h3. Actions

Overview of the current arguments and what should be done about each one:

x: Not needed and not being used at all (should be removed).
y: Only required if we keep training_frame. We need the response column data to train the metalearner, and when we are given the whole training frame, y is the only way to identify the response column.
training_frame: The whole training frame is not required; we only need the response column. This would be replaced with response_column.
model_id: This is the id for the "Stacked Ensemble" model.
validation_frame: Keep as is.
base_models: Keep as is, but we need to relax the restrictions on these models.
selection_strategy: This doesn't do anything, so we should remove it from the R/Python API until it does something other than use all the models each time.

Proposed API (R example):

{code}
h2o.stackedEnsemble(base_models, response_frame, validation_frame = NULL, model_id = NULL, ...)
{code}

h3. Impact of change

Old code will still work. We will add the ... in R and kwargs in Python to handle (and properly map) the extra arguments automatically.
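
One way the kwargs mapping could work on the Python side is sketched below. This is not the actual H2O Python signature: the function stacked_ensemble, its return value, and the use of plain dicts as stand-ins for frames are assumptions made purely to show how old-style arguments might be accepted and mapped.

```python
import warnings


def stacked_ensemble(base_models, response_frame=None, validation_frame=None,
                     model_id=None, **kwargs):
    """Hypothetical wrapper that accepts both new and old-style arguments."""
    # Old-style call: derive the response frame from training_frame[y].
    training_frame = kwargs.pop("training_frame", None)
    y = kwargs.pop("y", None)
    if response_frame is None and training_frame is not None:
        if y is None:
            raise ValueError("y is required when training_frame is given")
        warnings.warn("training_frame/y are deprecated; pass response_frame",
                      DeprecationWarning, stacklevel=2)
        response_frame = {y: training_frame[y]}

    # Arguments that are accepted for compatibility but no longer used.
    for name in ("x", "selection_strategy"):
        if name in kwargs:
            kwargs.pop(name)
            warnings.warn("%s is ignored" % name, DeprecationWarning,
                          stacklevel=2)

    if kwargs:
        raise TypeError("unexpected arguments: %s" % ", ".join(sorted(kwargs)))
    if response_frame is None:
        raise ValueError("response_frame is required")

    # Return the normalized arguments instead of training anything.
    return {"base_models": list(base_models), "response_frame": response_frame,
            "validation_frame": validation_frame, "model_id": model_id}


# Made-up data: a dict of columns stands in for a frame.
frame = {"x1": [0.1, 0.7], "label": [0, 1]}
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    old_call = stacked_ensemble(["gbm_1", "rf_1"], training_frame=frame,
                                y="label", x=["x1"])
new_call = stacked_ensemble(["gbm_1", "rf_1"],
                            response_frame={"label": [0, 1]})
```

In this sketch both calls normalize to the same arguments, and the old-style call additionally emits deprecation warnings so users can migrate at their own pace.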

h3. Preview of changes

The JIRA ticket is [PUBDEV-4240]

h2. New feature: Automatic Machine Learning (AutoML)

h3. Motivation

We have designed an easy-to-use interface which automates the process of training a large selection of candidate models, and also creating ensembles of these models.
H2O’s AutoML provides a simple wrapper function that performs a large number of modeling-related tasks (which would typically require many lines of code).

h3. Actions

We added new functions to R and Python to enable AutoML in H2O.
The AutoML object includes a leaderboard: models are ranked by a specific model performance metric.
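
To make the leaderboard idea concrete, here is a minimal plain-Python sketch of ranking models by a chosen metric. It does not use the H2O API; the leaderboard function, the model ids, and the metric values are invented for illustration.

```python
def leaderboard(models, metric, higher_is_better=True):
    """Return the models sorted best-first by the chosen metric."""
    return sorted(models, key=lambda m: m[metric], reverse=higher_is_better)


# Made-up results for three candidate models.
models = [
    {"model_id": "gbm_1", "auc": 0.91, "logloss": 0.31},
    {"model_id": "glm_1", "auc": 0.84, "logloss": 0.42},
    {"model_id": "ensemble_1", "auc": 0.93, "logloss": 0.28},
]

# AUC is better when higher, logloss is better when lower.
by_auc = [m["model_id"] for m in leaderboard(models, "auc")]
by_logloss = [m["model_id"]
              for m in leaderboard(models, "logloss", higher_is_better=False)]
```

Note that the sort direction depends on the metric, which is why the sketch takes it as an explicit parameter.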

h3. Impact of change

This is a new feature, so it doesn't impact your code.
Hopefully, this saves users a lot of time in their modeling and ensembling efforts.

h3. Preview of changes

AutoML is already in master and you can download the nightly release [here|http://h2o-release.s3.amazonaws.com/h2o/master/latest.html].
The R and Python APIs are still in flux until the 3.12 release.
All outstanding AutoML proposed changes or new features are documented in JIRA tickets [here|https://0xdata.atlassian.net/issues/?filter=20700].
