
Request for inclusion: ML-Ensemble #20

Open flennerhag opened 7 years ago

flennerhag commented 7 years ago

Request for project inclusion in scikit-learn-contrib

jnothman commented 7 years ago

This is quite an impressive project. But it has quite a broad scope, and it is hard to evaluate the extent to which it adheres strictly to the interface specs. I am, for instance, a little concerned about how the evaluate, initialize, terminate, and preprocess public methods on Evaluator fit in, or why Evaluator really belongs in a package about ensembles. It appears mostly to be about efficiently specifying complex search spaces and perhaps efficiently searching them.

Overall it looks like you've tried quite hard to adhere to the API, and I'd like to play around with this some more. I'm +1 for inclusion.

jnothman commented 7 years ago

Also, your comment is welcome on our current contender for a stacked ensemble implementation at https://github.com/scikit-learn/scikit-learn/pull/8960

flennerhag commented 7 years ago

@jnothman thanks for the +1!

To address your concerns, the API consists of four modules: ensemble, model_selection, preprocessing, visualization.

Ensemble

The mlens.ensemble module is the main component and houses all ensemble estimator classes. These classes are designed to behave as Scikit-learn estimators, and the only break with the Scikit-learn API is at instantiation: whereas Scikit-learn requires an estimator to be fully initialized when the constructor is called, ML-Ensemble additionally requires the user to add at least one layer:

ens = Ensemble()
ens.add(list_of_estimators)  # Additional step: need to specify estimators 
ens.fit(X, y)

Other than that, an ensemble behaves just like any other Scikit-learn estimator.
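
For concreteness, a minimal end-to-end sketch (simplified, and signatures abbreviated; see the docs for exact constructor arguments):

from mlens.ensemble import SuperLearner
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=42)

ens = SuperLearner(folds=5)
ens.add([RandomForestClassifier(random_state=42), SVC()])  # base layer
ens.add_meta(LogisticRegression())                         # meta estimator
ens.fit(X, y)
predictions = ens.predict(X)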

Model Selection

mlens.model_selection contains the Evaluator class. This is essentially a sklearn.model_selection.RandomizedSearchCV that tunes several estimators over several preprocessing pipelines in one go. I found this to be a key object for tuning large ensembles, since fitting the entire ensemble repeatedly quickly becomes extremely expensive.
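
In rough strokes, usage looks something like this (sketch only; argument names are approximate):

from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from mlens.model_selection import Evaluator

X, y = make_classification(n_samples=500, random_state=42)

# Candidate estimators and per-estimator hyper-parameter distributions
ests = [('gnb', GaussianNB()), ('knn', KNeighborsClassifier())]
params = {'knn': {'n_neighbors': randint(2, 20)}}

# Preprocessing cases: every estimator is evaluated under each case
preprocessing = {'none': [], 'scaled': [StandardScaler()]}

evaluator = Evaluator(scorer=accuracy_score, cv=10, random_state=42)
evaluator.fit(X, y, estimators=ests, param_dicts=params,
              n_iter=10, preprocessing=preprocessing)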

In particular, it is really powerful for tuning deep layers or the meta estimator. The mlens.preprocessing.EnsembleTransformer class allows the user to treat the lower level(s) of the ensemble as a transformer that is fitted once (the preprocess method), after which several estimators (including further layers) can be tuned and compared on top of the lower level's output. Tuning the ensemble greedily in this fashion is orders of magnitude faster than naively running a grid search over the entire ensemble hyper-parameter space.
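
Schematically, again as a sketch with approximate signatures:

from scipy.stats import uniform
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC
from mlens.model_selection import Evaluator
from mlens.preprocessing import EnsembleTransformer

# Wrap the lower level as a transformer that is fitted once...
in_layer = EnsembleTransformer()
in_layer.add('stack', [('svc', SVC()), ('lr', LogisticRegression())])

# ...then tune candidate meta estimators on top of its output
meta_learners = [('meta_lr', LogisticRegression())]
params = {'meta_lr': {'C': uniform(0.01, 10)}}

evaluator = Evaluator(scorer=accuracy_score, cv=10)
evaluator.fit(X, y, estimators=meta_learners, param_dicts=params,
              n_iter=10, preprocessing={'ensemble': [in_layer]})  # X, y as above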

As for the API, the initialize and terminate methods are legacy and can be removed. The preprocess and evaluate methods can be incorporated into the fit method, either as inferred sub-calls (e.g. nothing to preprocess, so continue) or by introducing one or two parameters on fit.

Preprocess / Visualization

These two modules are mostly nice-to-haves that have survived since the first version of the package. They could be dispensed with if you have a strong preference for trimming the package.

For an idea of how the auxiliary modules are meant to support the ensembles, this Kaggle kernel I wrote as an introduction might be helpful.

Hope this clarifies things; otherwise, feel free to keep probing. I'd be happy to play around with the package!

flennerhag commented 7 years ago

And I'd be happy to take a look at the stacking PR when I get the chance.

GaelVaroquaux commented 7 years ago

the only break with the Scikit-learn API is at instantiation: whereas Scikit-learn requires an estimator to be fully initialized when the constructor is called, ML-Ensemble additionally requires the user to add at least one layer:

ens = Ensemble()
ens.add(list_of_estimators)  # Additional step: need to specify estimators
ens.fit(X, y)

That's quite a strong breakage. To me, ML-Ensemble does not implement the scikit-learn API.

Why not specify the list of estimators at construction time?

flennerhag commented 7 years ago

@GaelVaroquaux I understand your point of view, and for building a small ensemble it certainly would be possible to have an estimators argument in the constructor. Building multi-layer ensembles would then require a list of lists, or most likely dicts.

But once you want each layer to behave differently, specifying everything in one go becomes rather messy. You'd need something like a list or dictionary of layer-wise parameter dictionaries. For instance, to build a two-layer subsemble, compare

ens = Subsemble()
ens.add(ests_level_0, prep_level_0, partitions=4, folds=10, proba=True)
ens.add(ests_level_1, prep_level_1, partitions=2, folds=5, proba=False, propagate_features=[0, 1])
ens.add(meta_est, meta=True)

with specifying all in the constructor:

ens = Subsemble({'layer-1':
                        {'estimators': ests_level_0,
                         'preprocessing': prep_level_0,
                         'partitions': 4,
                         'folds': 10,
                         'proba': True
                         },
                     'layer-2':
                        {'estimators': ests_level_1,
                         'preprocessing': prep_level_1,
                         'partitions': 2,
                         'folds': 5,
                         'proba': False,
                         'propagate_features': [0, 1]
                        },
                     'meta':
                        {'estimators': meta_est,
                         'meta': True
                        }
                     }
                   )

To me at least, specifying everything in the constructor is less user-friendly and more liable to cause misspecification.

I might add that Scikit-learn's own contender for a stacking ensemble (see the link above) does not build the full ensemble in one constructor call either. Instead, each layer is constructed separately and then explicitly pipelined:

layer_0 = StackLayer(ests_1)
layer_1 = StackLayer(ests_2)
ensemble = make_pipeline(layer_0, layer_1, Est())

To give estimators different preprocessing pipelines, the user would additionally need to involve FeatureUnion mechanisms.
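
For instance, a single layer in which one base estimator gets its own scaling would look roughly like this (sketch only; StackLayer is the class proposed in that PR, not part of released scikit-learn):

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# StackLayer as proposed in scikit-learn/scikit-learn#8960
branch_a = make_pipeline(StandardScaler(), StackLayer([SVC()]))  # scaled branch
branch_b = StackLayer([RandomForestClassifier()])                # raw branch

layer_0 = make_union(branch_a, branch_b)
ensemble = make_pipeline(layer_0, LogisticRegression())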

Since ML-Ensemble is not meant to be part of Scikit-learn proper, and ML-Ensemble estimators consistently break the API in one predictable way, I'd personally prefer implementing an ensemble via the add method over cramming everything into the constructor call.

flennerhag commented 7 years ago

I suppose one alternative would be to have users specify specific layers and then pass them to an ensemble class, like so:

layer_1 = SubsembleLayer(ests, preps, *args)
layer_2 = SubsembleLayer(ests, preps, *args)
meta = MetaLayer(Est())
ensemble = Subsemble(layer_1, layer_2, meta)

Would that be preferable?

jnothman commented 7 years ago

Am I mistaken, or is there an undocumented constructor parameter for the layers? @GaelVaroquaux, this facilitates clone, so in a way you can think of .add as a specialised set_params. I find this design more user-friendly than scikit-learn's for composite estimators.

flennerhag commented 7 years ago

Yes, that's exactly correct.

An ensemble is instantiated with a layers parameter that is None by default. The add method then does

if not self.layers:
    self.layers = LayerContainer()  # lazily create the container on first add

self.layers.add(layer)

so the ensembles are safe to clone (cloning is unit tested).
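
A stripped-down sketch of the mechanics (simplified class bodies, not the actual implementation):

from sklearn.base import BaseEstimator, clone

class LayerContainer:
    """Minimal stand-in for the real container."""
    def __init__(self):
        self.layers = []

    def add(self, layer):
        self.layers.append(layer)

class Ensemble(BaseEstimator):
    def __init__(self, layers=None):
        self.layers = layers  # constructor param, so clone() round-trips it

    def add(self, layer):
        if not self.layers:
            self.layers = LayerContainer()
        self.layers.add(layer)
        return self

ens = Ensemble()
ens.add(['est_a', 'est_b'])
ens_clone = clone(ens)  # works: layers is exposed via get_params()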

caioaao commented 6 years ago

@flennerhag Correct me if I'm wrong, but one would mostly benefit from your library when iterating on a model, right? If that's the case, I think you could focus on your own API instead of scikit-learn's and build adapters for transforming to and from scikit-learn pipelines (transforming layers into FeatureUnions and stacked ensembles into a Pipeline; probably using what I did in scikit-learn/scikit-learn#8960 would be even better :grimacing:). The user would only need to transform it into a scikit-learn class after some iterations.

This way, you'd have a better separation of concerns while still leveraging most of the benefits of scikit-learn's API, and you'd be able to evolve this library without needing to worry about scikit-learn's standards. Best of both worlds :)
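
As a rough sketch of what such an adapter could produce (all names hypothetical; this naive version also skips the out-of-fold predictions a proper stacker needs):

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion, Pipeline

class PredictionTransformer(BaseEstimator, TransformerMixin):
    """Expose an estimator's predictions as transform() output."""
    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y=None):
        self.estimator.fit(X, y)
        return self

    def transform(self, X):
        return np.asarray(self.estimator.predict(X)).reshape(-1, 1)

def layers_to_pipeline(layers, meta_estimator):
    """Turn a list of estimator lists into a Pipeline of FeatureUnions."""
    steps = []
    for i, layer in enumerate(layers):
        union = FeatureUnion([(f'est_{i}_{j}', PredictionTransformer(est))
                              for j, est in enumerate(layer)])
        steps.append((f'layer_{i}', union))
    steps.append(('meta', meta_estimator))
    return Pipeline(steps)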

Also, congrats on the project. Looks great!

chkoar commented 4 years ago

The package is impressive. Since estimators are clonable, I am +1 for inclusion.