flennerhag opened this issue 7 years ago
This is quite an impressive project. But it has quite a broad scope, and it is hard to evaluate the extent to which it adheres strictly to interface specs. I am, for instance, a little concerned about how the `evaluate`, `initialize`, `terminate`, and `preprocess` public methods on `Evaluator` fit in, or why `Evaluator` really belongs in a package about ensembles. It appears to mostly be about efficiently specifying complex search spaces and perhaps efficiently searching them.
Overall it looks like you've tried quite hard to adhere to the API, and I'd like to play around with this some more. I'm +1 for inclusion.
Also, your comment is welcome on our current contender for a stacked ensemble implementation at https://github.com/scikit-learn/scikit-learn/pull/8960
@jnothman thanks for the +1!
To address your concerns, the API consists of four modules: `ensemble`, `model_selection`, `preprocessing`, and `visualization`.
The `mlens.ensemble` module is the main component and houses all ensemble estimator classes. These classes are designed to behave as Scikit-learn estimators, and the only break with the Scikit-learn API is instantiation: whereas Scikit-learn requires an estimator to be fully initialized upon calling the constructor, ML-Ensemble requires the user to additionally add at least one layer:
```python
ens = Ensemble()
ens.add(list_of_estimators)  # Additional step: need to specify estimators
ens.fit(X, y)
```
Other than that, an ensemble behaves just like any other Scikit-learn estimator.
`mlens.model_selection` contains the `Evaluator` class. This is basically a type of `sklearn.model_selection.RandomizedSearchCV` that tunes several estimators over several preprocessing pipelines in one go. I found this to be a key object for tuning large ensembles, since repeatedly fitting the entire ensemble quickly becomes extremely expensive.

In particular, it is really powerful for tuning deep layers or the meta estimator. The `mlens.preprocessing.EnsembleTransformer` class allows the user to treat the lower level(s) of the ensemble as a transformer that is fitted once (the `preprocess` method), after which several estimators (including further layers) can be tuned and compared on top of the lower level(s)' output. Tuning the ensemble greedily in this fashion is orders of magnitude faster than naively running a grid search over the entire ensemble hyper-parameter space.
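To illustrate the greedy idea with plain scikit-learn (a sketch of the principle, not ML-Ensemble's actual API): fit the lower layer once to produce out-of-fold meta features, then tune only the meta estimator on that fixed output.

```python
# Sketch of greedy ensemble tuning using only scikit-learn primitives.
# This illustrates the idea behind EnsembleTransformer; it is not mlens code.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_predict

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Fit the lower layer once: out-of-fold predictions become meta features.
base = RandomForestClassifier(n_estimators=25, random_state=0)
meta_features = cross_val_predict(base, X, y, cv=5, method="predict_proba")

# Tune only the meta estimator on the fixed meta features -- cheap compared
# to refitting the whole ensemble for every candidate.
search = GridSearchCV(LogisticRegression(), {"C": [0.1, 1.0, 10.0]}, cv=5)
search.fit(meta_features, y)
```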
As for the API, the `initialize` and `terminate` methods are legacy methods that can be removed. The `preprocess` and `evaluate` methods can be incorporated into the `fit` method, either as inferred sub-calls (e.g. nothing to preprocess, continue) or by introducing one or two parameters to the `fit` method.
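The "inferred sub-call" option might look roughly like this (a hypothetical sketch with illustrative names, not mlens's actual implementation):

```python
class Evaluator:
    """Hypothetical sketch: fold preprocess/evaluate into a single fit call."""

    def __init__(self):
        self.calls = []  # record which sub-calls ran, for illustration

    def _preprocess(self, X, pipelines):
        self.calls.append("preprocess")
        return X  # the preprocessing pipelines would be applied here

    def _evaluate(self, X, y):
        self.calls.append("evaluate")
        return self

    def fit(self, X, y, preprocessing=None):
        # Inferred sub-call: nothing to preprocess -> skip straight to evaluation.
        if preprocessing is not None:
            X = self._preprocess(X, preprocessing)
        return self._evaluate(X, y)
```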
These two modules are mostly nice-to-haves that have survived since the first version of the package. They could be dispensed with if you have a strong preference for trimming the package.

For an idea of how the auxiliary modules are meant to support the ensembles, this Kaggle kernel I wrote as an introduction might be helpful.

Hope this clarifies things; feel free to keep probing otherwise. I'd be happy to play around with the package!
And I'd be happy to take a look at the stacking PR when I get the chance.
> the only break with the Scikit-learn API is the instantiation part: whereas Scikit-learn requires an estimator to be fully initiated upon calling the constructor, ML-Ensemble requires the user to additionally add at least one layer:
>
> ```python
> ens = Ensemble()
> ens.add(list_of_estimators)  # Additional step: need to specify estimators
> ens.fit(X, y)
> ```
That's quite a strong breakage. To me, ML-Ensemble does not implement the scikit-learn API.
Why not specify the list of estimators at construction time?
@GaelVaroquaux I understand your point of view, and for building a small ensemble it would certainly be possible to have an `estimators` argument in the constructor. Building multi-layer ensembles would then require a list of lists, or more likely dicts.

But once you want each layer to behave differently, specifying everything in one go becomes rather messy. For this, you'd need something like a list or dictionary of layer-wise parameter dictionaries. For instance, to build a two-layer subsemble, compare
```python
ens = Subsemble()
ens.add(ests_level_0, prep_level_0, partitions=4, folds=10, proba=True)
ens.add(ests_level_1, prep_level_1, partitions=2, folds=5, proba=False,
        propagate_features=[0, 1])
ens.add(meta_est, meta=True)
```
with specifying all in the constructor:
```python
ens = Subsemble(
    {'layer-1': {'estimators': ests_level_0,
                 'preprocessing': prep_level_0,
                 'partitions': 4,
                 'folds': 10,
                 'proba': True},
     'layer-2': {'estimators': ests_level_1,
                 'preprocessing': prep_level_1,
                 'partitions': 2,
                 'folds': 5,
                 'proba': False,
                 'propagate_features': [0, 1]},
     'meta': {'estimators': meta_est,
              'meta': True}}
)
```
To me at least, specifying everything in the constructor is less user-friendly and more liable to cause misspecification.

I might add that Scikit-learn's own contender for a stacking ensemble (see the link above) does not build a full ensemble in one constructor call either. Instead, each layer is constructed separately and then explicitly pipelined:
```python
layer_0 = StackLayer(ests_1)
layer_1 = StackLayer(ests_2)
ensemble = make_pipeline(layer_0, layer_1, Est())
```
To do differential pipelines, the user would additionally need to involve `FeatureUnion` mechanisms.
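For illustration, per-branch preprocessing in plain scikit-learn looks roughly like this (a sketch using standard estimators; `StackLayer` from the PR is not involved):

```python
# Sketch: differential preprocessing via FeatureUnion + Pipeline in scikit-learn.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

union = FeatureUnion([
    # Branch 1: standardize, then project down to two components.
    ("scaled_pca", Pipeline([("scale", StandardScaler()),
                             ("pca", PCA(n_components=2))])),
    # Branch 2: min-max scale the raw features.
    ("minmax", MinMaxScaler()),
])

X = np.random.RandomState(0).rand(10, 4)
Z = union.fit_transform(X)  # 2 PCA components + 4 scaled features
```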
Since ML-Ensemble is not meant to be part of Scikit-learn proper, and ML-Ensemble estimators consistently break the API in one predictable way, I'd personally prefer implementing an ensemble via the `add` method to cramming everything into the constructor call.
I suppose one alternative would be to have users specify specific layers and then pass them to an ensemble class, like so:
```python
layer_1 = SubsembleLayer(ests, preps, *args)
layer_2 = SubsembleLayer(ests, preps, *args)
meta = MetaLayer(Est())
ensemble = Subsemble(layer_1, layer_2, meta)
```
Would that be preferable?
Am I mistaken, or is there an undocumented constructor parameter for the layers? @GaelVaroquaux, this facilitates `clone`, so in a way you can think of `.add` as a specialised `set_params`. I find this design more user-friendly than scikit-learn's for composite estimators.
Yes, that's exactly correct.
An ensemble is instantiated with a `layers` parameter that is `None` by default. The `add` method then does
```python
if not self.layers:
    self.layers = LayerContainer()
self.layers.add(layer)
```
so the ensembles are safe to clone (cloning is unit tested).
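A minimal sketch of why this pattern keeps `clone` working (a hypothetical class, not mlens's actual implementation): because `layers` is a constructor parameter, `get_params` exposes it and `clone` can rebuild the estimator from it.

```python
from sklearn.base import BaseEstimator, clone


class Ensemble(BaseEstimator):
    """Minimal sketch of the add-pattern; illustrative, not mlens code."""

    def __init__(self, layers=None):
        self.layers = layers  # constructor parameter -> visible to get_params

    def add(self, estimators):
        if not self.layers:
            self.layers = []
        self.layers.append(estimators)
        return self


ens = Ensemble().add(["est_a"]).add(["est_b"])
ens2 = clone(ens)  # rebuilds from get_params, copying the layers parameter
```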
@flennerhag Correct me if I'm wrong, but one would mostly benefit from your library when iterating on a model, right? If that's the case, I think you could focus on your own API instead of scikit-learn's and build adapters for transforming to and from scikit-learn's pipelines (transforming layers into `FeatureUnion`s and stacked ensembles into a `Pipeline`; probably using what I did in scikit-learn/scikit-learn#8960 would be even better :grimacing:). The user would only need to transform it into a scikit-learn class after some iterations.
This way, you'd have better separation of concerns while also leveraging most of the benefits of scikit-learn's API, and you'd be able to evolve this library without needing to worry about scikit-learn's standards. Best of both worlds :)
Also, congrats on the project. Looks great!
The package is impressive. Since estimators are clonable I am +1 for inclusion.
Request for project inclusion in scikit-learn-contrib