reiinakano / xcessiv

A web-based application for quick, scalable, and automated hyperparameter tuning and stacked ensembling in Python.
http://xcessiv.readthedocs.io
Apache License 2.0

Automated ensembling techniques #34

Closed reiinakano closed 7 years ago

reiinakano commented 7 years ago

Having worked with Xcessiv for a while, I feel there's a need for some way to automate the selection of base learners in an ensemble. I'm unaware of existing techniques for this, so if anyone has any suggestions or could point me towards relevant literature, it would be greatly appreciated.

techscientist commented 7 years ago

This would be an awesome idea.

One idea that I have would be to do the following:

  1. Ask the user to select an evaluation metric that he/she wishes to maximize (e.g. accuracy) or minimize.
  2. Ask the user to select the maximum number of top-performing ensembles to keep (i.e. a top-k list ranked by that metric).
  3. Then generate random or grid-based combinations of multiple base estimators. For each one, train it and evaluate it against the metric; if its performance makes the top-k list, add it, maintaining a rolling top-k list of the best-performing ensembles (see the sketch below).

In this approach, it is definitely important to let the user quit the automation process while it is running.
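As a rough sketch of what step 3 could look like (assuming scikit-learn-style base learners, and with `train_fn` and `score_fn` as hypothetical stand-ins for Xcessiv internals, not its actual API):

```python
import heapq
import random

def random_combination_search(base_learners, train_fn, score_fn, k=10, n_iter=100, seed=0):
    """Randomly sample combinations of base learners and keep a rolling top-k.

    train_fn(combo) is assumed to fit a stacked ensemble from a tuple of
    base learners; score_fn(ensemble) returns the metric to maximize.
    Both are hypothetical stand-ins for Xcessiv internals.
    """
    rng = random.Random(seed)
    top_k = []  # min-heap of (score, i, combo); the worst entry pops first
    for i in range(n_iter):
        size = rng.randint(2, len(base_learners))
        combo = tuple(rng.sample(base_learners, size))
        score = score_fn(train_fn(combo))
        entry = (score, i, combo)  # i breaks ties so combos are never compared
        if len(top_k) < k:
            heapq.heappush(top_k, entry)
        elif score > top_k[0][0]:
            heapq.heapreplace(top_k, entry)
    return sorted(top_k, reverse=True)  # best-scoring combinations first
```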

How does this sound, @reiinakano ? Maybe this would be good for an initial implementation?

reiinakano commented 7 years ago

I haven't actually figured out the best way to let a user quit a process manually. Currently, the only way to do that is to forcibly close the terminal running Xcessiv. One good thing about Xcessiv is that it automatically stores the meta-features of every base learner it scores, so calculating the performance of one ensemble is actually quite fast, since the only training you do is for the secondary estimator.
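For illustration, here's a minimal sketch of why re-scoring an ensemble is cheap, assuming the out-of-fold meta-features of each base learner are already cached as arrays (`score_ensemble` and its arguments are hypothetical names, not Xcessiv's actual API):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def score_ensemble(meta_features, y, selected, secondary=None):
    """Score a stacked ensemble from cached base-learner meta-features.

    meta_features maps a base-learner id to its saved out-of-fold
    predictions; only the secondary estimator is trained here, which is
    why evaluating one ensemble is cheap.
    """
    secondary = secondary if secondary is not None else LogisticRegression()
    X_meta = np.column_stack([meta_features[bl_id] for bl_id in selected])
    return cross_val_score(secondary, X_meta, y, cv=5).mean()
```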

Anyway, I don't think it's necessary to maintain a rolling top-k list of ensembles. Instead, I'd just store everything that was calculated in the database; you can easily sort by whatever metric you want anyway. This is what currently happens when you do Bayesian optimization on the base learners: the list of base learners simply auto-updates while the search runs.

I was thinking of doing something along the same lines for stacked ensembles. What I need is a smart algorithm or technique for selecting which base learners should be used, and in what combinations. One way people do this is with a greedy approach: iteratively try adding base learners and keep each one only if the target metric improves. Of course, random combinations of base learners might be a good approach too, considering that random search tends to beat grid search when optimizing base learners.
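A minimal sketch of that greedy approach (in the spirit of Caruana et al.'s forward model selection), assuming a hypothetical `score_fn` that trains the secondary estimator on a candidate selection and returns the metric:

```python
def greedy_forward_selection(candidates, score_fn):
    """Greedy forward model selection over a pool of base learners.

    Starts from an empty ensemble and repeatedly adds whichever remaining
    base learner improves the metric most, stopping when no addition helps.
    score_fn(selection) is a hypothetical callable returning the metric
    (higher is better) for a list of base-learner ids.
    """
    selected, best_score = [], float("-inf")
    remaining = list(candidates)
    while remaining:
        trial_score, best_candidate = max(
            ((score_fn(selected + [c]), c) for c in remaining),
            key=lambda t: t[0],
        )
        if trial_score <= best_score:
            break  # no remaining learner improves the ensemble
        selected.append(best_candidate)
        remaining.remove(best_candidate)
        best_score = trial_score
    return selected, best_score
```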

techscientist commented 7 years ago

@reiinakano , that's also a good point. So maybe go with a random approach for now and add more later? Perhaps more developers will add their ideas to this issue and others as time goes on.

I also think that a random approach might be better than grid search for now.

reiinakano commented 7 years ago

Agreed. I think it's important to settle on some kind of framework so that, in the future, different exploration methods can be added very easily. I certainly intend to add things other than Bayesian optimization for optimizing base learners in the future, e.g. Hyperband.
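One way such a framework might look, using hypothetical names rather than Xcessiv's actual internals: a small strategy interface that random search, Bayesian optimization, greedy forward selection, or Hyperband could all implement:

```python
from abc import ABC, abstractmethod

class ExplorationStrategy(ABC):
    """Hypothetical plug-in interface for ensemble search methods."""

    @abstractmethod
    def propose(self, history):
        """Return the next candidate to evaluate, given past (candidate, score) pairs."""

class RandomSearch(ExplorationStrategy):
    """Simplest possible strategy: ignore history, sample at random."""

    def __init__(self, sampler):
        self.sampler = sampler  # callable that returns a random candidate

    def propose(self, history):
        return self.sampler()

# Bayesian optimization, greedy forward selection, or Hyperband would each
# implement the same propose() interface, so new methods slot in easily.
```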

Thanks for your input! I appreciate it a lot!

reiinakano commented 7 years ago

Added automated ensembling based on greedy forward model selection in #43; it's included in v0.5.0.