maxim5 / hyper-engine

Python library for Bayesian hyper-parameters optimization
https://pypi.python.org/pypi/hyperengine
Apache License 2.0

Curve Predictor #3

Open baothienpp opened 6 years ago

baothienpp commented 6 years ago

Hi, I came here from Stack Overflow. I am trying to understand your implementation of the paper "Extrapolating of Learning Curve ..". As far as I understand, they use 11 different mathematical models to fit the learning curve and then predict with a Monte Carlo estimator. But I can't find where in your code these models are built or where the Monte Carlo calculation happens. Can you please clarify? Thanks

maxim5 commented 6 years ago

Hi @baothienpp,

I've implemented LinearCurvePredictor, which is a simple but rather efficient method. In my experiments it was good enough and saved ~50% of training time, though I haven't tried more sophisticated models. The downside is that it requires a burn-in period of ~20-25 full training cycles before it can understand the learning curves.

See the code in curve_predictor.py. Feel free to implement BaseCurvePredictor if you wish to try any other approximator.

baothienpp commented 6 years ago

That sounds interesting, because I tried the implementation from the paper. It took a lot of computational power because of the Monte Carlo calculation. I am trying to understand your method: could you tell me more about the concept behind it, or is it very similar to the paper?

maxim5 commented 6 years ago

> It took a lot of computational power because of the Monte Carlo calculation.

Yeah, I can imagine.

> I am trying to understand your method: could you tell me more about the concept behind it, or is it very similar to the paper?

It's a simple linear regression, solved with the normal equation. All the math is in the _compute_matrix method; everything around it is just there to make it nicer. Intuitively, it computes an average learning curve from the set of existing ones. The stop condition is that the current learning curve is significantly worse than the curves seen so far. Until you have tens of thousands of learning curves, it's very efficient.
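
To make the idea concrete, here is a minimal sketch of a normal-equation predictor over a set of learning curves. It is not the library's `_compute_matrix`: the function names, the choice of the first `k` points as features, the synthetic curves, and the `margin` threshold are all assumptions made for illustration.

```python
import numpy as np

def make_curves(n, epochs=20, seed=0):
    """Synthetic saturating learning curves (hypothetical stand-in for real data)."""
    rng = np.random.default_rng(seed)
    a = rng.uniform(0.3, 0.9, n)        # final accuracy each run converges to
    tau = rng.uniform(2.0, 4.0, n)      # how fast each run converges
    t = np.arange(1, epochs + 1)
    clean = a[:, None] * (1 - np.exp(-t[None, :] / tau[:, None]))
    return clean + rng.normal(0.0, 0.01, clean.shape)

def fit_normal_equation(curves, k):
    """Fit w so that [1, curve[:k]] @ w approximates the final value curve[-1]."""
    curves = np.asarray(curves, dtype=float)
    X = np.hstack([np.ones((len(curves), 1)), curves[:, :k]])  # bias + first k points
    y = curves[:, -1]
    # Normal equation: solve (X^T X) w = X^T y
    return np.linalg.solve(X.T @ X, X.T @ y)

def predict_final(w, partial_curve):
    """Predict the final value from the first k points of an unfinished curve."""
    k = len(w) - 1
    x = np.concatenate([[1.0], np.asarray(partial_curve, dtype=float)[:k]])
    return float(x @ w)

def should_stop(w, partial_curve, best_so_far, margin=0.05):
    """Stop when the predicted final value is clearly below the best seen so far."""
    return predict_final(w, partial_curve) < best_so_far - margin
```

In this sketch the burn-in curves are simply the rows passed to `fit_normal_equation`; a new run would be checked with `should_stop` once its first `k` points are available.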

baothienpp commented 6 years ago

May I ask why you fit a linear model instead of a polynomial? Do you think we could use a Gaussian Process with a squared exponential kernel to model the learning curve?

baothienpp commented 6 years ago

Hi, I think I figured out why you don't use a polynomial: you fit a linear model on a set of learning curves (multivariable regression). At first I understood that you fit a line on every single curve and make predictions based on that. So the burn-in period is the set of learning curves you have to collect first. Did I understand you correctly?

maxim5 commented 6 years ago

Hi @baothienpp,

Correct, the features are the whole curve. So the predictor doesn't try to learn trends or anything like that; it compares the given curve to the set of previous ones and estimates the probability that it'll be better. The burn-in period is basically the training data for the predictor.

I'm sure there are more sophisticated models, and I'd love to have more implementations in the library. If you're interested in contributing, I'd be happy to merge it.

maxim5 commented 6 years ago

By the way, I've added a bunch of examples lately. Please take a look; I'm looking forward to your feedback.

baothienpp commented 6 years ago

Thanks for those examples, they really help. I am thinking about using Bayesian linear regression (BLR) instead of simple linear regression. The BLR output is a normal distribution, so we could use simple math to calculate the probability that a learning curve will be good or bad. I will try it first and report back later. Generally, I like the idea of using simple regression over the model in the paper, which has just too much computational overhead.

maxim5 commented 6 years ago

@baothienpp Sounds great. Looking forward to seeing your model in action. When you test it, take a look at the tests.

baothienpp commented 6 years ago

Hi Maxim, a short unrelated question: if I want to use your idea in some of my work, how can I cite you?

maxim5 commented 6 years ago

Hi @baothienpp

It would be great if you did. Please use this BibTeX entry:

@article{podkolzine17,
  author  = {Maxim Podkolzine},
  title   = {Hyper-Engine: Hyper-parameters Tuning for Machine Learning},
  journal = {https://github.com/maxim5/hyper-engine},
  year    = {2017},
}

Of course, I'll be curious to read the paper once it's out, so don't forget to post the link here ;)

baothienpp commented 6 years ago

Thanks! Unfortunately it is something for work, so I can't publish it :( but don't worry, I cited you. It seems like your framework can only handle a single GPU; any chance of multi-GPU support?

baothienpp commented 6 years ago

So I did build a new model using your idea. I used Bayesian ridge regression. Basically, in linear regression you minimize the MSE, and in ridge regression you minimize the MSE plus an L2 regularization term; for more detail you can read here. Bayesian ridge regression is the probabilistic version of ridge regression, whose output is a mean and a variance. I then calculate the probability that the current curve reaches a higher peak than the previous best; the formula is exactly the one in Probability of Improvement. I tested it with your cifar10 learning curve set. Here is the result (the dashed lines are the curves used in burn-in): [image: curves_compare]. With a burn-in period as small as 5, it still gives good predictions.
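
The approach described here can be sketched roughly as follows. This is an illustration only, assuming scikit-learn's `BayesianRidge` and SciPy's `norm`; the function name, the use of the first `k` points as features, and the final value as target are assumptions, not the code that produced the result above.

```python
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import BayesianRidge

def probability_of_improvement(train_curves, partial_curve, best_so_far):
    """Estimate P(final value of the current run > best_so_far).

    train_curves: finished curves from the burn-in period (one row per run).
    partial_curve: the first k points of the run that is still training.
    """
    train_curves = np.asarray(train_curves, dtype=float)
    partial_curve = np.asarray(partial_curve, dtype=float)
    k = len(partial_curve)

    # Features: first k points of each finished curve; target: its final value.
    X, y = train_curves[:, :k], train_curves[:, -1]
    model = BayesianRidge().fit(X, y)

    # return_std=True yields the predictive mean and standard deviation.
    mean, std = model.predict(partial_curve.reshape(1, -1), return_std=True)

    # Survival function of N(mean, std^2) at best_so_far = P(value > best_so_far).
    return float(norm.sf(best_so_far, loc=mean[0], scale=std[0]))
```

A run would then be terminated when this probability drops below some chosen threshold; the threshold itself is a tuning decision and is not fixed by the method.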

maxim5 commented 6 years ago

This looks really impressive: a burn-in period of 5 is very low! Thanks for the update. If you can make a pull request or otherwise share your code, I'd incorporate it into the lib; it looks like a good default. Otherwise I'll try to replicate your results from scratch.

maxim5 commented 6 years ago

Sorry, I forgot about your question: right now, the model itself can go multi-GPU and that's it. I could implement distributed training at the library level, but I think trivial Bayesian optimization would assign the same hyper-parameters to all GPUs, so it doesn't make sense. It should be a bit smarter and run different optimizations in parallel, e.g., UCB on GPU 0 and the PI method on GPU 1.

baothienpp commented 6 years ago

I am currently a bit busy, but I will soon upload a short piece of code showing how I did it, because my implementation differs from your interface. Another question: is the portfolio strategy you used kind of randomly choosing a utility function every iteration?

maxim5 commented 6 years ago

OK. No problem.

> is the portfolio strategy you used kind of randomly choosing a utility function every iteration?

Yes, see BayesianPortfolioStrategy. You can either fix the distribution over utilities, or it will construct a distribution based on their performance.
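
As a rough illustration of the two modes mentioned here (a fixed distribution, or one derived from performance), the selection step could look like the sketch below. It is not the code of `BayesianPortfolioStrategy`: the function name, the utility names, and the softmax over performance scores are assumptions.

```python
import math
import random

def choose_utility(names, weights=None, scores=None, rng=random):
    """Pick one utility (acquisition) function name for the next iteration.

    weights: fixed sampling weights, if the distribution is set by hand.
    scores:  per-utility performance so far (higher is better); used only
             when no fixed weights are given.
    """
    if weights is None:
        if scores is None:
            weights = [1.0] * len(names)          # uniform by default
        else:
            top = max(scores)                     # subtract max for numerical stability
            weights = [math.exp(s - top) for s in scores]  # softmax numerators
    return rng.choices(names, weights=weights, k=1)[0]

# Example: a fixed distribution favouring UCB, then one driven by scores.
rng = random.Random(0)
fixed_pick = choose_utility(["ucb", "pi", "ei"], weights=[0.6, 0.2, 0.2], rng=rng)
scored_pick = choose_utility(["ucb", "pi", "ei"], scores=[0.1, 0.4, 0.2], rng=rng)
```

Sampling rather than always taking the best-scoring utility keeps some exploration across utilities, which is the usual motivation for a portfolio.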

baothienpp commented 6 years ago

So I am going to briefly describe my method. I used scikit-learn's implementation of BRR (http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.BayesianRidge.html#sklearn.linear_model.BayesianRidge). It has two methods, fit() and predict(); it is important to set the return_std parameter of predict() to True. Then you have the prediction and the std. To calculate the probability, I used the scipy package to compute the CDF:

        prediction, std = self.predict()
        # self.target is the max value of the current best curve
        probability = stats.norm(prediction, std).cdf(np.inf) - stats.norm(prediction, std).cdf(max(self.target))
        # The total probability over the whole normal distribution is 100%, but since I only
        # consider one half of it as 100%, any value bigger than 0.5 counts as 100% probability
        probability = min(probability * 100 / 0.5, 100)
        # if probability < 75, terminate!
        if probability < 75:
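
Since the CDF at infinity is 1, the first probability in the snippet above is the normal survival function at the target. The same arithmetic can be sketched with the standard library alone; the function names below are hypothetical, and the rescaling and the 75 threshold are taken from the post.

```python
from statistics import NormalDist

def improvement_probability(prediction, std, target):
    """P(final value > target) when the final value ~ N(prediction, std^2)."""
    return 1.0 - NormalDist(mu=prediction, sigma=std).cdf(target)

def rescaled_percent(probability):
    """Rescaling from the post: a raw probability of 0.5 or more maps to 100."""
    return min(probability * 100 / 0.5, 100)

def should_terminate(prediction, std, target, threshold=75):
    """Terminate when the rescaled probability falls below the threshold."""
    return rescaled_percent(improvement_probability(prediction, std, target)) < threshold
```

With this rescaling, a threshold of 75 corresponds to a raw probability of 0.375, so a run survives as long as its chance of beating the target is at least 37.5%.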

maxim5 commented 6 years ago

Got it. Do you use the same data as I did, i.e. the set of learning curves?

baothienpp commented 6 years ago

Yes, I used the curves in your JSON file.