Open baothienpp opened 6 years ago
Hi @baothienpp,
I've implemented LinearCurvePredictor, which is a simple but rather efficient method. In my experiments it was good enough and saved ~50% of training time, though I haven't tried more sophisticated models. The downside is that it requires a burn-in period of ~20-25 full training cycles before it can make sense of the learning curves.
See the code in curve_predictor.py. Feel free to implement BaseCurvePredictor if you wish to try any other approximator.
That sounds interesting, because I tried the implementation from the paper. It took a lot of computational power because of the Monte Carlo calculation. I am trying to understand your method; could you tell me more about the concept behind it, or is it very similar to the paper?
It took a lot of computational power because of the Monte Carlo calculation.
Yeah, I can imagine.
I am trying to understand your method; could you tell me more about the concept behind it, or is it very similar to the paper?
It's a simple linear regression, solved with the normal equation. All the math is in the _compute_matrix method; everything around it is just there to make it nicer. Intuitively, it computes an average learning curve from the set of existing ones. The stop condition is that the current learning curve is significantly worse than the curves seen so far. Until you have tens of thousands of learning curves, it's very efficient.
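To make the idea concrete, here is a minimal sketch. It is not the actual code from curve_predictor.py. It assumes the features are the first few points of each curve and the target is the curve's final value; the synthetic curves, prefix_len, margin, and the helper names fit_normal_equation and predict_final are all made up for illustration.

```python
import numpy as np

def fit_normal_equation(X, y):
    """Least-squares weights from the normal equation (X^T X) w = X^T y."""
    X = np.column_stack([np.ones(len(X)), X])  # prepend a bias column
    return np.linalg.solve(X.T @ X, X.T @ y)

def predict_final(weights, prefix):
    """Predict the final value of a curve from its first points."""
    return float(weights[0] + prefix @ weights[1:])

# Synthetic burn-in set (assumption): 30 saturating curves, 20 epochs, small noise.
rng = np.random.default_rng(0)
epochs = np.arange(1, 21)
plateaus = np.linspace(0.5, 0.9, 30)
curves = plateaus[:, None] * (1 - np.exp(-epochs / 3))
curves = curves + rng.normal(0, 0.005, size=curves.shape)

prefix_len = 4  # hypothetical: how many early points are used as features
weights = fit_normal_equation(curves[:, :prefix_len], curves[:, -1])

# Two new, partially trained curves: one promising, one poor.
shape = 1 - np.exp(-epochs[:prefix_len] / 3)
pred_good = predict_final(weights, 0.9 * shape)
pred_bad = predict_final(weights, 0.5 * shape)

# Hypothetical stop rule: stop if the predicted final value falls
# more than `margin` below the best final value seen so far.
best_so_far = curves[:, -1].max()
margin = 0.1
stop_good = pred_good < best_so_far - margin
stop_bad = pred_bad < best_so_far - margin
```

With this synthetic data, the poor curve is flagged for early stopping after only 4 epochs and the promising one is not. The burn-in curves are exactly the rows of the training matrix, which is why the predictor needs them before it can say anything.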
May I ask why you don't fit a polynomial instead of a linear model? Do you think we could use a Gaussian process with a squared exponential kernel to model the learning curve?
Hi, I think I figured out why you don't use a polynomial: you fit a linear model on a set of learning curves (multivariable regression). At first I understood that you fit a line to every single curve and make predictions based on that. So the burn-in period is the set of learning curves you have to collect first. Did I understand you correctly?
Hi @baothienpp ,
Correct, the features are the whole curve. So the predictor doesn't try to learn trends or something like that, it compares the given curve to the set of previous ones and checks the probability it'll be better. The burn-in period is basically the training data for the predictor.
I'm sure there are more sophisticated models, and I'd love to have more implementations in the library. If you're interested in contributing, I'd be happy to merge it.
By the way, I've added a bunch of examples lately. Please take a look; I'm looking forward to your feedback.
Thanks for those examples, they really help. I am thinking about using Bayesian linear regression (BLR) instead of simple linear regression. The BLR output is a normal distribution, so we could use simple math to calculate the probability that a learning curve will be good or bad. I will try it first and report back later. Generally, I like the idea of using simple regression over the model in the paper, which has just too much computational overhead.
@baothienpp Sounds great. Looking forward to seeing your model in action. When you test it, take a look at the tests.
Hi Maxim, a short unrelated question: if I want to use your idea in some of my work, how can I cite you?
Hi @baothienpp
That would be great. Please use this entry:
@article{podkolzine17,
author = {Maxim Podkolzine},
title = {Hyper-Engine: Hyper-parameters Tuning for Machine Learning},
journal = {https://github.com/maxim5/hyper-engine},
year = {2017},
}
Of course, I'll be curious to read the paper once it's out, so don't forget to post the link here ;)
Thanks! Unfortunately it is something for work, so I can't publish it :( but don't worry, I cited you. It seems like your framework can only handle a single GPU; any chance of multi-GPU support?
So I built a new model using your idea. I used Bayesian ridge regression. Basically, in linear regression you minimize the MSE, and in ridge regression you minimize MSE + L2 regularization; for more detail you can read here. Bayesian ridge regression is the probabilistic version of ridge regression, whose output is a mean and a variance. I then calculate the probability that the current curve will reach a higher maximum than the previous best; the formula is exactly the one used in Probability of Improvement. I tested it with your cifar10 learning curve set. Here is the result (the dashed lines are the curves used in burn-in). With a burn-in period as small as 5, it still gives good predictions.
This looks really impressive: a burn-in period of 5 is very low! Thanks for the update. If you can make a pull request or otherwise share your code, I'd incorporate it in the lib, and it looks like a good default. Otherwise I'll try to replicate your results from scratch.
Sorry, I forgot about your question: right now, the model itself can go multi-GPU and that's it. I'd implement distributed training at the library level, but I think the trivial Bayesian optimization would assign the same hyper-parameters to all GPUs, so it doesn't make sense. It should be a bit smarter and run different optimizations in parallel, e.g., UCB on GPU 0 and the PI method on GPU 1.
I am currently a bit busy, but I will soon upload a short snippet to show how I did it, because I implemented it differently from your interface. Another question: does the portfolio strategy you use randomly choose a utility function at every iteration?
OK. No problem.
does the portfolio strategy you use randomly choose a utility function at every iteration?
Yes, see BayesianPortfolioStrategy. You can either fix the distribution over utilities, or let it construct a distribution based on their performance.
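A rough sketch of the second mode, where the distribution adapts to performance. This is not BayesianPortfolioStrategy itself: the scoring rule, the reward values, the EI entry, and the names choose_utility and update_score are all hypothetical.

```python
import random

def choose_utility(scores, rng):
    """Sample a utility name with probability proportional to its score."""
    names = list(scores)
    weights = [scores[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]

def update_score(scores, name, reward, rate=0.5):
    """Move the chosen utility's score toward the reward it just earned."""
    scores[name] = (1 - rate) * scores[name] + rate * reward

rng = random.Random(0)
scores = {"UCB": 1.0, "PI": 1.0, "EI": 1.0}    # start from a uniform distribution
rewards = {"UCB": 1.0, "PI": 0.1, "EI": 0.1}   # made-up rewards for the demo
picks = {name: 0 for name in scores}

for _ in range(200):
    name = choose_utility(scores, rng)  # one utility is drawn per iteration
    picks[name] += 1
    update_score(scores, name, rewards[name])
```

Fixing the distribution corresponds to skipping update_score and keeping the initial scores. With the update enabled, utilities that keep earning low rewards are sampled less and less often, but never drop to zero probability.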
So I am going to briefly describe my method. I used scikit-learn to implement BRR (http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.BayesianRidge.html#sklearn.linear_model.BayesianRidge). It has two methods, fit() and predict(); it is important to set the return_std parameter of predict() to True. That gives you both the prediction and its standard deviation. To calculate the probability, I used the scipy package to compute the CDF:
from scipy import stats

prediction, std = self.predict()
# self.target holds the current best curve; its max is the value to beat
# P(value > target): cdf(np.inf) is always 1, so this is simply 1 - cdf(target)
probability = 1 - stats.norm(prediction, std).cdf(max(self.target))
# Only the upper half of the distribution is treated as the full 100% scale,
# so any probability of 0.5 or more is capped at 100%
probability = min(probability * 100 / 0.5, 100)
# if probability < 75, terminate!
if probability < 75:
    ...  # stop the current training run here (the action is not shown in the original)
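For anyone who wants to try the probability step in isolation, here is a self-contained sketch that uses only the standard library (statistics.NormalDist instead of scipy). The mean and std would come from predict() with return_std=True; the numbers below and the helper names probability_of_improvement and rescaled_percent are made up for illustration.

```python
from statistics import NormalDist

def probability_of_improvement(mean, std, best_so_far):
    """P(value > best_so_far) for value ~ Normal(mean, std)."""
    if std <= 0:
        # Degenerate case: no uncertainty, so the answer is all or nothing.
        return 1.0 if mean > best_so_far else 0.0
    return 1.0 - NormalDist(mean, std).cdf(best_so_far)

def rescaled_percent(p):
    """Map p in [0, 0.5] linearly onto [0, 100]; anything above 0.5 counts as 100."""
    return min(p * 100 / 0.5, 100.0)

# Prediction equal to the best value so far: exactly half the mass lies above it.
p_equal = probability_of_improvement(0.80, 0.05, 0.80)
# Prediction two standard deviations below the best value so far.
p_below = probability_of_improvement(0.70, 0.05, 0.80)

threshold = 75  # percent, as in the snippet above
stop_equal = rescaled_percent(p_equal) < threshold
stop_below = rescaled_percent(p_below) < threshold
```

Because of the rescaling, the 75 threshold corresponds to a raw probability of 0.375: a curve is stopped once the model gives it less than a 37.5% chance of beating the current best.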
Got it. Do you use the same data as I did, i.e. the set of learning curves?
Yes, I used the curves in your JSON file.
Hi, I am from Stack Overflow. I am trying to understand your implementation of the paper "Extrapolating of Learning Curve ..". As far as I understand, they use 11 different mathematical models to fit the learning curve and then predict with a Monte Carlo estimator. But I can't find where in your code these models are built or where the Monte Carlo calculation is. Could you please clarify? Thanks