ludwig-ai / ludwig

Low-code framework for building custom LLMs, neural networks, and other AI models
http://ludwig.ai
Apache License 2.0
11.18k stars · 1.19k forks

Batch size manipulation on plateau #714

Closed theodor38 closed 3 weeks ago

theodor38 commented 4 years ago

Hi,

In my experience decreasing the batch size has provided better accuracy with the data set that I have. Is there a way to reduce batch size on plateau instead of increasing?

I have tried using 0.5 instead of 2 with "decrease batch size on plateau rate" but this crashes.

Thank you in advance

w4nderlust commented 4 years ago

@theodor38 Increasing the batch size is a practice that comes from this paper: https://openreview.net/pdf?id=B1Yy1BxCZ It has become pretty standard. So no, there is no way at the moment to reduce the batch size on plateau. I could consider this a feature request if you want.
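For reference, the grow-on-plateau behavior lives in the training section of the model definition. A hedged sketch of the relevant knobs, with parameter names as I recall them from the Ludwig docs of this era (double-check against your installed version):

```python
# Sketch of the training-section knobs for growing the batch size on
# plateau. Parameter names follow the Ludwig docs of this era; verify
# them against your installed version before relying on them.
model_definition = {
    "training": {
        "batch_size": 128,
        # How many times the batch size may grow when validation plateaus.
        "increase_batch_size_on_plateau": 3,
        # Multiplier applied at each trigger (the value the OP tried to
        # set to 0.5, which currently crashes).
        "increase_batch_size_on_plateau_rate": 2,
        # Upper bound on the grown batch size.
        "increase_batch_size_on_plateau_max": 1024,
    }
}
```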

theodor38 commented 4 years ago

@w4nderlust I mean I tried both increasing and decreasing. Decreasing worked because I have quite a large data set, and starting with the default batch size of 128 takes a very long time when searching for optimal hps. So a decrease-on-plateau option would be beneficial, as it would serve both the hp search and accuracy simultaneously, at least in my case. Starting with a large batch size and then gradually reducing it definitely helped me achieve better accuracy almost every time. And it is faster.

Thank you for your response.

all the best

w4nderlust commented 4 years ago

I don't really understand how it can be faster :) A single batch surely is faster, but considering the time to complete an epoch (which is what matters) it will definitely be slower.

Also, a small batch size gives you a less accurate estimate of the true gradient (the gradient on the whole dataset). The rough intuition of that paper is that at the beginning of training you can get away with a worse estimate of the gradient because you are far from any minimum anyway, while later in training you want more accurate estimates because you are closer to a minimum and the gradients are smaller, so reducing the noise in the gradients is a good idea.
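The noise argument above can be illustrated numerically. This toy simulation (not Ludwig code, just an illustration) models each per-example gradient as the true gradient plus Gaussian noise; averaging over a mini-batch of size B shrinks the noise roughly like 1/sqrt(B):

```python
import numpy as np

# Toy illustration: a mini-batch gradient is an average of noisy
# per-example gradients, so its standard deviation shrinks roughly
# like 1/sqrt(batch_size). Larger batches -> less noisy estimates.
rng = np.random.default_rng(0)
true_grad = 1.0
per_example_noise = 2.0
n_trials = 10_000

noise_by_batch_size = {}
for batch_size in (8, 128, 2048):
    # Each trial averages `batch_size` noisy per-example gradients.
    estimates = true_grad + per_example_noise * rng.standard_normal(
        (n_trials, batch_size)
    ).mean(axis=1)
    noise_by_batch_size[batch_size] = estimates.std()
    print(batch_size, round(noise_by_batch_size[batch_size], 3))
```

The measured spread tracks the theoretical per_example_noise / sqrt(batch_size), which is the sense in which growing the batch size late in training plays the same role as decaying the learning rate.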

I don't doubt your specific experience, and it would be interesting to study it in more detail, but it goes kind of against my intuition and most papers I've read. That said, I'm more than willing to consider this a feature request and modify the implementation to make it work with decreasing batch sizes too. If you are interested in contributing this, let me know and I will point you to the part of the code that needs to be changed; it would be really simple to do. Otherwise I will just add it to the backlog and eventually get to it ;)

theodor38 commented 4 years ago

Since I am using an evolutionary strategy to tune hps, starting with a large batch size at first is faster to converge on optimal hps, and then making the batch size smaller and smaller improves accuracy; smaller batch size == slower epoch completion. Although this assumes the optimal NN structure found with a large batch size will work as well with a small one, it seems to.

Intuition from the paper seems solid. I will experiment more to make sure I am not fooling myself :) But so far I am convinced.

Yeah please point me in the right direction for the contrib and I will at least try :)

Thank you

theodor38 commented 4 years ago

[Image: IMG_4093 — batch size chosen by the evolutionary search, plotted against models ranked by accuracy]

As you can see in this image, the evo algo was given a choice of batch sizes between 1000 and 20000. The x axis is sorted by accuracy, from high to low (1 being the highest-accuracy model). It ends up converging to a low batch size every time! This is with a stationary batch size throughout the entire training. This gave me the idea to start high and then lower it.

w4nderlust commented 4 years ago

> Since I am using an evolutionary strategy to tune hps, starting with a large batch size at first is faster to converge on optimal hps, and then making the batch size smaller and smaller improves accuracy; smaller batch size == slower epoch completion. Although this assumes the optimal NN structure found with a large batch size will work as well with a small one, it seems to.

> [Image: IMG_4093]

> As you can see in this image, the evo algo was given a choice of batch sizes between 1000 and 20000. The x axis is sorted by accuracy, from high to low (1 being the highest-accuracy model). It ends up converging to a low batch size every time! This is with a stationary batch size throughout the entire training. This gave me the idea to start high and then lower it.

> Intuition from the paper seems solid. I will experiment more to make sure I am not fooling myself :) But so far I am convinced.

If this is confirmed it is definitely interesting to explore. Things I would look out for: what is your fitness function? Meaning, what is your y axis on the plot? If it is training loss, you may be figuring out the parameters that best overfit :) If it is accuracy, make sure it's validation accuracy. Also, check the sign if you are minimizing / maximizing, as you may actually be getting the opposite of what you want.

> Yeah please point me in the right direction for the contrib and I will at least try :)

This is the relevant part of the code: https://github.com/uber/ludwig/blob/master/ludwig/models/model.py#L1291-L1299 https://github.com/uber/ludwig/blob/aa4e52ef948ec50e20e339f4b5bb859281ed9d59/ludwig/models/model.py#L1677-L1730 I think changing it to work with rate values between 0 and 1 would be easy; just make sure that the batch size never drops below 1 and be mindful of recasting to integer. If you make it work, feel free to create a PR.
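A minimal sketch of what the symmetric update rule could look like. The function name and signature here are illustrative, not Ludwig's actual API; the real change would go into the linked sections of `model.py`:

```python
def update_batch_size_on_plateau(batch_size, rate,
                                 max_batch_size=None, min_batch_size=1):
    """Scale `batch_size` by `rate` when a plateau is detected.

    rate > 1 grows the batch size (the current Ludwig behavior);
    0 < rate < 1 shrinks it. The result is recast to int and clamped
    so it never drops below `min_batch_size` or exceeds `max_batch_size`.
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    new_size = int(batch_size * rate)
    # Clamp: int() can round a shrinking batch size down to 0.
    new_size = max(new_size, min_batch_size)
    if max_batch_size is not None:
        new_size = min(new_size, max_batch_size)
    return new_size
```

For example, `update_batch_size_on_plateau(128, 0.5)` yields 64, and repeated halving bottoms out at `min_batch_size` instead of crashing.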

On a separate note, would you consider contributing the evolutionary hp optimization? We have a branch containing a way to do hp optimization in Ludwig; it already contains grid and random strategies, there will be a bayesian one soon too, and having an evolutionary one would be really cool. Here you can see how it's implemented: https://github.com/uber/ludwig/blob/hyperopt/ludwig/hyperopt.py https://github.com/uber/ludwig/blob/hyperopt/ludwig/utils/hyperopt_utils.py If you are interested, adding a class that implements the HyperoptStrategy interface should be very easy.
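To make the shape of such a contribution concrete, here is a rough sample-and-update sketch of an evolutionary search over a flat hyperparameter space. The class name and method names are illustrative only; the real HyperoptStrategy interface in the hyperopt branch will differ:

```python
import random

# Illustrative (mu + lambda)-style evolutionary search over a flat
# hyperparameter space; NOT the actual HyperoptStrategy interface.
class SimpleEvolutionStrategy:
    def __init__(self, space, population_size=8, seed=0):
        # space: {param_name: list of candidate values}
        self.space = space
        self.population_size = population_size
        self.rng = random.Random(seed)
        self.scored = []  # (score, params) pairs, higher score is better

    def sample(self):
        if len(self.scored) < self.population_size:
            # Bootstrap phase: random individuals.
            return {k: self.rng.choice(v) for k, v in self.space.items()}
        # Mutation phase: perturb one parameter of the best individual.
        best_params = max(self.scored, key=lambda sp: sp[0])[1]
        child = dict(best_params)
        key = self.rng.choice(list(self.space))
        child[key] = self.rng.choice(self.space[key])
        return child

    def update(self, params, score):
        self.scored.append((score, params))
```

The driver loop would repeatedly call `sample()`, train a model with the returned parameters, and feed the validation metric back through `update()`.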

theodor38 commented 4 years ago


> Meaning, what is your y axis on the plot? If it is training loss, you may be figuring out the parameters that best overfit :) If it is accuracy, make sure it's validation accuracy.

The y axis on the plot is batch size. Sorry for not putting proper names on the axes: x is models ranked by accuracy and y is batch size. On the x axis there are ~1500 models ranked, and the leftmost has the best accuracy. I have a very robust validation set and all accuracy is measured on it. There is little to no overfitting, as these models are also tested in live production.

From reading papers on hps, I have seen that most concepts are based on conv net models. However, I believe some of the intuition gained from conv nets does not apply to simple DNNs for regression etc.; at least it seems that way. Any thoughts on this?

I will do my best to contrib. It will be a challenge :)

w4nderlust commented 4 years ago


> Meaning, what is your y axis on the plot? If it is training loss, you may be figuring out the parameters that best overfit :) If it is accuracy, make sure it's validation accuracy.

> The y axis on the plot is batch size. Sorry for not putting proper names on the axes: x is models ranked by accuracy and y is batch size. On the x axis there are ~1500 models ranked, and the leftmost has the best accuracy. I have a very robust validation set and all accuracy is measured on it. There is little to no overfitting, as these models are also tested in live production.

> From reading papers on hps, I have seen that most concepts are based on conv net models. However, I believe some of the intuition gained from conv nets does not apply to simple DNNs for regression etc.; at least it seems that way. Any thoughts on this?

Got it. It's an interesting finding then, worth exploring. I'm also curious about the production use cases you are using Ludwig for. Feel free to reach out to me privately to chat about them if they can't be discussed publicly, as knowing about them can help me prioritize future developments.

> I will do my best to contrib. It will be a challenge :)

Feel free to ask me if something is not clear :)