andreped opened this issue 2 years ago
@andreped Thank you for reporting this issue! Could you please specify the use cases for this feature? Thank you!
Could you please specify the use cases for this feature?
What do you mean by "use cases"? Do you mean scenarios in which having a simple way to perform gradient accumulation would be beneficial?
Are you familiar with the concept of using gradient accumulation to "artificially" increase the batch size while holding memory usage fixed? Essentially, a batch is split into smaller micro-batches, the gradients are calculated for each, and then averaged across them, without ever holding the entire batch in memory. It is a generic technique for "approximating" large-batch training - that is the use case. Or am I misunderstanding something?
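For concreteness, here is a minimal sketch of the idea in plain TF2 (toy model and data; only the accumulate-then-apply pattern matters):

```python
import tensorflow as tf

# toy data: four micro-batches of 8 emulate one batch of 32
dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal((32, 8)), tf.random.normal((32, 1)))
).batch(8)

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(8,))])
optimizer = tf.keras.optimizers.SGD(1e-2)
loss_fn = tf.keras.losses.MeanSquaredError()

accum_steps = 4
accumulated = [tf.zeros_like(v) for v in model.trainable_variables]
for x, y in dataset.take(accum_steps):
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    # only the running sums live in memory, never the full batch
    accumulated = [a + g for a, g in zip(accumulated, grads)]

# one weight update with the MEAN of the micro-batch gradients
optimizer.apply_gradients(
    [(a / accum_steps, v) for a, v in zip(accumulated, model.trainable_variables)]
)
```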
/cc @georgepaw if he is interested in the design as Graphcore has an API for this in: https://github.com/graphcore/tensorflow/blob/r2.5/sdk-release-2.5/tensorflow/python/ipu/optimizers/gradient_accumulation_optimizer.py
@andreped Thanks for reporting the issue!
Currently users would need to write their own custom training loop to handle the gradient accumulation, which is not too hard, so we have not yet made this an API. I would like to understand more here - do you see gradient accumulation widely used? If it's a popular feature, we will design an API for that.
@chenmoneygithub
Currently users would need to write their own custom training loop to handle the gradient accumulation, which is not too hard, so we have not yet made this an API.
Agree. It's doable to implement it in a custom training loop. But at the same time, it would be feasible to have an API to do this with model.fit(). Implementing a custom loop just to get gradient accumulation is cumbersome (IMHO).
I would like to understand more here - do you see gradient accumulation widely used?
Yes, it's widely used when it's required. It's one of the techniques that enables larger-batch training with limited computational resources. FYI, it's mentioned in PyTorch Lightning as one of the effective training techniques.
Currently users would need to write their own custom training loop to handle the gradient accumulation
Actually, you don't even need to write your own custom training loop anymore in TF2. It is much easier to add support for it by overloading the train_step method. An example of how I did it can be seen here: https://github.com/andreped/GradientAccumulator/blob/main/GradientAccumulator/GAModelWrapper.py#L14
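For illustration, a heavily simplified, hypothetical sketch of that idea (not the linked implementation itself; the GAModel name is made up, and edge cases such as mixed precision and unconnected gradients are ignored):

```python
import tensorflow as tf

class GAModel(tf.keras.Model):
    # hypothetical simplified wrapper; build it functionally so variables exist
    def __init__(self, accum_steps=4, **kwargs):
        super().__init__(**kwargs)  # pass inputs=..., outputs=... here
        self.accum_steps = accum_steps
        self.accum_counter = tf.Variable(0, trainable=False)
        self.accum_grads = [
            tf.Variable(tf.zeros_like(v), trainable=False)
            for v in self.trainable_variables
        ]

    def train_step(self, data):
        x, y = data
        self.accum_counter.assign_add(1)
        with tf.GradientTape() as tape:
            y_pred = self(x, training=True)
            # scale so that summing micro-batch gradients yields their mean
            loss = self.compiled_loss(y, y_pred) / self.accum_steps
        grads = tape.gradient(loss, self.trainable_variables)
        for acc, grad in zip(self.accum_grads, grads):
            acc.assign_add(grad)
        # apply the accumulated gradients and reset every accum_steps batches
        tf.cond(self.accum_counter % self.accum_steps == 0,
                self._apply_and_reset, lambda: None)
        self.compiled_metrics.update_state(y, y_pred)
        return {m.name: m.result() for m in self.metrics}

    def _apply_and_reset(self):
        self.optimizer.apply_gradients(
            zip([acc.value() for acc in self.accum_grads],
                self.trainable_variables))
        for acc in self.accum_grads:
            acc.assign(tf.zeros_like(acc))

# usage: wrap an existing functional graph, then compile/fit as usual
inputs = tf.keras.Input((8,))
outputs = tf.keras.layers.Dense(1)(inputs)
model = GAModel(accum_steps=4, inputs=inputs, outputs=outputs)
model.compile(optimizer="adam", loss="mse")
```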
However, it is definitely a very commonly used method and would surely be a popular feature to add. Perhaps having an API to do just what I did there is a good idea? Not sure.
Note that it is important that this works in multi-GPU strategies, as that is one of its core usages. That is not something I have explored that much myself, but that is a popular use case for it.
Also note that if you introduce gradient accumulation naively, like I did above, some layers will not be directly compatible. You will have suboptimal behaviour with BatchNormalization, for instance, as it will update for every single micro-batch rather than once the gradient accumulation for a given mini-batch is done. Has anyone made an attempt to fix BN for this use case? @innat @dathudeptrai
Hence, you might lose the effect of gradient accumulation if you use BN in your model, which is an extremely popular layer. It might therefore be a good idea to solve that issue simultaneously.
The issue with BN in GA has been thoroughly discussed for PyTorch: https://forums.fast.ai/t/accumulating-gradients/33219/42
Attempts have been made, but as you can see it is not so easy to get it working properly: https://forums.fast.ai/t/accumulating-gradients/33219/62
Also note that it appears common to just SUM the gradients in gradient accumulation instead of doing a MEAN reduction. I think the latter makes more sense, but there might be situations where SUM reduction is more correct. Not sure. This is discussed in the same thread mentioned above.
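For a concrete sense of the difference: with accumulation factor `k` and micro-batch gradients `g_1, ..., g_k`, MEAN applies `(g_1 + ... + g_k) / k` while SUM applies `g_1 + ... + g_k`. For plain SGD the two differ only by a `k`-fold rescaling of the learning rate; for adaptive optimizers such as Adam, which are roughly scale-invariant in the gradients, the choice matters less, but MEAN keeps the update magnitude independent of the accumulation factor, which is usually what one wants when emulating a larger batch.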
Lastly, note this comment by @tomerk regarding how GA should be implemented in Keras (which might be a better idea than what I did, not sure): https://github.com/tensorflow/addons/issues/2260#issuecomment-747711685
Hope it helps!
Hey, I just wanted to throw in some personal experience with working on gradient accumulation in TF/Keras at Graphcore for IPUs.
```python
# Running-mean accumulation (pseudocode): after k micro-batches,
# accumulated_gradient equals the mean of the k micro-batch gradients,
# so no final division by the accumulation factor is needed.
accumulated_gradient = zeros()
for step in range(gradient_accumulation_factor):
    micro_batch_gradients = ...
    accumulated_gradient = (step / (step + 1)) * accumulated_gradient \
                         + (1 / (step + 1)) * micro_batch_gradients
```
Thanks for tagging me! I'll take a look. For context: I've implemented a version that works with Keras and model.fit() here: https://github.com/sokrypton/AccAdam_TF2
Thanks all for the great discussion!
@andreped Thanks for raising the BN issue, yes, it's something we should support. Actually, I am curious about the performance loss if we don't handle the accumulation for the BN layer - in your experiments, was there a big performance loss caused by the suboptimal treatment of BN?
For how to handle accumulation in distributed training, I believe the current MirroredStrategy can handle both SUM and MEAN. I need to double-check with our distributed training experts on 1) whether GPU distributed training could support aggregating over sub-batches and also across devices, and 2) whether this is supported in TPUStrategy.
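For reference, both reductions are indeed exposed when aggregating per-replica values; a standalone sketch (not the fit() internals):

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

def per_replica_value():
    # stand-in for a per-replica gradient
    rid = tf.distribute.get_replica_context().replica_id_in_sync_group
    return tf.cast(rid + 1, tf.float32)

per_replica = strategy.run(per_replica_value)
summed = strategy.reduce(tf.distribute.ReduceOp.SUM, per_replica, axis=None)
averaged = strategy.reduce(tf.distribute.ReduceOp.MEAN, per_replica, axis=None)
```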
Actually, I am curious about the performance loss if we don't handle the accumulation for the BN layer
@chenmoneygithub I have not performed a rigorous benchmark of GA with and without BN, but intuition tells us that BN would update too frequently and introduce noise into training. I have also observed this myself: the benefit of an increased batch size through GA was lost when BN was used in the model. But that is surely task- and data-dependent.
It has been suggested to change the default momentum hyperparameter based on the number of accumulation steps. Essentially, by reducing the momentum, the too-frequent updates of BN would introduce less noise and should therefore result in smoother behaviour. However, I have not seen an actual benchmark on this topic, nor am I aware of a best practice. Perhaps someone else knows?
But it would surely be better to be able to accumulate the parameters in BN similarly to what is done for the gradients when using GA, instead of playing around with BN parameters.
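As a side note on conventions: in PyTorch, momentum is the weight given to the new batch statistics, whereas in Keras it is the decay kept on the running statistics, so "reducing momentum" in the sense above corresponds to moving Keras's momentum argument closer to 1. A sketch of one such heuristic (hypothetical, not a benchmarked best practice):

```python
import tensorflow as tf

def ga_bn_momentum(momentum: float = 0.99, accum_steps: int = 4) -> float:
    # Heuristic: after accum_steps micro-batch updates, the running statistics
    # should decay roughly as much as after one update at the original momentum:
    #   adjusted ** accum_steps ~= momentum  =>  adjusted = momentum ** (1 / accum_steps)
    return momentum ** (1.0 / accum_steps)

bn = tf.keras.layers.BatchNormalization(momentum=ga_bn_momentum(0.99, 4))  # ~0.9975
```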
EDIT: Also, this study might be a good read for anyone interested in this topic: https://arxiv.org/pdf/2110.12484.pdf
They also propose a modification to BN to work better with GA.
Can also be mentioned that this implementation for synchronization of BN exists: https://www.tensorflow.org/api_docs/python/tf/keras/layers/experimental/SyncBatchNormalization
It is primarily made for multi-GPU training, but I guess it could also be used on a single GPU in a GA scenario? Not sure.
Just mentioning that I have a stable implementation for gradient accumulation now, as a temporary solution until Keras adds a proper method for it: https://github.com/andreped/GradientAccumulator
What is lacking is multi-GPU support. However, for my use case it is not that critical. My main use case is to artificially increase the batch size using a single GPU.
I came across here looking for gradient accumulation where I will train using:
1- Multiple GPUs.
2- FP16
3- Functional API.
My specific case is to train large language models (e.g., BART/PEGASUS) without TPUs or 100s of GPUs. In order to match the batch_size = 8000 mentioned in these papers, GA is a must. BART mentions (in Section 5, sentence 1) that training language models with very large batch sizes improves performance, so I second that the Keras fit() function should support this natively.
I came across here looking for gradient accumulation where I will train using:
1- Multiple GPUs.
2- FP16
3- Functional API.
My specific case is to train large language models (e.g., BART/PEGASUS) without TPUs or 100s of GPUs.
In order to match the batch_size = 8000 mentioned in these papers, GA is a must.
BART mentions (in Section 5, sentence 1) that training language models with very large batch sizes improves performance, so I second that the Keras fit() function should support this natively.
@meliksahturker Have you tested the tool I mentioned above? I have not added multi-GPU support yet, but all other stuff you mention should work. Can try to add multi-GPU support tomorrow, if you'd like.
At least using a single GPU, you can artificially increase the batch size to whatever size you want. But for such a large batch size, be sure to use the right optimizer.
But yeah, about time GA was added to Keras.
I have looked into it, but seeing that it does not support multiple GPUs, I haven't tested your tool.
I thought of trying this one, which has an example that seems to work in a multi-GPU setting, but it seems you are aware of its existence and developed your tool despite that. So my question is: have you experienced an issue with fsx950223's implementation?
Moreover, I have seen some mention of gradient accumulation causing issues with the batch normalization layer (especially in the presence of FP16). Have you looked into that, too?
Thanks for developing a tool for GA, btw.
@meliksahturker If you follow the commit history of the tool, you will see that I used the code you mentioned as a baseline. However, I did not reach the same results as regular batch training, which is why I went with a different approach.
In TF 2.2, it became possible to overload the train_step method of the Model class. This enabled me to trivially add GA support and to have full control over what happens when mixed precision is added. This is a much simpler solution than redoing the optimizer itself, which proved to be very challenging.
Adding multi-GPU support should be easy. I know what it takes. I just rarely use multi-GPU setups myself and thus have not had the time to add it. But I know of someone who has been successful in a project, so I will consult with him.
However, I have benchmarked my tool and achieve approximately the same results as regular batch training (as close as it gets, with expected deviations due to floating-point errors). So I believe it is working. I also run unit tests checking exactly this for each new update, and have run several benchmarks which I intend to make public when I get the time.
BN is not compatible with GA out of the box. It requires you to modify how and when the layer updates, which is not trivial in Keras. However, for some of my use cases it has worked fine to use batch size 8 and 4 accumulation steps, essentially boosting the overall batch size to 32. I have also added support for adaptive gradient clipping as a surrogate for BN in GA, but I have yet to see much benefit from it compared to using BN with GA. It might need some tuning of the AGC parameters, which you can do through the tool.
EDIT: It is a fundamental issue with using BN with GA, with or without mixed precision - they are not compatible. What I have observed are NaNs, especially using Adam. I solved this by lowering the learning rate and/or adjusting the epsilon of Adam from 1e-6 to 1e-3 (or something similar). But if you know of someone who has seen other issues with BN with GA, let me know. I'm currently making a benchmark and could add some more experiments, if of interest :)
This is great insight, especially regarding testing fsx950223's implementation thoroughly!
Since LMs are Transformer-based, which use BN heavily, I think I will skip GA for now. The issue with BN and GA makes it even more crucial for this feature to be added to fit() as a complete and noob-friendly parameter, e.g. grad_accumulation_steps = 4.
Definitely. It should just be an argument to set in Model.fit() or similar. GA is already available in PyTorch Lightning, and others have added support for GA in their frameworks. About time Keras does the same.
But note that making BN compatible with GA is definitely not easy. There were some people on the PyTorch forum who tried, but I have yet to see a working solution. I'm not tempted to try myself, but perhaps I will have to make an attempt soon. I also actively use BN, so I have the same problem as you.
A temporary fix could be to adjust the momentum term in all BN layers, which should make them more robust with GA. If I were you, I would at least try that :) Let me know how it goes. Always happy to contribute!
Hello @chenmoneygithub , any update on this issue? Thanks.
But note that making BN compatible with GA is definitely not easy.
If anyone is interested, I have now implemented a custom AccumBatchNormalization() layer which can be used as a direct drop-in replacement in gradient accumulation scenarios. For more information, see here or the dedicated documentation.
The model wrapping also now reaches results sufficiently comparable to regular batch training, benchmarked across various scenarios. To get started:
pip install gradient-accumulator
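A usage sketch (the import path and the accum_steps argument are my assumptions; see the linked documentation for the actual API):

```python
import tensorflow as tf
from gradient_accumulator.layers import AccumBatchNormalization  # import path assumed

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, input_shape=(16,)),
    AccumBatchNormalization(accum_steps=4),  # should match the accum_steps used elsewhere
    tf.keras.layers.Activation("relu"),
    tf.keras.layers.Dense(1),
])
```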
What still remains is multi-GPU support in a seamless manner. Any update regarding this, @chenmoneygithub?
Hi @andreped, thank you for all your work on this - would you be interested in contributing a PR adding gradient accumulation as a parameter to fit? Multi-GPU support should be easier to handle from Keras internals. New development work is currently focused on the multi-backend Keras (https://keras.io/keras_core/), as it will supersede current Keras as Keras 3. Regardless of whether GA ends up implemented in TF-Keras, we should strongly consider adding it to the next version of Keras via community contribution/when we have the internal bandwidth.
would you be interested in contributing a PR adding gradient accumulation as a parameter to fit?
Hello, @grasskin! Just came back from vacation, sorry for not replying earlier. I'm for sure interested in contributing.
To summarize for you, and whoever might be interested: in the gradient-accumulator package, I have developed the solutions mentioned above - a model wrapper that overloads train_step, an optimizer wrapper (GradientAccumulateOptimizer), and a custom AccumBatchNormalization layer.
Note that, regarding multi-GPU training, I have not yet had the time to properly benchmark this, as I do not use multiple GPUs simultaneously in my own research, and I am finalizing my PhD, so there is little time for open-source fun stuff.
@grasskin Regarding adding GA support to model.fit directly, I am not sure that is a good idea. There was an attempt made to do this for tf-keras, but there was an issue in tensorflow preventing it from working in a distributed setting. However, by wrapping the optimizer instead, this seemed to work as intended. So perhaps wrapping the optimizer, or adding GA support there directly, should be the way to go. What do you think?
I can also add GA support to the BN layer and wherever appropriate, but it would be good if we could discuss this further. Could you contact me via e-mail: andrped94@gmail.com
Perhaps something like this can be used as a workaround:
https://blog.ivanukhov.com/2024/01/31/gradient-accumulation.html
Perhaps something like this can be used as a workaround:
@IvanUkhov This only addresses the Adam optimizer. If you want to wrap "any" optimizer, use this instead. I have made a simple example below of how to do this:
First install gradient-accumulator by:
$ pip install gradient-accumulator
Then you can add GA support to any optimizer wrapper by:
```python
import tensorflow as tf
from gradient_accumulator import GradientAccumulateOptimizer

# some optimizer here
opt = tf.keras.optimizers.Adam(1e-3)

# wrap the optimizer to add GA support, here increasing the effective batch size 4x
opt = GradientAccumulateOptimizer(accum_steps=4, optimizer=opt)
```
Then you can do what you normally would.
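For instance, a toy end-to-end run (the model and data are just placeholders):

```python
import tensorflow as tf
from gradient_accumulator import GradientAccumulateOptimizer

# placeholder data and model, just to show that compile/fit stay unchanged
x = tf.random.normal((64, 8))
y = tf.random.normal((64, 1))
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(8,))])

opt = GradientAccumulateOptimizer(accum_steps=4, optimizer=tf.keras.optimizers.Adam(1e-3))
model.compile(optimizer=opt, loss="mse")
model.fit(x, y, batch_size=8, epochs=2)  # effective batch size: 8 * 4 = 32
```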
If you don't want to wrap the optimizer itself, you can also wrap the model. You can see how to do that via the link above.
There is also a thorough documentation on how to add gradient accumulation to an existing model pipeline here: https://gradientaccumulator.readthedocs.io/en/latest/background/gradient_accumulation.html
NOTE: In Keras 3, they seem to have made a solution for gradient accumulation now: https://github.com/keras-team/keras/pull/18948#issuecomment-1858991628
It is a PoC and not feature complete, but a step in the right direction :]
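For reference, in recent Keras 3 releases this surfaced as a built-in optimizer argument (a sketch; exact availability depends on the version):

```python
import keras

# gradient_accumulation_steps is inherited from the base optimizer class
opt = keras.optimizers.Adam(learning_rate=1e-3, gradient_accumulation_steps=4)
```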
The approach was illustrated on Adam, but it is not limited to that. The code is very general. It is a matter of replacing which class is inherited from.
The optimizer wrapper in GradientAccumulator, on the other hand, adds a lot of fragile complexity. It does not play well with the latest version of TensorFlow, where the internals of optimizers have been redesigned. Moreover, it is limited to SGD according to the documentation. Please correct me if that is not the case.
https://gradientaccumulator.readthedocs.io/en/latest/examples/distributed_training.html
It does not play well with the latest version of TensorFlow, where the internals of optimizers have been redesigned. Moreover, it is limited to SGD according to the documentation. Please correct me if that is not the case.
That's not the case. We have not upgraded the code itself to work with the latest 2.x versions, but I believe the legacy optimizer class is still there, which makes it work for newer versions of TensorFlow as well. The implementation you showed is similar to what I once made in TF1, and that worked fine for Adam, but even then I had an optimizer wrapper that worked well for SGD, Adam, and the other optimizers I normally used: https://github.com/AICAN-Research/H2G-Net/blob/main/src/utils/accum_optimizers.py#L11
The optimizer wrapper implementation in gradient-accumulator should work with "any" commonly used optimizer, Adam included. But it is likely a better option to extend the optimizer wrapper if you want to support more obscure optimizers.
What you are referring to is doing gradient accumulation in a distributed setting. I do not think there is any implementation in TF2 that works correctly in this scenario. Maybe Hugging Face has one, but the best option is just to move to Keras 3 if that is of interest, IMO.
The implementation you showed does not work correctly in a distributed setting. You need more logic to handle the replicas. That is not easy to solve, which is why there does not seem to be a solution that does this properly. It is also not possible to do this in a model-wrapper setting, as the synchronization across replicas results in an error. Optimizer wrapping is the only solution; it is actually what Keras 3 does as well.
If you attempt to do GA where you split the model in two across two GPUs, the optimizer wrapping solution should work. But if you want to split the batch across four GPUs, it does not. There is some logic missing, and it is not trivial to solve, IMO.
What you are referring to is doing gradient accumulation in a distributed setting. I do not think there is any implementation in TF2 that works correctly in this scenario. Maybe Hugging Face has one, but the best option is just to move to Keras 3 if that is of interest, IMO.
Yes, that is my use case. Distributed training on several GPUs with the mirrored strategy.
The implementation you showed does not work correctly in a distributed setting. You need more logic to handle the replicas. That is not easy to solve, which is why there does not seem to be a solution that does this properly. It is also not possible to do this in a model-wrapper setting, as the synchronization across replicas results in an error. Optimizer wrapping is the only solution; it is actually what Keras 3 does as well.
That logic resides in the optimizer one inherits from, which was Adam in the example given. They are all distributed-training aware. The implementation works well. Do you have any specific concerns?
That logic resides in the optimizer one inherits from, which was Adam in the example given. They are all distributed-training aware. The implementation works well. Do you have any specific concerns?
Well, that's very interesting. It could be that we should try not to override most of the new Optimizer class for newer TF versions; that way, this might just magically work. My issue is that I have gotten it working mechanically before, using multiple GPUs in a distributed setting, but I do not get the expected GPU memory use on each GPU, and I fail to see the same benefit from increasing the batch size 4x by splitting across 4 GPUs as I see in a single-GPU sequential setup.
But if you have managed to get it working for Adam, we could try to run benchmarks on Kaggle (2 GPUs available for free). And if that works, we could just rewrite the optimizer wrapper to use the new Optimizer class, which may require a lot less code :]
Are you interested in posting that issue here: https://github.com/andreped/GradientAccumulator/issues
And then we could maybe make a PR for it, @IvanUkhov, if we find that your solution works for newer TF versions.
Describe the feature and the current behavior/state:
Gradient accumulation is extremely useful when working with large images/volumetric data, using low-end hardware, or training on multiple GPUs. For me, the most important feature is to be able to use larger batch sizes without exhausting memory.
Currently, there does not seem to be a straightforward way to use gradient accumulation in Keras.
What I have tried:
In TF1, we created a wrapper that could be used on any optimizer, changing how and when the update should happen. I have tried to implement such a method in TF2, greatly inspired by the attempts of other developers at TF Addons, such as @fsx950223 and @stefan-falk (https://github.com/tensorflow/addons/pull/2525). However, I have not managed to get the expected behaviour (see here for some of the experiments I performed, and here for the optimizer wrapper implementation).
I therefore looked around for alternative solutions and found this suggestion on Stack Overflow. I have expanded upon the idea, and it seems to be working. After some thorough debugging and benchmarking, I have made a simple solution available in this repo, so that at least one simple solution for GA exists in TF2.
Proposed solution:
The idea is extremely simple: overload the train_step method of tf.keras.Model and add gradient accumulation support there. In the end, I have produced a simple model wrapper that does this for you, which you should ideally be able to apply to any tf.keras.Model to enable gradient accumulation, like so:
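(A sketch of the intended usage; the wrapper name and signature follow the linked repo at the time of writing and may have changed:)

```python
import tensorflow as tf
from gradient_accumulator import GAModelWrapper  # wrapper from the linked repo; exact signature assumed

# any functional tf.keras.Model can be wrapped
base = tf.keras.applications.MobileNetV2(input_shape=(224, 224, 3), weights=None)
model = GAModelWrapper(accum_steps=4, inputs=base.input, outputs=base.output)

model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss="categorical_crossentropy")
# train as usual; weights are updated every 4 batches using the averaged gradients
```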
There is definitely some work left to make it handle all scenarios, but it seems to work fine on the use cases I have tested so far.
So what remains?
Currently, I am unsure whether this is the best approach. Perhaps there is a better way of solving this. A challenge might be to get distributed training working with multiple GPUs. I believe that was the biggest obstacle with the optimizer wrapper solution.
Are there any devs working on adding gradient accumulation support in Keras?
Are you willing to contribute it: Yes