keras-team / tf-keras

The TensorFlow-specific implementation of the Keras API, which was the default Keras from 2019 to 2023.
Apache License 2.0

Gradient accumulation support? #107

Open andreped opened 2 years ago

andreped commented 2 years ago

Describe the feature and the current behavior/state:

Gradient accumulation is extremely useful when working with large images/volumetric data, using low-end hardware, or training on multiple GPUs. For me, the most important feature is to be able to use larger batch sizes without exhausting memory.

Currently, there does not seem to be a straightforward way to use gradient accumulation in Keras.

What I have tried:

In TF1, we created a wrapper that can be used on any optimizer, which changes how and when the update should happen. I have tried to implement such a method in TF2, greatly inspired by the attempt by other developers at TF-addons, such as @fsx950223 and @stefan-falk https://github.com/tensorflow/addons/pull/2525. However, I have not managed to get expected behaviour (see here to see some of the experiments I performed, and here for the optimizer wrapper implementation).

I therefore looked around for alternative solutions and found this suggestion on Stack Overflow. I have expanded upon this idea and it seems to be working. After some thorough debugging and benchmarking, I have made a simple solution available in this repo, such that at least one simple solution for GA exists in TF2.

Proposed solution:

The idea is extremely simple. Overload the train_step method of the tf.keras.Model and add gradient accumulation support there. In the end, I have produced a simple model wrapper, which does that for you, which you ideally should be able to apply to any tf.keras.Model to enable gradient accumulation, like so:

model = tf.keras.Model(...)
model = GAModelWrapper(n_gradients=k, inputs=model.input, outputs=model.output)

However, there is definitely some work left to be done to make it handle all scenarios, but it seems to be working fine on the use cases I have tested until now.

So what remains?

Currently, I am unsure whether this is the best approach. Perhaps there is a better way of solving this. A challenge might be to get distributed training working with multiple GPUs. I believe that was the biggest obstacle with the optimizer wrapper solution.

Are there any devs working on adding gradient accumulation support in Keras?

Are you willing to contribute it: Yes

sushreebarsa commented 2 years ago

@andreped Thank you for reporting this issue! Could you please specify the use cases for this feature. Thank you!

andreped commented 2 years ago

Could you please specify the use cases for this feature.

What do you mean by "use cases"? Do you mean scenarios in which having a simple way to perform gradient accumulation would be beneficial?

Are you familiar with the concept of using gradient accumulation to "artificially" increase batch size while holding memory usage fixed? Essentially splitting a batch into smaller micro-batches, calculating the gradients for each, before averaging across, without having the entire batch in memory. It is a generic concept for "approximating" batch training - that is the use case. Or am I misunderstanding something?
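
The micro-batch idea described above can be checked on a toy problem. The following is a minimal sketch, assuming a one-parameter linear model with a mean squared error loss and equally sized micro-batches; the function name `mse_gradient` and the data are made up for illustration and are not part of Keras.

```python
def mse_gradient(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


# Hypothetical toy data for the model y_hat = w * x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]
w = 0.5

# Gradient computed on the whole batch of 8 samples at once.
full_batch_grad = mse_gradient(w, xs, ys)

# The same batch processed as 4 micro-batches of 2 samples each.
micro_batch_size = 2
micro_grads = [
    mse_gradient(w, xs[i:i + micro_batch_size], ys[i:i + micro_batch_size])
    for i in range(0, len(xs), micro_batch_size)
]

# Averaging the micro-batch gradients recovers the full-batch gradient
# (up to floating point error), while only 2 samples are needed at a time.
accumulated_grad = sum(micro_grads) / len(micro_grads)
print(full_batch_grad, accumulated_grad)
```

The equality only holds exactly when the micro-batches have the same size and the loss is a mean over samples; layers that depend on batch statistics break it.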

bhack commented 2 years ago

/cc @georgepaw if he is interested in the design as Graphcore has an API for this in: https://github.com/graphcore/tensorflow/blob/r2.5/sdk-release-2.5/tensorflow/python/ipu/optimizers/gradient_accumulation_optimizer.py

chenmoneygithub commented 2 years ago

@andreped Thanks for reporting the issue!

Currently users would need to write their own custom training loop to handle the gradient accumulation, which is not too hard, so we have not yet made this an API. I would like to understand more here - do you see gradient accumulation widely used? If it's a popular feature, we will design an API for that.

innat commented 2 years ago

@chenmoneygithub

Currently users would need to write their own custom training loop to handle the gradient accumulation, which is not too hard, so we have not yet made this an API.

Agree. It's doable to implement it in a custom training loop. But at the same time, it would be feasible to have an API to do this with model.fit. Implementing a custom loop just to have gradient accumulation is cumbersome (IMHO).

I would like to understand more here - do you see gradient accumulation widely used?

Yes, it's widely used when it's required. It's one of the techniques to enable larger batch training with limited computational resources. FYI, it's mentioned in pytorch-lightning as one of the effective training techniques.

andreped commented 2 years ago

Currently users would need to write their own custom training loop to handle the gradient accumulation

Actually, you don't even need to write your own custom training loops anymore in TF2. It is much easier to add support for it by overloading the train_step method. An example of how I did it can be seen here: https://github.com/andreped/GradientAccumulator/blob/main/GradientAccumulator/GAModelWrapper.py#L14
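
The control flow of such an overloaded training step can be sketched without any framework. This is a minimal sketch under stated assumptions: `AccumulatingTrainer` and its `train_step` are hypothetical stand-ins written in plain Python, not the Keras `Model.train_step` API and not the code in the linked repository. It only shows the "accumulate for n calls, then apply once" logic.

```python
class AccumulatingTrainer:
    """Hypothetical stand-in for a model whose train step accumulates gradients."""

    def __init__(self, weight, learning_rate, accum_steps):
        self.weight = weight
        self.learning_rate = learning_rate
        self.accum_steps = accum_steps
        self.accumulated = 0.0  # running mean contribution of micro-batch gradients
        self.step_count = 0     # micro-batches seen since the last weight update

    def train_step(self, xs, ys):
        # Gradient of the mean squared error for the toy model y_hat = weight * x.
        grad = sum(2.0 * (self.weight * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        # Scale by 1 / accum_steps so the accumulated value is a mean, not a sum.
        self.accumulated += grad / self.accum_steps
        self.step_count += 1
        if self.step_count == self.accum_steps:
            # Apply one plain SGD update, then reset the accumulator.
            self.weight -= self.learning_rate * self.accumulated
            self.accumulated = 0.0
            self.step_count = 0
            return True   # weights were updated on this call
        return False      # gradients were only accumulated


trainer = AccumulatingTrainer(weight=0.0, learning_rate=0.01, accum_steps=4)
data = [([1.0, 2.0], [2.0, 4.0])] * 8  # 8 identical micro-batches
updates = [trainer.train_step(xs, ys) for xs, ys in data]
print(updates, trainer.weight)
```

With `accum_steps=4`, only every fourth call changes the weight, so 8 micro-batches produce 2 optimizer updates.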

However, it is definitely a very commonly used method and it will surely be a popular feature to add. Perhaps having an API to do just what I did there, is a good idea? Not sure.

Note that it is important that this works in multi-GPU strategies, as that is one of its core usages. That is not something I have explored that much myself, but that is a popular use case for it.

andreped commented 2 years ago

Also note that if you introduce gradient accumulation naively, like I did above, then some layers will not be directly compatible. You will have suboptimal behaviour with BatchNormalization, for instance, as it will update for every single micro-batch and not when the gradient accumulation is done for a given mini-batch. Has anyone made an attempt to fix BN for this use case? @innat @dathudeptrai

Hence, you might lose the effect of using gradient accumulation if you use BN in your model, which is an extremely popular layer. It might therefore be a good idea to solve that issue simultaneously.
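
The mismatch can be made concrete with a small numeric sketch. It assumes a single feature and hypothetical activation values chosen so the effect is easy to see; it does not use the Keras BatchNormalization layer itself, only the batch statistics such a layer would compute.

```python
def mean(values):
    return sum(values) / len(values)


def variance(values):
    """Population variance, as used for batch statistics."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)


# Hypothetical activations of one feature for a full batch of 8 samples.
activations = [0.0, 1.0, 4.0, 5.0, 8.0, 9.0, 12.0, 13.0]

# The same samples split into 4 micro-batches of 2.
micro_batches = [activations[i:i + 2] for i in range(0, len(activations), 2)]

# Statistics a normalization layer would see for the full batch ...
full_variance = variance(activations)

# ... versus what it sees when called once per micro-batch.
micro_variances = [variance(mb) for mb in micro_batches]

print(full_variance)    # variance over all 8 samples
print(micro_variances)  # one variance per micro-batch
```

Each micro-batch is normalized with its own, much smaller variance, and the moving statistics are updated four times instead of once, which is the behaviour described above.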

The issue with BN in GA has been thoroughly discussed for PyTorch: https://forums.fast.ai/t/accumulating-gradients/33219/42

Attempts have been made, but as you can see it is not so easy to get it working properly: https://forums.fast.ai/t/accumulating-gradients/33219/62

Also note that it appears common to just SUM the gradients in gradient accumulation instead of doing MEAN reduction. I think the latter makes more sense, but there might be situations where SUM reduction is more correct. Not sure. This is discussed in the same thread as mentioned above.
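
For plain SGD the two reductions differ only by a constant factor, which can be absorbed into the learning rate. A minimal sketch with made-up gradient values (this says nothing about adaptive optimizers such as Adam, where the relationship is less direct):

```python
# Hypothetical gradients of one parameter from 4 micro-batches.
micro_batch_grads = [0.8, 1.2, 1.0, 1.4]
accum_steps = len(micro_batch_grads)
learning_rate = 0.1

sum_reduced = sum(micro_batch_grads)
mean_reduced = sum_reduced / accum_steps

# SGD step with MEAN reduction and the original learning rate ...
step_mean = learning_rate * mean_reduced

# ... equals the SGD step with SUM reduction and a learning rate divided by accum_steps.
step_sum_rescaled = (learning_rate / accum_steps) * sum_reduced

print(step_mean, step_sum_rescaled)
```

In other words, using SUM without adjusting the learning rate behaves like MEAN with a learning rate that is `accum_steps` times larger.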

Lastly, note this comment by @tomerk regarding how GA should be implemented in Keras (which might be a better idea than what I did, not sure): https://github.com/tensorflow/addons/issues/2260#issuecomment-747711685

Hope it helps!

georgepaw commented 2 years ago

Hey, I just wanted to throw in some personal experience with working on gradient accumulation in TF/Keras at Graphcore for IPUs.

  1. Batch Norm - for the MLPerf submission distributed batch norm is used to calculate the statistics over a bigger batch. For example if we have 64 replicas, each running with batch size 128, we could simulate batch size 256 by exchanging stats between each pair of replicas.
  2. Accumulation method - we've implemented three methods - we often experiment with lower precision formats, for example using fp16 for the gradient accumulation tensors.
     a) sum - might overflow in fp16.
     b) mean - feels the most natural. If the gradients are normalised before accumulation they might underflow. If they are normalised after accumulation they might overflow.
     c) running mean - we've found this to be more stable for lower precision formats:

    accumulated_gradient = zeros()
    for step in range(gradient_accumulation_factor):
        micro_batch_gradients = ...
        accumulated_gradient = ((step / (step + 1)) * accumulated_gradient) + ((1 / (step + 1)) * micro_batch_gradients)
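
The running-mean update above can be run on scalars to confirm that it ends at the ordinary arithmetic mean. This is a minimal sketch in plain Python floats with made-up gradient values; it does not model fp16 behaviour, only the arithmetic of the update rule.

```python
def running_mean(micro_batch_gradients):
    """Running mean using the update rule from the pseudocode above."""
    accumulated = 0.0
    for step, grad in enumerate(micro_batch_gradients):
        accumulated = (step / (step + 1)) * accumulated + (1 / (step + 1)) * grad
    return accumulated


grads = [0.5, -1.5, 2.0, 3.0]  # hypothetical micro-batch gradients
running = running_mean(grads)
plain = sum(grads) / len(grads)
print(running, plain)
```

Every intermediate value of `accumulated` is itself a mean of the gradients seen so far, so it stays within the range of the individual gradients, unlike a raw sum.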

sokrypton commented 2 years ago

Thanks for tagging me! I'll take a look. For context: I've implemented a version that works with keras and model.fit() here: https://github.com/sokrypton/AccAdam_TF2

chenmoneygithub commented 2 years ago

Thanks all for the great discussion!

@andreped Thanks for raising the BN issue, yes, it's something we should support. Actually I am curious about the performance loss if we don't handle the accumulation for the BN layer - in your experiments, was there a big performance loss caused by the suboptimal treatment of BN?

For how to handle accumulation in distributed training, I believe the current Mirrored strategy can handle both SUM and MEAN. I need to double check with our distributed training experts on 1) if GPU distributed training could support aggregating over sub-batches and also across devices. 2) if this is supported in TPUStrategy.

andreped commented 2 years ago

Actually I am curious about the performance loss if we don't handle the accumulation for the BN layer

@chenmoneygithub I have not performed a rigorous test to benchmark with and without BN under GA, but intuition tells us that BN would update too frequently and would introduce noise to training. I have also observed this myself, where the benefit of increased batch size through GA is lost when BN is used in the model. But that is surely task and data dependent.

It has been suggested to change the default momentum hyperparameter based on the number of accumulations. Essentially, by reducing the momentum, the too-frequent updates of BN would introduce less noise, and should therefore result in smoother behaviour. However, I have not seen an actual benchmark on this topic before, nor am I aware of best practice. Perhaps anyone else knows?
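
One way such an adjustment could look is sketched below. The rule `momentum ** (1 / accum_steps)` is an assumption made for illustration, not something proposed in this thread or benchmarked anywhere here; `adjusted_momentum` and `update_moving` are hypothetical helper names. Note that conventions differ: in Keras's BatchNormalization the `momentum` argument weights the old moving average, so slower updates mean a value closer to 1, whereas in PyTorch's convention it weights the new batch statistic, so slower updates mean a smaller value. The sketch uses the Keras convention.

```python
def adjusted_momentum(momentum, accum_steps):
    """Momentum such that accum_steps updates decay the old average as much as one update."""
    return momentum ** (1.0 / accum_steps)


def update_moving(moving, batch_stat, momentum):
    """Moving-average update in the Keras convention (momentum weights the old value)."""
    return momentum * moving + (1.0 - momentum) * batch_stat


base_momentum = 0.99
accum_steps = 4
ga_momentum = adjusted_momentum(base_momentum, accum_steps)

# One update per full batch with the base momentum ...
moving_once = update_moving(0.0, 1.0, base_momentum)

# ... versus one update per micro-batch with the adjusted momentum,
# assuming every micro-batch reports the same statistic.
moving_ga = 0.0
for _ in range(accum_steps):
    moving_ga = update_moving(moving_ga, 1.0, ga_momentum)

print(ga_momentum, moving_once, moving_ga)
```

This only equalises how fast the moving statistics change; the per-micro-batch normalization during training is unaffected.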

But it would surely be better to be able to accumulate the parameters in BN similarly as done for the gradients when using GA, instead of playing around with BN parameters.


EDIT: Also, this study might be a good read for anyone interested in this topic: https://arxiv.org/pdf/2110.12484.pdf

They also propose a modification to BN to work better with GA.


It can also be mentioned that this implementation for synchronization of BN exists: https://www.tensorflow.org/api_docs/python/tf/keras/layers/experimental/SyncBatchNormalization

It is primarily made for multi-GPU training, but I guess it could also be used on a single GPU in a GA scenario? Not sure.

andreped commented 2 years ago

Just mentioning that I have a stable implementation for gradient accumulation now, as a temporary solution until Keras adds a proper method for it: https://github.com/andreped/GradientAccumulator

What is lacking is multi-GPU support. However, for my use case it is not that critical. My main use case is to artificially increase batch size using a single GPU.

meliksahturker commented 1 year ago

I came across here looking for gradient accumulation where I will train using: 1- Multiple GPUs. 2- FP16 3- Functional API.

My specific case is to train large language models (e.g., BART/PEGASUS) without TPUs or 100s of GPUs. In order to match the batch_size = 8000 mentioned in these papers, GA is a must. BART mentions (in Section 5, sentence 1) that training language models with very large batch sizes improves performance, so I second that the keras fit() function should support this natively.
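
The number of accumulation steps needed for such a target follows from simple arithmetic. In this sketch only the target of 8000 comes from the comment above; the per-GPU micro-batch size of 8 and the 4 GPUs are hypothetical values, and `accumulation_steps` is a helper name invented here.

```python
import math


def accumulation_steps(target_batch_size, per_replica_batch_size, num_replicas=1):
    """Number of micro-batch steps needed to reach at least the target batch size."""
    samples_per_step = per_replica_batch_size * num_replicas
    return math.ceil(target_batch_size / samples_per_step)


# Hypothetical setup: 4 GPUs, each fitting a micro-batch of 8 samples.
per_replica_batch_size = 8
num_replicas = 4

steps = accumulation_steps(8000, per_replica_batch_size, num_replicas)
effective_batch_size = steps * per_replica_batch_size * num_replicas
print(steps, effective_batch_size)
```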

andreped commented 1 year ago

@meliksahturker Have you tested the tool I mentioned above? I have not added multi-GPU support yet, but all other stuff you mention should work. Can try to add multi-GPU support tomorrow, if you'd like.

At least using a single GPU, you can artificially increase the batch size to whichever size you want. But for such a large batch size be sure to use the right optimizer.

But yeah, about time GA was added to Keras.

meliksahturker commented 1 year ago

I have looked into it but seeing that it does not support multi-GPUs, I haven't tested your tool.

I thought of trying this, which has an example that seems to work in a multi-GPU setting, but it seems you are aware of its existence and developed your tool despite that. So my question is, have you experienced an issue with fsx950223's implementation?

Moreover, I have seen some mention of gradient accumulation causing issues with the batch normalization layer (especially in the presence of FP16). Have you looked into that, too?

Thanks for developing a tool for GA, btw.

andreped commented 1 year ago

@meliksahturker If you follow the commit history of the tool you will see that I used the code you mentioned as baseline. However, I did not reach the same results as regular batch training, which is why I also went with a different approach.

In TF 2.2, it was made possible to overload the 'train_step' method of the Model class. This enabled me to trivially add GA support as well as have full control on what happened with it when mixed precision was added. This is a much simpler solution than re-doing the optimizer itself, which proved to be very challenging.

Adding multi-GPU support should be easy. I know what it takes. I just rarely use multi-GPU setups myself and thus have not had the time to add it. But I know of someone who has been successful in a project, so I will consult with him.

However I have benchmarked my tool and I achieve approximately the same results as regular batch training (as close as it gets with expected deviations due to floating point errors). So I believe it is working. I also run unit tests to check exactly this for each new update, and have run several benchmarks which I intend to make public when I get the time.

BN is not compatible with GA. It requires you to modify how and when it updates, which in Keras is not trivial. However, for some of my use cases it has worked fine to use batch size 8 and accum steps 4, essentially boosting overall batch size to 32. I have also added support for adaptive gradient clipping as a surrogate for BN in GA, but I have yet to see much benefit from it compared to using BN with GA. Might need to tune some params in AGC, which you can do through the tool.
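
For readers unfamiliar with adaptive gradient clipping, the core idea is to limit the gradient norm relative to the weight norm. The following is a simplified sketch, assuming the whole tensor is treated as one unit (the original formulation clips unit-wise); `adaptive_clip`, `clip_factor` and `eps` are names chosen here and do not reflect the tool's actual implementation or defaults.

```python
import math


def l2_norm(values):
    return math.sqrt(sum(v * v for v in values))


def adaptive_clip(gradient, weights, clip_factor=0.01, eps=1e-3):
    """Rescale the gradient if its norm exceeds clip_factor times the weight norm."""
    weight_norm = max(l2_norm(weights), eps)  # eps guards against zero-initialised weights
    grad_norm = l2_norm(gradient)
    max_norm = clip_factor * weight_norm
    if grad_norm > max_norm:
        scale = max_norm / grad_norm
        return [g * scale for g in gradient]
    return list(gradient)


weights = [3.0, 4.0]          # norm 5.0, so the allowed gradient norm is 0.05
large_grad = [6.0, 8.0]       # norm 10.0, gets clipped
small_grad = [0.003, 0.004]   # norm 0.005, left unchanged

clipped = adaptive_clip(large_grad, weights)
untouched = adaptive_clip(small_grad, weights)
print(clipped, untouched)
```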


EDIT: It is a fundamental issue with using BN with GA. With or without mixed precision. They are not compatible. What I have observed are NaNs, especially using Adam. I solved this by lowering the learning rate and/or adjusting the epsilon of Adam from 1e-6 to 1e-3 (or something similar). But if you know of someone who has seen other issues with BN with GA, let me know. I'm currently making a benchmark and could add some more experiments, if of interest :)

meliksahturker commented 1 year ago

This is great insight, especially regarding testing fsx950223's implementation thoroughly!

Since LMs are Transformer-based, which use BN heavily, I think I will skip GA for now. The issue with BN and GA makes it even more crucial for this feature to be added to fit() as a complete and noob-friendly parameter, e.g. grad_accumulation_steps = 4.

andreped commented 1 year ago

Definitely. It should just be an argument to set in Model.fit() or similar. GA is already available in pytorch-lightning, and others have added support for GA in their framework. About time Keras does the same.

But note that making BN compatible with GA is definitely not easy. There were some people in the PyTorch forum who tried, but I have yet to see a working solution. Not tempted to try myself, but perhaps I will have to have a go soon. I also actively use BN, so I have the same problem as you.

A temporary fix could be to adjust the momentum term in all BN layers, which should make it more robust with GA. If I were you I would at least try that :) Let me know how it goes. Always happy to contribute!

innat commented 1 year ago

Hello @chenmoneygithub , any update on this issue? Thanks.

andreped commented 1 year ago

But note that making BN compatible with GA is definitely not easy.

If anyone is interested, I have now implemented a custom AccumBatchNormalization() layer which can be used as a direct drop-in replacement in gradient accumulation scenarios. For more information, see here or the dedicated documentation.
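
The general principle behind accumulating normalization statistics can be sketched as follows. This is an assumption about the idea only, not the actual implementation of AccumBatchNormalization(): `accumulate_moments` is a hypothetical helper that collects the sample count, sum and sum of squares over micro-batches and derives the full-batch mean and variance from them.

```python
def accumulate_moments(micro_batches):
    """Full-batch mean and variance computed from per-micro-batch sums."""
    count, total, total_sq = 0, 0.0, 0.0
    for micro_batch in micro_batches:
        count += len(micro_batch)
        total += sum(micro_batch)
        total_sq += sum(v * v for v in micro_batch)
    mean = total / count
    variance = total_sq / count - mean ** 2  # E[x^2] - E[x]^2
    return mean, variance


# Hypothetical activations of one feature, split into 4 micro-batches of 2.
activations = [0.0, 1.0, 4.0, 5.0, 8.0, 9.0, 12.0, 13.0]
micro_batches = [activations[i:i + 2] for i in range(0, len(activations), 2)]

acc_mean, acc_var = accumulate_moments(micro_batches)

# Reference values computed directly on the full batch.
full_mean = sum(activations) / len(activations)
full_var = sum((v - full_mean) ** 2 for v in activations) / len(activations)
print(acc_mean, acc_var, full_mean, full_var)
```

The accumulated moments match the full-batch statistics, so the moving averages can be updated once per effective batch instead of once per micro-batch.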

Model wrapping also now reaches sufficiently comparable results to regular batch training, benchmarked on various scenarios. To get started:

pip install gradient-accumulator

What still remains is multi-GPU support in a seamless manner. Any update regarding this, @chenmoneygithub?

grasskin commented 1 year ago

Hi @andreped, thank you for all your work on this - would you be interested in contributing a PR adding gradient accumulation as a parameter to fit? Multi-GPU support should be easier from Keras internally. Current new development work is being focused on the new multibackend Keras (https://keras.io/keras_core/) as this will supersede current Keras as Keras 3. Regardless of whether GA ends up implemented in TF-Keras we should strongly consider adding it to the next version of Keras via community contribution/when we have the internal bandwidth.

andreped commented 1 year ago

would you be interested in contributing a PR adding gradient accumulation as a parameter to fit?

Hello, @grasskin! Just came back from vacation. Sorry for not replying earlier. I'm for sure interested in contributing.

To summarize for you, and whoever might be interested, in the gradient-accumulator package, I have developed the following solutions:

Note that regarding multi-GPU training, I have not yet had the time to properly benchmark this, as I do not myself use multiple GPUs in my research simultaneously, and I am finalizing my PhD so little time for open-source funstuff.

@grasskin Regarding adding GA support to model.fit directly, I am not sure that is a good idea. There was an attempt made to do this for tf-keras, but there was an issue in tensorflow preventing this from working in a distributed setting. However, by wrapping the optimizer instead, this seemed to work as intended. So perhaps wrapping the optimizer, or adding support for GA there directly, should be the way to go. What do you think?

I can also add GA support to the BN layer and where appropriate, but it would be good if we could discuss this further. Could you contact me per e-mail: andrped94@gmail.com

IvanUkhov commented 7 months ago

Perhaps something like this can be used as a workaround:

https://blog.ivanukhov.com/2024/01/31/gradient-accumulation.html

andreped commented 7 months ago

Perhaps something like this can be used as a workaround:

@IvanUkhov This only addresses the Adam optimizer. If you want to wrap "any" optimizer, use this instead. I have made a simple example below on how to do this:

First install gradient-accumulator by:

$ pip install gradient-accumulator

Then you can add GA support to any optimizer by wrapping it:

import tensorflow as tf
from gradient_accumulator import GradientAccumulateOptimizer

# some optimizer here
opt = tf.keras.optimizers.Adam(1e-3)

# wrap optimizer to add GA support, here increasing the batch size by 4 through gradient accumulation
opt = GradientAccumulateOptimizer(accum_steps=4, optimizer=opt)

Then you can do what you normally would.
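
What an optimizer wrapper of this kind does conceptually can be sketched in plain Python. The classes below are hypothetical stand-ins written for illustration; they are neither Keras optimizers nor the GradientAccumulateOptimizer implementation, and variables are modelled as small dictionaries rather than tensors.

```python
class SGD:
    """Minimal stand-in for an inner optimizer; not a Keras class."""

    def __init__(self, learning_rate):
        self.learning_rate = learning_rate

    def apply_gradients(self, grads_and_vars):
        for grad, var in grads_and_vars:
            var["value"] -= self.learning_rate * grad


class AccumulatingOptimizer:
    """Hypothetical wrapper: forwards averaged gradients every accum_steps calls."""

    def __init__(self, optimizer, accum_steps):
        self.optimizer = optimizer
        self.accum_steps = accum_steps
        self.sums = {}   # accumulated gradient per variable, keyed by id(var)
        self.calls = 0   # number of apply_gradients calls so far

    def apply_gradients(self, grads_and_vars):
        grads_and_vars = list(grads_and_vars)
        for grad, var in grads_and_vars:
            self.sums[id(var)] = self.sums.get(id(var), 0.0) + grad
        self.calls += 1
        if self.calls % self.accum_steps == 0:
            # Hand the mean gradient to the wrapped optimizer, then reset.
            averaged = [(self.sums[id(var)] / self.accum_steps, var)
                        for _, var in grads_and_vars]
            self.optimizer.apply_gradients(averaged)
            self.sums.clear()


w = {"value": 1.0}
opt = AccumulatingOptimizer(SGD(learning_rate=0.1), accum_steps=4)
history = []
for grad in [1.0, 2.0, 3.0, 2.0]:  # hypothetical micro-batch gradients
    opt.apply_gradients([(grad, w)])
    history.append(w["value"])
print(history)
```

Because the wrapper only changes when and with what gradient the inner optimizer is called, the same pattern applies to any inner optimizer that exposes an `apply_gradients`-style method.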

If you don't want to wrap the optimizer itself, you can also wrap the model. You can see how to do that from the link above.

There is also a thorough documentation on how to add gradient accumulation to an existing model pipeline here: https://gradientaccumulator.readthedocs.io/en/latest/background/gradient_accumulation.html


NOTE: In Keras 3, they seem to have added a solution for gradient accumulation now: https://github.com/keras-team/keras/pull/18948#issuecomment-1858991628

It is a PoC and not feature complete, but a step in the right direction :]

IvanUkhov commented 7 months ago

The approach was illustrated on Adam, but it is not limited to that. The code is very general. It is a matter of replacing which class is inherited from.

The optimizer wrapper in GradientAccumulator, on the other hand, adds a lot of fragile complexity. It does not play well with the latest version of TensorFlow where the internals of optimizers have been redesigned. Moreover, it is limited to SGD according to the documentation. Please correct me if that is not the case.

https://gradientaccumulator.readthedocs.io/en/latest/examples/distributed_training.html

andreped commented 7 months ago

It does not play well with the latest version of TensorFlow where the internals of optimizers have been redesigned. Moreover, it is limited to SGD according to the documentation. Please correct me if that is not the case.

That's not the case. We have not upgraded the code itself to work with the latest 2.x versions, but I believe the Legacy Optimizer class is still there, which makes it work also for newer versions of TensorFlow. The implementation you showed is similar to what I made once in TF1, and that worked fine for Adam, but even then I had an optimizer wrapper that worked well for SGD, Adam, and other optimizers I normally used: https://github.com/AICAN-Research/H2G-Net/blob/main/src/utils/accum_optimizers.py#L11

The optimizer wrapper implementation in gradient-accumulator should work with "any" commonly used optimizer, Adam included. But it is likely a better option to extend the optimizer wrapper, if you want to support more obscure optimizers.

What you are referring to is doing gradient accumulation in a distributed setting. I do not think there is any implementation in TF2 that works correctly in this scenario. Maybe Hugging Face has one, but the best is just to move to Keras 3 if that is of interest, IMO.

The implementation you showed does not work correctly in a distributed setting. You need more logic to handle replicas. That is not easy to solve, which is why there does not seem to be a solution that does this properly. It is also not possible to do this in a model wrapper setting, as the synchronization from replicas results in an error. The optimizer wrapping is the only solution. Optimizer wrapping is actually what Keras 3 does as well.

But if you attempt to do GA where you split the model in two, across two GPUs, the optimizer wrapping solution should work. But if you want to split the batch across four GPUs, it does not. There is some logic missing and not trivial to solve, IMO.
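
What a correct result would look like when the batch is split across replicas can be simulated numerically. This is a conceptual sketch only: replicas are modelled as plain Python loops over contiguous data shards, the model and data are made up, and nothing here reflects how tf.distribute actually schedules or synchronizes work.

```python
def mse_gradient(w, samples):
    """Gradient of the mean squared error of y_hat = w * x over (x, y) pairs."""
    return sum(2.0 * (w * x - y) * x for x, y in samples) / len(samples)


# Hypothetical data: 16 samples, 2 replicas, 4 accumulation steps, micro-batch of 2.
samples = [(float(i), 2.0 * i + 0.5) for i in range(1, 17)]
w = 0.3
num_replicas, accum_steps, micro_batch_size = 2, 4, 2

# Reference: gradient over the whole effective batch.
full_grad = mse_gradient(w, samples)

# Each simulated replica accumulates the mean gradient of its own shard.
shard_size = len(samples) // num_replicas
replica_grads = []
for replica in range(num_replicas):
    shard = samples[replica * shard_size:(replica + 1) * shard_size]
    local = 0.0
    for step in range(accum_steps):
        micro_batch = shard[step * micro_batch_size:(step + 1) * micro_batch_size]
        local += mse_gradient(w, micro_batch) / accum_steps
    replica_grads.append(local)

# One cross-replica mean after accumulation stands in for the synchronization step.
synced_grad = sum(replica_grads) / num_replicas
print(full_grad, synced_grad)
```

The target for any distributed GA implementation is that the synchronized gradient equals the gradient of the full effective batch, here 2 replicas x 4 steps x 2 samples = 16 samples.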

IvanUkhov commented 7 months ago

What you are referring to is doing gradient accumulation in a distributed setting. I do not think there is any implementation in TF2 that works correctly in this scenario. Maybe Hugging Face has one, but the best is just to move to Keras 3 if that is of interest, IMO.

Yes, that is my use case. Distributed training on several GPUs with the mirrored strategy.

The implementation you showed does not work correctly in a distributed setting. You need more logic to handle replicas. That is not easy to solve, which is why there does not seem to be a solution that does this properly. It is also not possible to do this in a model wrapper setting, as the synchronization from replicas results in an error. The optimizer wrapping is the only solution. Optimizer wrapping is actually what Keras 3 does as well.

That logic resides in the optimizer one inherits from, which was Adam in the example given. They are all distributed-training aware. The implementation works well. Do you have any specific concerns?

andreped commented 7 months ago

That logic resides in the optimizer one inherits from, which was Adam in the example given. They are all distributed-training aware. The implementation works well. Do you have any specific concerns?

Well, that's very interesting. It could be that we should try not to override most of the new Optimizer class for newer TF versions. That way, this might just magically work. My issue is that I have gotten it to work mechanically before, using multiple GPUs in a distributed setting, but the main problem is that I do not get the expected GPU memory use on each GPU, and I fail to see the same benefit of increasing the batch size by 4 by splitting across 4 GPUs as I see for a single-GPU sequential setup.

But if you have managed to get it working for Adam, we could try to run benchmarks in Kaggle (2 GPUs available for free). And if that works, we could just rewrite the Optimizer wrapper to use the new Optimizer class, which may require a lot less code :]

Are you interested in posting that issue here: https://github.com/andreped/GradientAccumulator/issues

And then we could maybe make a PR for it, @IvanUkhov, if we find that your solution works for newer TF versions.