In Chapter 6 - Summarization Training - Fine Tuning, I tried everything I could think of to make the PEGASUS retrain work. We start with google/pegasus-cnn_dailymail, then add the Samsung Summaries dataset, then train on it.

Many many configurations and techniques failed to avert the "CUDA error: out of memory" error. What isn't apparent, however, is that there are some highly effective memory saving steps that you can take just by adding some training arguments.

Also what isn't obvious, is that when a model trains, it is highly likely that it will expand to a huge size very quickly and then continue to grab more GPU memory, so all the steps you might take to setup the memory in preparation for train() are of no use, unless you've setup training arguments to not blow up.

Another thing that's not obvious is that many models 'almost fit', or 'just barely fit', and minor adjustments can actually make them fit and train. With other models, the complexity of the model and its training demands more invasive changes.

NOTE: the reason models blow up in size is typically because size is being balanced with speed. I believe that every single step below makes training models slower. In some cases you'd want to compare with training on the CPU, especially if you have a lot of cores and huge RAM.

The other more pernicious potential issue is models that perform poorly or "not as well as they ought to".

This is all just to say that, models that train are better than models that don't train at all, but maybe not by much.

PYTORCH_CUDA_ALLOC_CONF

The change that's actually suggested right in the error message is modifying PYTORCH_CUDA_ALLOC_CONF. What's not necessarily obvious though is that this value should be set as low as possible, because it's impossible to predict the size of the thing that Torch is going to try to allocate, and if it's smaller than PYTORCH_CUDA_ALLOC_CONF, it will fail. I've even seen things like 16M objects crash a training, which, so sad, you can't do anything about because PYTORCH_CUDA_ALLOC_CONF minimum is 20M.

Anyway - os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:24"

Training Args

batch_size

Next most commonly recommended, and most commonly effective, is batch size. What isn't obvious is how big each batch is, all things included. Where the initial batch size is greater than 2, you can often just iteratively dial it down until your train() runs.

batch_size=4 (or whatever is less than what you started with - if desperate then 2, then 1) (default is 8)

When you hit 1 you are totally out of luck. Unless of course you're reading this guide, in which case - follow me...

gradient_accumulation_steps

Even in the lifesaving HF guide to "How to Fit a Bigger Model", they describe this as follows:

The idea behind gradient accumulation is to instead of calculating the gradients for the whole batch at once to do it in smaller steps, by calculating the gradients iteratively in smaller batches by doing a forward and backward pass through the model and accumulating the gradients in the process. When enough gradients are accumulated we run the model’s optimization step. This way we can easily increase the overall batch size to numbers that would never fit into the GPU’s memory.

Okay, yes, correct, but: this also works when no batch size, including 1, will train(). In other words, this can actually be a lifesaver when you have no more batch_size reductions available.

Of course it's also a good idea for tuning your batch sizes if that's something you have the luxury of focusing on.

It may not be obvious, but it's a larger value we want here.

gradient_accumulation_steps=64 (default is 1)

gradient_checkpointing

In order to compute the gradients during the backward pass all activations from the forward pass are normally saved. This can create a big memory overhead. Alternatively, one could forget all activations during the forward pass and recompute them on demand during the backward pass. This would however add a significant computational overhead and slow down training.

Gradient checkpointing strikes a compromise between the two approaches and saves strategically selected activations throughout the computational graph so only a fraction of the activations need to be re-computed for the gradients. See this great article explaining the ideas behind gradient checkpointing.

Fine, whatever - gradient_checkpointing=True (default is False)

optimizer

The most common optimizer used to train transformer model is Adam or AdamW (Adam with weight decay). Adam achieves good convergence by storing the rolling average of the previous gradients which, however, adds an additional memory footprint of the order of the number of model parameters. One remedy to this is to use an alternative optimizer such as Adafactor.

Instead of keeping the rolling average for each element in the weight matrices Adafactor only stores aggregated information (row- and column-wise sums of the rolling averages) which reduces the footprint considerably. One downside of Adafactor is that in some instances convergence can be slower than Adam’s so some experimentation is advised here. We can use Adafactor simply by setting optim="adafactor":

This one seems to have been recently modified - I had to upgrade my transformers and accelerate in order for tha argument to be accepted

optim="adafactor" (default is adamw_hf)

trainingArgs = TrainingArguments(output_dir="/home/user/z_Data/HF/z_Cache/pegasus_samsum",
                                num_train_epochs=1, 
                                 warmup_steps=500,
                                per_device_train_batch_size=1,
                                per_device_eval_batch_size=1,
                                weight_decay=0.01,
                                logging_steps=10,
                                push_to_hub=False,
                                evaluation_strategy='steps',
                                eval_steps=500,
                                save_steps=1e6,
                                 optim="adafactor",
                                 gradient_checkpointing=True,
                                gradient_accumulation_steps=64)

Finally, the model trained! (slowly)

Resources:

https://huggingface.co/docs/transformers/v4.19.2/en/performance

nlp-with-transformers / notebooks

CUDA out of memory when training "google/pegasus-cnn_dailymail" on "samsum" #106

Information

Describe the bug

To Reproduce

Expected behavior