huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

CUDA Out of Memory After Several Epochs #10113

Closed liyucheng09 closed 3 years ago

liyucheng09 commented 3 years ago

Environment info

Who can help

Information

Model I am using (Bert, XLNet ...): GPT2

The problem arises when using:

The tasks I am working on are:

To reproduce

The strange thing is that the script runs fine for the first 12 epochs, then ends with this error in the middle of epoch 12. I have checked that the Trainer doesn't cache the training loss tensor, so I am quite puzzled by the error. Any help is highly appreciated.

Steps to reproduce the behavior:

  1. python run_clm.py config.json

Several of the relevant settings in config.json are listed below; a sketch of how they map onto TrainingArguments follows the list:

block_size: 512
check_point_name: "gpt2_result/checkpoint-100000"
per_device_train_batch_size: 12
learning_rate: 0.00005
weight_decay: 0
adam_beta1: 0.9
adam_beta2: 0.98
adam_epsilon: 1e-8
max_grad_norm: 1
num_train_epochs: 50
max_steps: -1
warmup_steps: 0
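
For reference, a minimal sketch of how these values map onto TrainingArguments (output_dir is my assumption, inferred from the checkpoint path; block_size and check_point_name are handled by the script's own argument dataclasses, not by TrainingArguments):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="gpt2_result",        # assumed from check_point_name above
    per_device_train_batch_size=12,
    learning_rate=5e-5,
    weight_decay=0.0,
    adam_beta1=0.9,
    adam_beta2=0.98,
    adam_epsilon=1e-8,
    max_grad_norm=1.0,
    num_train_epochs=50,
    max_steps=-1,
    warmup_steps=0,
)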

The model config is:

Model config GPT2Config {
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "gradient_checkpointing": false,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 512,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 512,
  "resid_pdrop": 0.1,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "use_cache": true,
  "vocab_size": 21128
}

The tokenizer used is BertTokenizer.from_pretrained('Bert-base-chinese').
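
For reference, the vocab_size of 21128 in the config above is exactly the bert-base-chinese vocabulary size; a hedged sketch of the pairing (not my exact setup code):

from transformers import BertTokenizer, GPT2Config, GPT2LMHeadModel

# Sketch: a GPT-2 LM-head model sized to the bert-base-chinese vocabulary,
# matching the config above (vocab_size 21128, n_positions 512).
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
config = GPT2Config(vocab_size=tokenizer.vocab_size, n_positions=512)
model = GPT2LMHeadModel(config)
print(model.num_parameters())  # freshly initialized weights, not the OpenAI gpt2 checkpoint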

The error log is as follows:

[INFO|trainer.py:703] 2021-02-10 11:30:39,997 >> ***** Running training *****
[INFO|trainer.py:704] 2021-02-10 11:30:39,997 >>   Num examples = 744899
[INFO|trainer.py:705] 2021-02-10 11:30:39,997 >>   Num Epochs = 50
[INFO|trainer.py:706] 2021-02-10 11:30:39,997 >>   Instantaneous batch size per device = 12
[INFO|trainer.py:707] 2021-02-10 11:30:39,997 >>   Total train batch size (w. parallel, distributed & accumulation) = 96
[INFO|trainer.py:708] 2021-02-10 11:30:39,997 >>   Gradient Accumulation steps = 1
[INFO|trainer.py:709] 2021-02-10 11:30:39,997 >>   Total optimization steps = 388000
[INFO|trainer.py:725] 2021-02-10 11:30:40,011 >>   Continuing training from checkpoint, will skip to saved global_step
[INFO|trainer.py:726] 2021-02-10 11:30:40,011 >>   Continuing training from epoch 12
[INFO|trainer.py:727] 2021-02-10 11:30:40,011 >>   Continuing training from global step 100002
  0%|                                                                                                           | 0/388000 [00:00<?, ?it/s]/data/miniconda3/lib/python3.7/site-packages/torch/nn/parallel/_functions.py:61: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
  warnings.warn('Was asked to gather along dimension 0, but all '
 26%|███████████████████████▋                                                                    | 100003/388000 [00:17<00:50, 5746.78it/s]Traceback (most recent call last):
  File "run_clm.py", line 321, in <module>
    main()
  File "run_clm.py", line 291, in main
    trainer.train(model_path=model_path)
  File "/data/miniconda3/lib/python3.7/site-packages/transformers/trainer.py", line 799, in train
    tr_loss += self.training_step(model, inputs)
  File "/data/miniconda3/lib/python3.7/site-packages/transformers/trainer.py", line 1139, in training_step
    loss = self.compute_loss(model, inputs)
  File "/data/miniconda3/lib/python3.7/site-packages/transformers/trainer.py", line 1163, in compute_loss
    outputs = model(**inputs)
  File "/data/miniconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/data/miniconda3/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 156, in forward
    return self.gather(outputs, self.output_device)
  File "/data/miniconda3/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 168, in gather
    return gather(outputs, output_device, dim=self.dim)
  File "/data/miniconda3/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 68, in gather
    res = gather_map(outputs)
  File "/data/miniconda3/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 62, in gather_map
    for k in out))
  File "<string>", line 9, in __init__
  File "/data/miniconda3/lib/python3.7/site-packages/transformers/file_utils.py", line 1412, in __post_init__
    for element in iterator:
  File "/data/miniconda3/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 62, in <genexpr>
    for k in out))
  File "/data/miniconda3/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
  File "/data/miniconda3/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 55, in gather_map
    return Gather.apply(target_device, dim, *outputs)
  File "/data/miniconda3/lib/python3.7/site-packages/torch/nn/parallel/_functions.py", line 68, in forward
    return comm.gather(inputs, ctx.dim, ctx.target_device)
  File "/data/miniconda3/lib/python3.7/site-packages/torch/cuda/comm.py", line 165, in gather
    return torch._C._gather(tensors, dim, destination)
RuntimeError: CUDA out of memory. Tried to allocate 288.00 MiB (GPU 0; 23.88 GiB total capacity; 22.53 GiB already allocated; 86.38 MiB free; 23.21 GiB reserved in total by PyTorch)
sgugger commented 3 years ago

I'm quite puzzled too, to be honest. I know that sometimes PyTorch will trigger a CUDA OOM error even if there is enough memory in theory, just because it can't find a contiguous chunk or has some leftover allocations, which is exactly what your message suggests (22.53 GiB allocated but 23.21 GiB reserved by PyTorch). I don't have any suggestion apart from trying the usual strategies to lower the memory footprint a bit (slightly reduce the batch size or block size).
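
A minimal sketch of what that could look like, keeping the effective batch size constant by trading per-device batch size for gradient accumulation (the values are illustrative, not a tested fix; gradient checkpointing is a further compute-for-memory trade):

from transformers import GPT2LMHeadModel, TrainingArguments

# Illustrative only: halve the per-device batch size and compensate with gradient
# accumulation so the effective batch per device stays at 12, at some cost in speed.
model = GPT2LMHeadModel.from_pretrained("gpt2_result/checkpoint-100000")
model.gradient_checkpointing_enable()   # recent transformers versions; older ones set config.gradient_checkpointing = True

training_args = TrainingArguments(
    output_dir="gpt2_result",
    per_device_train_batch_size=6,      # was 12
    gradient_accumulation_steps=2,      # 6 * 2 = 12 effective per device
    learning_rate=5e-5,
    num_train_epochs=50,
)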

liyucheng09 commented 3 years ago

@sgugger Appreciate your reply! I am wondering whether I can resume training if I change the batch size or block size in the training args. I have no idea whether that would be compatible with the saved scheduler and optimizer state.
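
For concreteness, resuming would look roughly like this (a sketch, not my exact script; the dataset is built as in run_clm.py and only indicated by a placeholder here):

from transformers import GPT2LMHeadModel, Trainer, TrainingArguments

checkpoint = "gpt2_result/checkpoint-100000"
model = GPT2LMHeadModel.from_pretrained(checkpoint)

training_args = TrainingArguments(
    output_dir="gpt2_result",
    per_device_train_batch_size=8,   # smaller than the original 12
    num_train_epochs=50,
)

train_dataset = ...  # tokenized and grouped dataset, built as in run_clm.py (omitted)

# The Trainer restores optimizer.pt and scheduler.pt from the checkpoint folder;
# the batch size itself is not stored there, but changing it changes the number of
# steps per epoch, so the "skip to saved global_step" bookkeeping will differ.
trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
# The Trainer in the traceback takes model_path=...; newer versions use resume_from_checkpoint=...
trainer.train(resume_from_checkpoint=checkpoint)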

xinjicong commented 3 years ago

> @sgugger Appreciate your reply! I am wondering whether I can resume training if I change the batch size or block size in the training args. I have no idea whether that would be compatible with the saved scheduler and optimizer state.

Hello, have you solved this problem yet?

liyucheng09 commented 3 years ago

@xinjicong Not yet. If you have any ideas, please share.

xinjicong commented 3 years ago

> @xinjicong Not yet. If you have any ideas, please share.

I tried making max_seq_length smaller, but it didn't work.

xinjicong commented 3 years ago

> @xinjicong Not yet. If you have any ideas, please share.

I checked my code and found the problem was in how I was using the tokenizer: the tokenizer output had an extra dimension, which then caused an error later when batching.
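
For anyone hitting the same thing, a hedged reconstruction of that pitfall (not xinjicong's actual code): calling the tokenizer with return_tensors on a single example adds a leading batch dimension, so each feature comes back with shape (1, seq_len) instead of a flat list, and batching later fails.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
text = "一个简单的例子"

bad = tokenizer(text, return_tensors="pt")   # input_ids has shape (1, seq_len)
good = tokenizer(text)                       # input_ids is a flat list of ids

print(bad["input_ids"].shape)   # torch.Size([1, 9]) -- extra leading dimension
print(len(good["input_ids"]))   # 9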

dorooddorood606 commented 3 years ago

I observe the same issue: if I train a model, save a checkpoint, and reload from it, I get memory issues with code that was training fine before.

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

akshaygulabrao commented 2 years ago

Same Issue

perceptiveshawty commented 2 years ago

+1

dinsausti commented 1 year ago

I have this issue as well. The model trains for one epoch and goes through the validation step, then I get OOM somewhere in the second epoch. These are large models, and I often get OOM after they have been training for a couple of hours.

perceptiveshawty commented 1 year ago

@dinsausti-vir Try reducing the validation batch size to 1. I'm not sure exactly how I fixed the error, but batch size is usually the cause of OOM.
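
Roughly what I mean, as a sketch (eval_accumulation_steps is an additional knob that periodically moves accumulated predictions off the GPU during evaluation):

from transformers import TrainingArguments

# Sketch only: shrink the evaluation batch size, and flush accumulated prediction
# tensors to the CPU every 16 eval steps so they don't pile up on one GPU.
training_args = TrainingArguments(
    output_dir="output",
    per_device_train_batch_size=12,
    per_device_eval_batch_size=1,
    eval_accumulation_steps=16,
)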

dinsausti commented 1 year ago

@perceptiveshawty Thanks for the tip. I will give that a shot!