huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Trainer: Cannot train with 3+ GPUs / Uneven Memory Consumption #13986

Closed oborchers closed 3 years ago

oborchers commented 3 years ago

Environment info

Who can help

@sgugger @patil-suraj

Information

Model I am using (Bert, XLNet ...):

The problem arises when using:

The tasks I am working on are:

To reproduce

I'm running the Trainer class and am essentially just fine-tuning a GPT-Neo variant. I don't use any specific CLI options and just call python train.py.

What happens? With EleutherAI/gpt-neo-1.3B I run into CUDA OOM errors depending on how many GPUs I want to use for training. For example:

So effectively I am unable to train with more than 2 GPUs.

training_args = TrainingArguments(
    output_dir='results', 
    num_train_epochs=EPOCHS, 
    logging_steps=EPOCHS,
    load_best_model_at_end=True, 
    save_strategy="epoch", 
    evaluation_strategy="epoch",
    per_device_train_batch_size=BATCH_SIZE, 
    per_device_eval_batch_size=BATCH_SIZE,
    warmup_steps=100, 
    weight_decay=0.01, 
    logging_dir='logs',
    report_to="none",
    save_total_limit=15,
    seed=42,
)

# start training
Trainer(model=model, 
        args=training_args, 
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        data_collator=lambda data: {
            # stack the fixed-length samples into a batch; labels reuse input_ids for causal LM
            'input_ids': torch.stack([f[0] for f in data]),
            'attention_mask': torch.stack([f[1] for f in data]),
            'labels': torch.stack([f[0] for f in data]),
        }
).train()

The memory consumption on those two GPUs is also very imbalanced:

+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2...  On   | 00000000:89:00.0 Off |                    0 |
| N/A   78C    P0   195W / 300W |  32212MiB / 32510MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2...  On   | 00000000:B2:00.0 Off |                    0 |
| N/A   83C    P0   281W / 300W |  16096MiB / 32510MiB |     99%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

I also tried running the training script with the torch.distributed launcher, but that doesn't work for me either. For example:

python -m torch.distributed.launch --nproc_per_node=2 train.py

Am I missing something obvious?

Expected behavior

The Trainer should be able to handle more than 2 GPUs.

sgugger commented 3 years ago

When you use python train.py, you use PyTorch DataParallel behind the scenes, which only computes gradients and optimizer states on the main GPU. This is why you see this imbalance in memory usage.

When using python -m torch.distributed.launch --nproc_per_node=2 train.py (which is the recommended way according to the PyTorch documentation) each GPU will have a copy of the gradients and optimizer states so the memory usage will be balanced across GPUs.

In both cases the number of GPUs should not affect whether you go OOM or not, unless you have batches of dynamic sizes. In that case, it's best to ensure the largest batches come first so you see the OOM as soon as possible.
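As a quick sanity check of which of the two modes a given launch actually uses, here is a minimal sketch (not from this thread) relying on the parallel_mode and n_gpu properties of TrainingArguments:

from transformers import TrainingArguments

args = TrainingArguments(output_dir="results")
# Under plain `python train.py` with several visible GPUs this reports
# ParallelMode.NOT_DISTRIBUTED (DataParallel); under `torch.distributed.launch`
# each process reports ParallelMode.DISTRIBUTED (DistributedDataParallel).
print(args.parallel_mode, "n_gpu:", args.n_gpu)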

oborchers commented 3 years ago

@sgugger: Many thanks for the fast reply. That's why I am so confused that this isn't working (and I've played around with it quite a lot). The batch size is not dynamic (the tokenizer pads everything to max_length, which is set to 512) and is always 2 or 1 for this very example.

The code runs fully containerized; the container only has access to specific device IDs. The devices are not occupied by any other services or processes.
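Both points are easy to verify; a minimal sketch, assuming the same tokenizer settings as in the full script further down:

import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-1.3B", pad_token="<|pad|>")
enc = tokenizer("a short sample", truncation=True, max_length=512, padding="max_length")
assert len(enc["input_ids"]) == 512  # every sample is padded to the same fixed length
print("GPUs visible inside the container:", torch.cuda.device_count())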


Let me describe the results in more detail:

+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2...  On   | 00000000:89:00.0 Off |                    0 |
| N/A   68C    P0   280W / 300W |  32212MiB / 32510MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2...  On   | 00000000:B2:00.0 Off |                    0 |
| N/A   57C    P0   287W / 300W |  16096MiB / 32510MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

Speed: 1.19s/it

Traceback (most recent call last):
  File "train.py", line 142, in <module>
    Trainer(model=model,
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 1280, in train
    tr_loss += self.training_step(model, inputs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 1791, in training_step
    loss.backward()
  File "/usr/local/lib/python3.8/dist-packages/torch/_tensor.py", line 255, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/usr/local/lib/python3.8/dist-packages/torch/autograd/__init__.py", line 147, in backward
    Variable._execution_engine.run_backward(
  File "/usr/local/lib/python3.8/dist-packages/torch/autograd/function.py", line 87, in apply
    return self._forward_cls.backward(self, *args)  # type: ignore[attr-defined]
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/_functions.py", line 34, in backward
    return (None,) + ReduceAddCoalesced.apply(ctx.input_device, ctx.num_inputs, *grad_outputs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/_functions.py", line 45, in forward
    return comm.reduce_add_coalesced(grads_, destination)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/comm.py", line 143, in reduce_add_coalesced
    flat_result = reduce_add(flat_tensors, destination)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/comm.py", line 95, in reduce_add
    result = torch.empty_like(inputs[root_index])
RuntimeError: CUDA out of memory. Tried to allocate 64.00 MiB (GPU 0; 31.75 GiB total capacity; 29.39 GiB already allocated; 25.75 MiB free; 30.20 GiB reserved in total by PyTorch)

Speed: None

+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2...  On   | 00000000:89:00.0 Off |                    0 |
| N/A   67C    P0   288W / 300W |  32232MiB / 32510MiB |     89%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2...  On   | 00000000:B2:00.0 Off |                    0 |
| N/A   55C    P0   291W / 300W |  11554MiB / 32510MiB |     92%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2...  On   | 00000000:B3:00.0 Off |                    0 |
| N/A   57C    P0   292W / 300W |  11530MiB / 32510MiB |     94%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

Speed: 1.06s/it

+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2...  On   | 00000000:89:00.0 Off |                    0 |
| N/A   66C    P0   280W / 300W |  31868MiB / 32510MiB |     98%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2...  On   | 00000000:B2:00.0 Off |                    0 |
| N/A   42C    P0    41W / 300W |      3MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2...  On   | 00000000:B3:00.0 Off |                    0 |
| N/A   41C    P0    43W / 300W |      3MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

Speed: 1.69it/s

Running via torch.distributed.launch OOMs, despite the fact that the same setup works perfectly fine with python train.py:

RuntimeError: CUDA out of memory. Tried to allocate 32.00 MiB (GPU 0; 31.75 GiB total capacity; 29.71 GiB already allocated; 5.75 MiB free; 30.28 GiB reserved in total by PyTorch)
  0%|                                                                                                                                                                         | 1/150780 [00:01<66:55:39,  1.60s/it]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2360) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 689, in run
    elastic_launch(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 116, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 244, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

This is the training code I'm using:

model_name = "EleutherAI/gpt-neo-1.3B"
EPOCHS = 10
BATCH_SIZE = 2

import os
import re
import torch
import random
import pandas as pd
from tqdm import tqdm
from torch.utils.data import Dataset
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer

USER_TOKEN = "<user>"
BOT_TOKEN = "<bot>"

class DialogData(Dataset):
    def __init__(self, dialogs, tokenizer, max_length):
        self.input_ids = []
        self.attn_masks = []
        self.labels = []
        for dialog in dialogs:
            spans = []
            # ...
            prep_txt = (
                "".join(spans) + 
                "<|endoftext|>"
            )
            encodings_dict = tokenizer(
                prep_txt, 
                truncation=True,
                max_length=max_length, 
                padding="max_length"
            )
            # append to list
            self.input_ids.append(torch.tensor(encodings_dict['input_ids']))
            self.attn_masks.append(torch.tensor(encodings_dict['attention_mask']))
            self.labels.append(torch.tensor(encodings_dict['input_ids']))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.attn_masks[idx], self.labels[idx]

model_suffix = model_name.split("/")[-1]

print(f"Training: {model_name}")
print(f"Suffix: {model_suffix}")

torch.manual_seed(42)
tokenizer = AutoTokenizer.from_pretrained(
    model_name, 
    bos_token='<|startoftext|>',
    eos_token='<|endoftext|>', 
    pad_token='<|pad|>'
)
tokenizer.add_tokens([USER_TOKEN, BOT_TOKEN])
model = AutoModelForCausalLM.from_pretrained(model_name).cuda()
model.resize_token_embeddings(len(tokenizer))

dialogs = [f"{USER_TOKEN}I am an arbitrary training dataset" for _ in range(10_000)]
X_train, X_test = train_test_split(
    dialogs,
    shuffle=True, 
    test_size=0.05, 
    random_state=1,
)

train_dataset = DialogData(X_train, tokenizer, max_length=512)
eval_dataset = DialogData(X_test, tokenizer, max_length=512)

print(f"Training dataset: {len(train_dataset)}")
print(f"Evaluate dataset: {len(eval_dataset)}")

training_args = TrainingArguments(
    output_dir='results', 
    num_train_epochs=EPOCHS, 
    logging_steps=EPOCHS,
    load_best_model_at_end=True, 
    save_strategy="epoch", 
    evaluation_strategy="epoch",
    per_device_train_batch_size=BATCH_SIZE, 
    per_device_eval_batch_size=BATCH_SIZE,
    warmup_steps=100, 
    weight_decay=0.01, 
    logging_dir='logs',
    report_to="none",
    save_total_limit=15,
    seed=42,
)

Trainer(model=model, 
        args=training_args, 
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        data_collator=lambda data: {
            'input_ids': torch.stack([f[0] for f in data]),
            'attention_mask': torch.stack([f[1] for f in data]),
            'labels': torch.stack([f[0] for f in data]),
        }
).train()
sgugger commented 3 years ago

I don't see anything out of the ordinary.

oborchers commented 3 years ago

I see - because the memory is so tight, I assumed this was an error on the code side, but thinking about it, it makes sense that there is very little room for any overhead. I also tried fp16, but that doesn't work either due to NaNs. I will experiment a bit more, but I'll probably just let it run until it's done. Many thanks for your help!
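For reference, the fp16 attempt is just the standard TrainingArguments flag; a minimal sketch, keeping the remaining arguments as in the script above:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="results",
    fp16=True,  # mixed-precision training; with this checkpoint the loss went to NaN for me
    # ... remaining arguments as in the script above
)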

oborchers commented 3 years ago

Can safely confirm that it works nicely out of the box with the 125M variant of the model. Thus I will have to play around with ZeRO or fp16 to understand how to get it to work with the larger ones. Many thanks!

oborchers commented 3 years ago

@sgugger, actually it was much more difficult to get to where I wanted to be. I can now train with fp16 enabled and with ZeRO-2 on 3 GPUs (more tests to come with more GPUs). The problem seems to have been resolved by running the container in which the training takes place with certain args:

docker run -it --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 --gpus '"device=0,1,6"' -v $(pwd):/home -v /data/shared/transformers:/var/transformers trainer

where --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 did the trick.

Otherwise I was simply not able to run deepspeed train.py without running into NCCL errors.
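For completeness, a minimal sketch of an fp16 + ZeRO-2 setup of the kind described above (not the exact config from this thread; the Trainer's deepspeed argument accepts a path to a JSON config or, in recent versions, a dict like this):

from transformers import TrainingArguments

ds_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2},  # ZeRO stage 2: shard optimizer states and gradients
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

training_args = TrainingArguments(
    output_dir="results",
    fp16=True,
    deepspeed=ds_config,  # or the path to a ds_config.json file
    # ... remaining arguments as in the script above
)

launched via the DeepSpeed launcher, e.g. deepspeed --num_gpus=3 train.py.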