huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Trainer: Cannot train with 3+ GPUs / Uneven Memory Consumption #13986

Closed oborchers closed 3 years ago

oborchers commented 3 years ago

Environment info

Who can help

@sgugger @patil-suraj

Information

Model I am using (Bert, XLNet ...):

The problem arises when using:

The tasks I am working on are:

To reproduce

I'm running the Trainer class and am essentially just fine-tuning a GPT-Neo variant. I don't use any specific CLI options and just call python train.py.

What happens? With EleutherAI/gpt-neo-1.3B I run into CUDA OOM errors depending on how many GPUs I want to use for training. For example:

So effectively I am unable to train with more than 2 GPUs.

training_args = TrainingArguments(
    output_dir='results', 
    num_train_epochs=EPOCHS, 
    logging_steps=EPOCHS,
    load_best_model_at_end=True, 
    save_strategy="epoch", 
    evaluation_strategy="epoch",
    per_device_train_batch_size=BATCH_SIZE, 
    per_device_eval_batch_size=BATCH_SIZE,
    warmup_steps=100, 
    weight_decay=0.01, 
    logging_dir='logs',
    report_to="none",
    save_total_limit=15,
    seed=42,
)

# start training
Trainer(model=model, 
        args=training_args, 
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        data_collator=lambda data: {
            # stack the fixed-length samples into a batch; labels reuse input_ids for causal LM
            'input_ids': torch.stack([f[0] for f in data]),
            'attention_mask': torch.stack([f[1] for f in data]),
            'labels': torch.stack([f[0] for f in data]),
        }
).train()

The memory consumption on those two GPUs is also very imbalanced:

+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2...  On   | 00000000:89:00.0 Off |                    0 |
| N/A   78C    P0   195W / 300W |  32212MiB / 32510MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2...  On   | 00000000:B2:00.0 Off |                    0 |
| N/A   83C    P0   281W / 300W |  16096MiB / 32510MiB |     99%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

I also tried running the training script with the torch.distributed launcher, but that doesn't work for me either. For example:

python -m torch.distributed.launch --nproc_per_node=2 train.py

Am I missing something obvious?

Expected behavior

The Trainer should be able to handle more than 2 GPUs.

sgugger commented 3 years ago

When you use python train.py, you use PyTorch DataParallel behind the scenes, which only computes gradients and optimizer states on the main GPU. This is why you see this imbalance in memory usage.

When using python -m torch.distributed.launch --nproc_per_node=2 train.py (which is the recommended way according to the PyTorch documentation) each GPU will have a copy of the gradients and optimizer states so the memory usage will be balanced across GPUs.

In both cases the number of GPUs should not affect whether you go OOM or not, unless you have batches of dynamic sizes. In that case, it's best to ensure the largest batches come first so you see the OOM as soon as possible.
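As a quick sanity check of which of the two modes a given launch actually uses, here is a minimal sketch (not from this thread) relying on the parallel_mode and n_gpu properties of TrainingArguments:

from transformers import TrainingArguments

args = TrainingArguments(output_dir="results")
# Under plain `python train.py` with several visible GPUs this reports
# ParallelMode.NOT_DISTRIBUTED (DataParallel); under `torch.distributed.launch`
# each process reports ParallelMode.DISTRIBUTED (DistributedDataParallel).
print(args.parallel_mode, "n_gpu:", args.n_gpu)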

oborchers commented 3 years ago

@sgugger: Many thanks for the fast reply. That's why I am so confused that this isn't working (and I've played around with it quite a lot). The batch size is not dynamic (the tokenizer pads everything to max_length, which is set to 512) and is always 2 or 1 for this very example.

The code runs fully containerized; the container only has access to specific device IDs. The devices are not occupied by any other services or processes.
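Both points are easy to verify; a minimal sketch, assuming the same tokenizer settings as in the full script further down:

import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-1.3B", pad_token="<|pad|>")
enc = tokenizer("a short sample", truncation=True, max_length=512, padding="max_length")
assert len(enc["input_ids"]) == 512  # every sample is padded to the same fixed length
print("GPUs visible inside the container:", torch.cuda.device_count())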


Let me describe the results in more detail:

+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2...  On   | 00000000:89:00.0 Off |                    0 |
| N/A   68C    P0   280W / 300W |  32212MiB / 32510MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2...  On   | 00000000:B2:00.0 Off |                    0 |
| N/A   57C    P0   287W / 300W |  16096MiB / 32510MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

Speed: 1.19s/it

Traceback (most recent call last):
  File "train.py", line 142, in <module>
    Trainer(model=model,
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 1280, in train
    tr_loss += self.training_step(model, inputs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 1791, in training_step
    loss.backward()
  File "/usr/local/lib/python3.8/dist-packages/torch/_tensor.py", line 255, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/usr/local/lib/python3.8/dist-packages/torch/autograd/__init__.py", line 147, in backward
    Variable._execution_engine.run_backward(
  File "/usr/local/lib/python3.8/dist-packages/torch/autograd/function.py", line 87, in apply
    return self._forward_cls.backward(self, *args)  # type: ignore[attr-defined]
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/_functions.py", line 34, in backward
    return (None,) + ReduceAddCoalesced.apply(ctx.input_device, ctx.num_inputs, *grad_outputs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/_functions.py", line 45, in forward
    return comm.reduce_add_coalesced(grads_, destination)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/comm.py", line 143, in reduce_add_coalesced
    flat_result = reduce_add(flat_tensors, destination)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/comm.py", line 95, in reduce_add
    result = torch.empty_like(inputs[root_index])
RuntimeError: CUDA out of memory. Tried to allocate 64.00 MiB (GPU 0; 31.75 GiB total capacity; 29.39 GiB already allocated; 25.75 MiB free; 30.20 GiB reserved in total by PyTorch)

Speed: None

+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2...  On   | 00000000:89:00.0 Off |                    0 |
| N/A   67C    P0   288W / 300W |  32232MiB / 32510MiB |     89%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2...  On   | 00000000:B2:00.0 Off |                    0 |
| N/A   55C    P0   291W / 300W |  11554MiB / 32510MiB |     92%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2...  On   | 00000000:B3:00.0 Off |                    0 |
| N/A   57C    P0   292W / 300W |  11530MiB / 32510MiB |     94%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

Speed: 1.06s/it

+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2...  On   | 00000000:89:00.0 Off |                    0 |
| N/A   66C    P0   280W / 300W |  31868MiB / 32510MiB |     98%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2...  On   | 00000000:B2:00.0 Off |                    0 |
| N/A   42C    P0    41W / 300W |      3MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2...  On   | 00000000:B3:00.0 Off |                    0 |
| N/A   41C    P0    43W / 300W |      3MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

Speed: 1.69it/s

Running via torch.distributed.launch OOMs, despite the fact that the same setup works perfectly fine with python train.py:

RuntimeError: CUDA out of memory. Tried to allocate 32.00 MiB (GPU 0; 31.75 GiB total capacity; 29.71 GiB already allocated; 5.75 MiB free; 30.28 GiB reserved in total by PyTorch)
  0%|                                                                                                                                                                         | 1/150780 [00:01<66:55:39,  1.60s/it]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2360) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 689, in run
    elastic_launch(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 116, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 244, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

This is the training code I'm using:

model_name = "EleutherAI/gpt-neo-1.3B"
EPOCHS = 10
BATCH_SIZE = 2

import os
import re
import torch
import random
import pandas as pd
from tqdm import tqdm
from torch.utils.data import Dataset
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer

USER_TOKEN = "<user>"
BOT_TOKEN = "<bot>"

class DialogData(Dataset):
    def __init__(self, dialogs, tokenizer, max_length):
        self.input_ids = []
        self.attn_masks = []
        self.labels = []
        for dialog in dialogs:
            spans = []
            # ...
            prep_txt = (
                "".join(spans) + 
                "<|endoftext|>"
            )
            encodings_dict = tokenizer(
                prep_txt, 
                truncation=True,
                max_length=max_length, 
                padding="max_length"
            )
            # append to list
            self.input_ids.append(torch.tensor(encodings_dict['input_ids']))
            self.attn_masks.append(torch.tensor(encodings_dict['attention_mask']))
            self.labels.append(torch.tensor(encodings_dict['input_ids']))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.attn_masks[idx], self.labels[idx]

model_suffix = model_name.split("/")[-1]

print(f"Training: {model_name}")
print(f"Suffix: {model_suffix}")

torch.manual_seed(42)
tokenizer = AutoTokenizer.from_pretrained(
    model_name, 
    bos_token='<|startoftext|>',
    eos_token='<|endoftext|>', 
    pad_token='<|pad|>'
)
tokenizer.add_tokens([USER_TOKEN, BOT_TOKEN])
model = AutoModelForCausalLM.from_pretrained(model_name).cuda()
model.resize_token_embeddings(len(tokenizer))

dialogs = [f"{USER_TOKEN}I am an arbitrary training dataset" for _ in range(10_000)]
X_train, X_test = train_test_split(
    dialogs,
    shuffle=True, 
    test_size=0.05, 
    random_state=1,
)

train_dataset = DialogData(X_train, tokenizer, max_length=512)
eval_dataset = DialogData(X_test, tokenizer, max_length=512)

print(f"Training dataset: {len(train_dataset)}")
print(f"Evaluate dataset: {len(eval_dataset)}")

training_args = TrainingArguments(
    output_dir='results', 
    num_train_epochs=EPOCHS, 
    logging_steps=EPOCHS,
    load_best_model_at_end=True, 
    save_strategy="epoch", 
    evaluation_strategy="epoch",
    per_device_train_batch_size=BATCH_SIZE, 
    per_device_eval_batch_size=BATCH_SIZE,
    warmup_steps=100, 
    weight_decay=0.01, 
    logging_dir='logs',
    report_to="none",
    save_total_limit=15,
    seed=42,
)

Trainer(model=model, 
        args=training_args, 
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        data_collator=lambda data: {
            'input_ids': torch.stack([f[0] for f in data]),
            'attention_mask': torch.stack([f[1] for f in data]),
            'labels': torch.stack([f[0] for f in data]),
        }
).train()
sgugger commented 3 years ago

I don't see anything out of the ordinary.

oborchers commented 3 years ago

I see - because the memory is so tight, I assumed this was an error on the code side, but thinking about it, it makes sense that there is very little room for any overhead. I also tried fp16, but that doesn't work either due to NaNs. I will experiment a bit more, but I'll probably just let it run until it's done. Many thanks for your help!
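For reference, the fp16 attempt is just the standard TrainingArguments flag; a minimal sketch, keeping the remaining arguments as in the script above:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="results",
    fp16=True,  # mixed-precision training; with this checkpoint the loss went to NaN for me
    # ... remaining arguments as in the script above
)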

oborchers commented 3 years ago

Can safely confirm that it works nicely out of the box with the 125M variant of the model. Thus I will have to play around with ZeRO or fp16 to understand how to get it to work with the larger ones. Many thanks!

oborchers commented 3 years ago

@sgugger, actually it was much more difficult to get to where I wanted to be. I can now train with fp16 enabled and with ZeRO-2 on 3 GPUs (more tests to come with more GPUs). The problem seems to have been resolved by running the container in which the training takes place with certain args:

docker run -it --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 --gpus '"device=0,1,6"' -v $(pwd):/home -v /data/shared/transformers:/var/transformers trainer

where --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 did the trick.

Otherwise I was simply not able to run deepspeed train.py without running into NCCL errors.
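For completeness, a minimal sketch of an fp16 + ZeRO-2 setup of the kind described above (not the exact config from this thread; the Trainer's deepspeed argument accepts a path to a JSON config or, in recent versions, a dict like this):

from transformers import TrainingArguments

ds_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2},  # ZeRO stage 2: shard optimizer states and gradients
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

training_args = TrainingArguments(
    output_dir="results",
    fp16=True,
    deepspeed=ds_config,  # or the path to a ds_config.json file
    # ... remaining arguments as in the script above
)

launched via the DeepSpeed launcher, e.g. deepspeed --num_gpus=3 train.py.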