When you use python train.py, you use PyTorch DataParallel behind the scenes, which only computes gradients and optimizer states on the main GPU. This is why you see this imbalance in memory usage.
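Roughly, this is what happens under the hood (a minimal sketch with a stand-in model, not the actual Trainer internals):
import torch
from torch import nn

# Stand-in model; with plain `python train.py` and several visible GPUs,
# the model effectively ends up wrapped in nn.DataParallel.
model = nn.Linear(1024, 1024)
if torch.cuda.device_count() > 1:
    # Replicas run on every visible GPU, but gradients are gathered and the
    # optimizer step happens on the first device, hence the memory imbalance.
    model = nn.DataParallel(model)
model = model.cuda()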
When using python -m torch.distributed.launch --nproc_per_node=2 train.py (which is the recommended way according to the PyTorch documentation), each GPU will have a copy of the gradients and optimizer states, so the memory usage will be balanced across GPUs.
In both cases, the number of GPUs should not affect whether you go OOM or not, unless you have batches of dynamic sizes. In that case, it's best to ensure the largest batches come first so you see the OOM as soon as possible.
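For comparison, a minimal sketch of what each process does under the distributed launcher (assuming a single node, the NCCL backend, and a stand-in model; the Trainer takes care of this for you):
import os
import torch
import torch.distributed as dist
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP

# One process per GPU; the launcher provides the local rank (via the LOCAL_RANK
# environment variable or a --local_rank argument, depending on the version).
local_rank = int(os.environ.get("LOCAL_RANK", 0))
dist.init_process_group(backend="nccl")
torch.cuda.set_device(local_rank)

model = nn.Linear(1024, 1024).cuda()
# Every process keeps its own gradients and optimizer state, so memory
# usage is balanced across the GPUs.
model = DDP(model, device_ids=[local_rank])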
@sgugger: Many thanks for the fast reply. That's why I am so confused about why this isn't working (and I've played around with it quite a lot). The batch size is not dynamic (because the tokenizer pads to max_length, which is set to 512) and is always set to 2 or 1 for this very example.
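As a quick sanity check (a sketch using the 125M tokenizer as a stand-in for the one in my script), padding to max_length really does produce fixed-size inputs, so every batch has the same shape:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "EleutherAI/gpt-neo-125M",
    bos_token='<|startoftext|>',
    eos_token='<|endoftext|>',
    pad_token='<|pad|>'
)
enc = tokenizer("a short example", truncation=True, max_length=512, padding="max_length")
print(len(enc["input_ids"]))  # 512, regardless of the input length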
The code runs completely containerized, the container only has access to specific device IDs. The devices are not occupied otherwise by any other services or processes.
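Inside the container I also verify which devices the process actually sees, along these lines (sketch):
import torch

# Should match the device IDs exposed to the container.
print(torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))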
Let me describe the results better:
python train.py and 2 GPUs available + batch size = 2:
+-------------------------------+----------------------+----------------------+
| 5 Tesla V100-SXM2... On | 00000000:89:00.0 Off | 0 |
| N/A 68C P0 280W / 300W | 32212MiB / 32510MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 6 Tesla V100-SXM2... On | 00000000:B2:00.0 Off | 0 |
| N/A 57C P0 287W / 300W | 16096MiB / 32510MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
Speed: 1.19s/it
python train.py and 3 GPUs available + batch size = 2:
Traceback (most recent call last):
File "train.py", line 142, in <module>
Trainer(model=model,
File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 1280, in train
tr_loss += self.training_step(model, inputs)
File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 1791, in training_step
loss.backward()
File "/usr/local/lib/python3.8/dist-packages/torch/_tensor.py", line 255, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/usr/local/lib/python3.8/dist-packages/torch/autograd/__init__.py", line 147, in backward
Variable._execution_engine.run_backward(
File "/usr/local/lib/python3.8/dist-packages/torch/autograd/function.py", line 87, in apply
return self._forward_cls.backward(self, *args) # type: ignore[attr-defined]
File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/_functions.py", line 34, in backward
return (None,) + ReduceAddCoalesced.apply(ctx.input_device, ctx.num_inputs, *grad_outputs)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/_functions.py", line 45, in forward
return comm.reduce_add_coalesced(grads_, destination)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/comm.py", line 143, in reduce_add_coalesced
flat_result = reduce_add(flat_tensors, destination)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/comm.py", line 95, in reduce_add
result = torch.empty_like(inputs[root_index])
RuntimeError: CUDA out of memory. Tried to allocate 64.00 MiB (GPU 0; 31.75 GiB total capacity; 29.39 GiB already allocated; 25.75 MiB free; 30.20 GiB reserved in total by PyTorch)
Speed: None
python train.py and 3 GPUs available + batch size = 1: -> trains, but runtime is much slower than with 2 GPUs + batch size = 1
+-------------------------------+----------------------+----------------------+
| 5 Tesla V100-SXM2... On | 00000000:89:00.0 Off | 0 |
| N/A 67C P0 288W / 300W | 32232MiB / 32510MiB | 89% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 6 Tesla V100-SXM2... On | 00000000:B2:00.0 Off | 0 |
| N/A 55C P0 291W / 300W | 11554MiB / 32510MiB | 92% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 7 Tesla V100-SXM2... On | 00000000:B3:00.0 Off | 0 |
| N/A 57C P0 292W / 300W | 11530MiB / 32510MiB | 94% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
Speed: 1.06s/it
python -m torch.distributed.launch --nproc_per_node=1 train.py and 3 GPUs available + batch size = 1:
+-------------------------------+----------------------+----------------------+
| 5 Tesla V100-SXM2... On | 00000000:89:00.0 Off | 0 |
| N/A 66C P0 280W / 300W | 31868MiB / 32510MiB | 98% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 6 Tesla V100-SXM2... On | 00000000:B2:00.0 Off | 0 |
| N/A 42C P0 41W / 300W | 3MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 7 Tesla V100-SXM2... On | 00000000:B3:00.0 Off | 0 |
| N/A 41C P0 43W / 300W | 3MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
Speed: 1.69it/s
python -m torch.distributed.launch --nproc_per_node=2 train.py and 3 GPUs available + batch size = 2: OOM, despite the fact that this works perfectly fine with python train.py
RuntimeError: CUDA out of memory. Tried to allocate 32.00 MiB (GPU 0; 31.75 GiB total capacity; 29.71 GiB already allocated; 5.75 MiB free; 30.28 GiB reserved in total by PyTorch)
0%| | 1/150780 [00:01<66:55:39, 1.60s/it]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2360) of binary: /usr/bin/python
Traceback (most recent call last):
File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 193, in <module>
main()
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 689, in run
elastic_launch(
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 116, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 244, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
This is the training code I'm using:
model_name = "EleutherAI/gpt-neo-1.3B"
EPOCHS = 10
BATCH_SIZE = 2
import os
import re
import torch
import random
import pandas as pd
from tqdm import tqdm
from torch.utils.data import Dataset
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer
import torch
USER_TOKEN = "<user>"
BOT_TOKEN = "<bot>"
class DialogData(Dataset):
def __init__(self, dialogs, tokenizer, max_length):
self.input_ids = []
self.attn_masks = []
self.labels = []
for dialog in dialogs:
spans = []
# ...
prep_txt = (
"".join(spans) +
"<|endoftext|>"
)
encodings_dict = tokenizer(
prep_txt,
truncation=True,
max_length=max_length,
padding="max_length"
)
# append to list
self.input_ids.append(torch.tensor(encodings_dict['input_ids']))
self.attn_masks.append(torch.tensor(encodings_dict['attention_mask']))
self.labels.append(torch.tensor(encodings_dict['input_ids']))
def __len__(self):
return len(self.input_ids)
def __getitem__(self, idx):
return self.input_ids[idx], self.attn_masks[idx], self.labels[idx]
model_suffix = model_name.split("/")[-1]
print(f"Training: {model_name}")
print(f"Suffix: {model_suffix}")
torch.manual_seed(42)
tokenizer = AutoTokenizer.from_pretrained(
model_name,
bos_token='<|startoftext|>',
eos_token='<|endoftext|>',
pad_token='<|pad|>'
)
tokenizer.add_tokens([USER_TOKEN, BOT_TOKEN])
model = AutoModelForCausalLM.from_pretrained(model_name).cuda()
model.resize_token_embeddings(len(tokenizer))
dialogs = [f"{USER_TOKEN}I am an arbitrary training dataset" for _ in range(10_000)]
X_train, X_test= train_test_split(
dialogs,
shuffle=True,
test_size=0.05,
random_state=1,
)
train_dataset = DialogData(X_train, tokenizer, max_length=512)
eval_dataset = DialogData(X_test, tokenizer, max_length=512)
print(f"Training dataset: {len(train_dataset)}")
print(f"Evaluate dataset: {len(eval_dataset)}")
training_args = TrainingArguments(
output_dir='results',
num_train_epochs=EPOCHS,
logging_steps=EPOCHS,
load_best_model_at_end=True,
save_strategy="epoch",
evaluation_strategy="epoch",
per_device_train_batch_size=BATCH_SIZE,
per_device_eval_batch_size=BATCH_SIZE,
warmup_steps=100,
weight_decay=0.01,
logging_dir='logs',
report_to="none",
save_total_limit=15,
seed=42,
)
Trainer(model=model,
args=training_args,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
data_collator=lambda data: {
'input_ids': torch.stack([f[0] for f in data]),
'attention_mask': torch.stack([f[1] for f in data]),
'labels': torch.stack([f[0] for f in data]),
}
).train()
I don't see anything out of the ordinary:
I see - perhaps because of how tight the memory is I assumed this was an error on the code side, but thinking about it, it makes sense that there is very little room for any overhead. I actually also tried fp16, but that doesn't work either due to NaNs. I will try a bit more, but probably I'll just let it run until it's done. Many thanks for your help!
I can safely confirm that it works nicely out of the box with the 125M variant of the model. Thus I will have to play around with ZeRO or FP16 to understand how to get it to work with the larger ones. Many thanks!
@sgugger, actually it was much more difficult to get to the result I wanted. I can now train with fp16 enabled and with ZeRO-2 on 3 GPUs (more tests to come with more GPUs). The problem seems to have been resolved by running the container in which the training takes place with certain arguments:
docker run -it --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 --gpus '"device=0,1,6"' -v $(pwd):/home -v /data/shared/transformers:/var/transformers trainer
where --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 did the trick. Otherwise I was simply not able to run deepspeed train.py without running into NCCL errors.
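For reference, a minimal sketch of the kind of setup this amounts to (the config values here are illustrative, not the exact ones from my run; the Trainer's DeepSpeed integration accepts either a config dict or a path to a ds_config.json):
from transformers import TrainingArguments

# Illustrative ZeRO stage 2 + fp16 config; "auto" lets the integration fill in
# values that match the TrainingArguments.
ds_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

training_args = TrainingArguments(
    output_dir="results",
    per_device_train_batch_size=1,
    fp16=True,
    deepspeed=ds_config,
)
# Launched with: deepspeed train.py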
Environment info
transformers version: 4.9.1
Who can help
@sgugger @patil-suraj
Information
Model I am using (Bert, XLNet ...): EleutherAI/gpt-neo-1.3B
The problem arises when using: my own modified scripts (train.py below)
The tasks I am working on is: my own task or dataset (dialog fine-tuning)
To reproduce
I'm running the Trainer class and I'm essentially just fine-tuning a GPT-Neo variant. I don't use any specific CLI options and just call python train.py.
What happens? With EleutherAI/gpt-neo-1.3B I am running into CUDA OOM errors depending on how many GPUs I want to use for training, so effectively I am unable to train with more than 2 GPUs.
The memory consumption on those two GPUs is also very imbalanced:
I also tried running the training script with the torch.distributed command, but that doesn't work for me either. Am I missing something obvious?
Expected behavior
The Trainer should be able to handle more than 2 GPUs.