huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Split models to multiple GPUs #2114

Closed dkajtoch closed 4 years ago

dkajtoch commented 4 years ago

I want to fine-tune GPT2-large, which simply does not fit into GPU memory. I wanted to run the script run_lm_finetuning.py with GPT2-large on two Nvidia Tesla P100s, but I suppose model splitting is not supported. Or am I wrong?

LysandreJik commented 4 years ago

Indeed, as of now we don't support model splitting across different GPUs. However, I believe Tesla P100s have 16gb (or 12?) of VRAM and GPT-2 XL fits in ~7-8gb of VRAM. Do you get an OOM error when loading GPT-2 large in memory?

dkajtoch commented 4 years ago

Thanks @LysandreJik. I trained gpt2-medium and it took almost the whole RAM (~15 GB). When I tried the same with gpt2-large, the script was interrupted with a "Killed" message twice, and I didn't try further.

anandhperumal commented 4 years ago

@LysandreJik XL needs around 7 GB for inference, but for fine-tuning it needs more. @dkajtoch did you try reducing your batch size?

dkajtoch commented 4 years ago

@anandhperumal I have the batch size set to 1 and gradient accumulation steps set to 32. I am running on Google Cloud's dedicated virtual machine for deep learning with PyTorch 1.2 and CUDA 10.0. I can investigate it further if you direct me.

I am fine-tuning gpt2-medium right now; here is a screenshot from nvidia-smi: [screenshot]

anandhperumal commented 4 years ago

@dkajtoch for the time being, keep the gradient accumulation at 1 and let me know if it is able to run for one batch.
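
(For context: gradient accumulation only changes how often the optimizer steps; each forward/backward pass still sees a single micro-batch, so by itself it should not raise peak memory. A generic, self-contained sketch of the pattern with a placeholder model, not the script's actual code:)

import torch

# Placeholder model and data purely to illustrate the accumulation pattern.
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
data = [(torch.randn(1, 10), torch.randn(1, 1)) for _ in range(64)]

accumulation_steps = 32  # effective batch = 32 micro-batches of size 1
optimizer.zero_grad()
for step, (x, y) in enumerate(data):
    loss = torch.nn.functional.mse_loss(model(x), y)
    (loss / accumulation_steps).backward()     # gradients add up in .grad
    if (step + 1) % accumulation_steps == 0:   # optimizer steps once per 32 micro-batches
        optimizer.step()
        optimizer.zero_grad()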

dkajtoch commented 4 years ago

@anandhperumal here is what I get when trying to run gpt2-large on Google Colab with Nvidia P100:

12/10/2019 21:26:39 - WARNING - __main__ -   Process rank: -1, device: cuda, n_gpu: 1, distributed training: False, 16-bits training: False
12/10/2019 21:26:39 - INFO - transformers.file_utils -   https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-large-config.json not found in cache or force_download set to True, downloading to /tmp/tmprqss7xx9
100% 529/529 [00:00<00:00, 394731.69B/s]
12/10/2019 21:26:39 - INFO - transformers.file_utils -   copying /tmp/tmprqss7xx9 to cache at /root/.cache/torch/transformers/c8f887cdfff4327916f4b7ed06a379c0add42bd9c66e1fe3b4a5a8525a4b2678.bc44facd742477605da5434f20a32607ead98e78fff95c5ca9523e47b453e1ad
12/10/2019 21:26:39 - INFO - transformers.file_utils -   creating metadata file for /root/.cache/torch/transformers/c8f887cdfff4327916f4b7ed06a379c0add42bd9c66e1fe3b4a5a8525a4b2678.bc44facd742477605da5434f20a32607ead98e78fff95c5ca9523e47b453e1ad
12/10/2019 21:26:39 - INFO - transformers.file_utils -   removing temp file /tmp/tmprqss7xx9
12/10/2019 21:26:39 - INFO - transformers.configuration_utils -   loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-large-config.json from cache at /root/.cache/torch/transformers/c8f887cdfff4327916f4b7ed06a379c0add42bd9c66e1fe3b4a5a8525a4b2678.bc44facd742477605da5434f20a32607ead98e78fff95c5ca9523e47b453e1ad
12/10/2019 21:26:39 - INFO - transformers.configuration_utils -   Model config {
  "attn_pdrop": 0.1,
  "embd_pdrop": 0.1,
  "finetuning_task": null,
  "initializer_range": 0.02,
  "is_decoder": false,
  "layer_norm_epsilon": 1e-05,
  "n_ctx": 1024,
  "n_embd": 1280,
  "n_head": 20,
  "n_layer": 36,
  "n_positions": 1024,
  "num_labels": 1,
  "output_attentions": false,
  "output_hidden_states": false,
  "output_past": true,
  "pruned_heads": {},
  "resid_pdrop": 0.1,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "torchscript": false,
  "use_bfloat16": false,
  "vocab_size": 50257
}

12/10/2019 21:26:39 - INFO - transformers.file_utils -   https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-large-vocab.json not found in cache or force_download set to True, downloading to /tmp/tmphav3yghk
100% 1042301/1042301 [00:00<00:00, 6030201.52B/s]
12/10/2019 21:26:40 - INFO - transformers.file_utils -   copying /tmp/tmphav3yghk to cache at /root/.cache/torch/transformers/69f8d734111f39eaa51a85907bfdc81a7ef42242d638ffab6f77df305402b2b2.1512018be4ba4e8726e41b9145129dc30651ea4fec86aa61f4b9f40bf94eac71
12/10/2019 21:26:40 - INFO - transformers.file_utils -   creating metadata file for /root/.cache/torch/transformers/69f8d734111f39eaa51a85907bfdc81a7ef42242d638ffab6f77df305402b2b2.1512018be4ba4e8726e41b9145129dc30651ea4fec86aa61f4b9f40bf94eac71
12/10/2019 21:26:40 - INFO - transformers.file_utils -   removing temp file /tmp/tmphav3yghk
12/10/2019 21:26:40 - INFO - transformers.file_utils -   https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-large-merges.txt not found in cache or force_download set to True, downloading to /tmp/tmpnslvtbfy
100% 456318/456318 [00:00<00:00, 3892131.92B/s]
12/10/2019 21:26:40 - INFO - transformers.file_utils -   copying /tmp/tmpnslvtbfy to cache at /root/.cache/torch/transformers/38d28acc17953e356348dca948e152c653c0ccf5058a552eea30168e27f02046.70bec105b4158ed9a1747fea67a43f5dee97855c64d62b6ec3742f4cfdb5feda
12/10/2019 21:26:40 - INFO - transformers.file_utils -   creating metadata file for /root/.cache/torch/transformers/38d28acc17953e356348dca948e152c653c0ccf5058a552eea30168e27f02046.70bec105b4158ed9a1747fea67a43f5dee97855c64d62b6ec3742f4cfdb5feda
12/10/2019 21:26:40 - INFO - transformers.file_utils -   removing temp file /tmp/tmpnslvtbfy
12/10/2019 21:26:40 - INFO - transformers.tokenization_utils -   loading file https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-large-vocab.json from cache at /root/.cache/torch/transformers/69f8d734111f39eaa51a85907bfdc81a7ef42242d638ffab6f77df305402b2b2.1512018be4ba4e8726e41b9145129dc30651ea4fec86aa61f4b9f40bf94eac71
12/10/2019 21:26:40 - INFO - transformers.tokenization_utils -   loading file https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-large-merges.txt from cache at /root/.cache/torch/transformers/38d28acc17953e356348dca948e152c653c0ccf5058a552eea30168e27f02046.70bec105b4158ed9a1747fea67a43f5dee97855c64d62b6ec3742f4cfdb5feda
12/10/2019 21:26:40 - INFO - transformers.file_utils -   https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-large-pytorch_model.bin not found in cache or force_download set to True, downloading to /tmp/tmppfw2_223
100% 3247202234/3247202234 [01:12<00:00, 44997623.14B/s]
12/10/2019 21:27:53 - INFO - transformers.file_utils -   copying /tmp/tmppfw2_223 to cache at /root/.cache/torch/transformers/bcc61dff8b1b03d0fd33a1eb1dc4db00875cae33296848155c6882d4bab03db4.999a50942f8e31ea6fa89ec2580cb38fa40e3db5aa46102d0406bcfa77d9142d
12/10/2019 21:28:05 - INFO - transformers.file_utils -   creating metadata file for /root/.cache/torch/transformers/bcc61dff8b1b03d0fd33a1eb1dc4db00875cae33296848155c6882d4bab03db4.999a50942f8e31ea6fa89ec2580cb38fa40e3db5aa46102d0406bcfa77d9142d
12/10/2019 21:28:05 - INFO - transformers.file_utils -   removing temp file /tmp/tmppfw2_223
12/10/2019 21:28:06 - INFO - transformers.modeling_utils -   loading weights file https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-large-pytorch_model.bin from cache at /root/.cache/torch/transformers/bcc61dff8b1b03d0fd33a1eb1dc4db00875cae33296848155c6882d4bab03db4.999a50942f8e31ea6fa89ec2580cb38fa40e3db5aa46102d0406bcfa77d9142d
12/10/2019 21:28:44 - INFO - __main__ -   Training/evaluation parameters Namespace(adam_epsilon=1e-08, block_size=1024, cache_dir='', config_name='', device=device(type='cuda'), do_eval=False, do_lower_case=False, do_train=True, eval_all_checkpoints=False, eval_data_file=None, evaluate_during_training=False, fp16=False, fp16_opt_level='O1', gradient_accumulation_steps=1, learning_rate=6e-05, local_rank=-1, logging_steps=50, max_grad_norm=1.0, max_steps=1, mlm=False, mlm_probability=0.15, model_name_or_path='gpt2-large', model_type='gpt2', n_gpu=1, no_cuda=False, num_train_epochs=1.0, output_dir='finetuning', overwrite_cache=False, overwrite_output_dir=False, per_gpu_eval_batch_size=4, per_gpu_train_batch_size=1, save_steps=50, save_total_limit=None, seed=42, server_ip='', server_port='', tokenizer_name='', train_data_file='shakespeares.txt', warmup_steps=0, weight_decay=0.0)
12/10/2019 21:28:44 - INFO - __main__ -   Creating features from dataset file at 
12/10/2019 21:28:51 - INFO - __main__ -   Saving features into cached file gpt2-large_cached_lm_1024_shakespeares.txt
12/10/2019 21:28:51 - INFO - __main__ -   ***** Running training *****
12/10/2019 21:28:51 - INFO - __main__ -     Num examples = 1783
12/10/2019 21:28:51 - INFO - __main__ -     Num Epochs = 1
12/10/2019 21:28:51 - INFO - __main__ -     Instantaneous batch size per GPU = 1
12/10/2019 21:28:51 - INFO - __main__ -     Total train batch size (w. parallel, distributed & accumulation) = 1
12/10/2019 21:28:51 - INFO - __main__ -     Gradient Accumulation steps = 1
12/10/2019 21:28:51 - INFO - __main__ -     Total optimization steps = 1
Epoch:   0% 0/1 [00:00<?, ?it/s]
Iteration:   0% 0/1783 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/content/transformers/examples/run_lm_finetuning.py", line 594, in <module>
    main()
  File "/content/transformers/examples/run_lm_finetuning.py", line 546, in main
    global_step, tr_loss = train(args, train_dataset, model, tokenizer)
  File "/content/transformers/examples/run_lm_finetuning.py", line 261, in train
    outputs = model(inputs, masked_lm_labels=labels) if args.mlm else model(inputs, labels=labels)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/transformers/modeling_gpt2.py", line 549, in forward
    inputs_embeds=inputs_embeds)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/transformers/modeling_gpt2.py", line 460, in forward
    head_mask=head_mask[i])
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/transformers/modeling_gpt2.py", line 232, in forward
    head_mask=head_mask)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/transformers/modeling_gpt2.py", line 193, in forward
    attn_outputs = self._attn(query, key, value, attention_mask, head_mask)
  File "/usr/local/lib/python3.6/dist-packages/transformers/modeling_gpt2.py", line 145, in _attn
    w = torch.matmul(q, k)
RuntimeError: CUDA out of memory. Tried to allocate 80.00 MiB (GPU 0; 15.90 GiB total capacity; 15.16 GiB already allocated; 11.88 MiB free; 34.49 MiB cached)

The script is executed with the following flags:

!python /content/transformers/examples/run_lm_finetuning.py \
    --train_data_file=shakespeares.txt \
    --output_dir=finetuning \
    --model_type=gpt2 \
    --model_name_or_path=gpt2-large \
    --do_train \
    --per_gpu_train_batch_size=1 \
    --gradient_accumulation_steps=1 \
    --learning_rate=0.00006 \
    --max_steps=1

dkajtoch commented 4 years ago

BTW, from the gpt2-simple repo: [screenshot]

SarikGhazarian commented 4 years ago

I am facing the same issue. I am able to fine-tune gpt2 and gpt2-medium but not the gpt2-large. I tried batch_size=1 and gradient_accumulation_steps=1 but still have the same issue.

anandhperumal commented 4 years ago

@dkajtoch inference would never take too much memory. Can you try loading the model onto your GPU and telling us how much memory is being used? And did you try apex?

dkajtoch commented 4 years ago

@anandhperumal I loaded the models with the following commands in Colab:

import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained('gpt2-large')
model = GPT2LMHeadModel.from_pretrained('gpt2-large')
model.to(torch.device("cuda"))
!nvidia-smi

and gpt2-medium takes about 2 GB whereas gpt2-large ~3.6 GB. I haven't tried apex because I do not know what that is. I just wanted to know if it is possible to train gpt2-large or larger on a GPU, but it seems it is not.
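
(For reference, torch can also report the allocation directly, which avoids reading nvidia-smi; a minimal check run right after moving the model to the GPU:)

import torch

# Memory held by PyTorch tensors on the current GPU (excludes the CUDA context itself).
print(f"allocated: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")
print(f"peak:      {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")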

dkajtoch commented 4 years ago

I installed apex and set the fp16 flag, but I get the same out-of-memory error.

anandhperumal commented 4 years ago

@dkajtoch I ran the following code on Colab and it works perfectly fine. I would recommend writing your own training code rather than using the huggingface example script: [screenshot]
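
(The screenshot isn't reproduced here. As a rough illustration of the kind of standalone loop being suggested, and not the code from the screenshot, a sketch along these lines should work; the file name, block size, and hyperparameters are placeholders:)

import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import GPT2Tokenizer, GPT2LMHeadModel

device = torch.device("cuda")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2-large")
model = GPT2LMHeadModel.from_pretrained("gpt2-large").to(device)

# Tokenize the corpus and cut it into fixed-length blocks (512 instead of 1024 to save memory).
block_size = 512
with open("shakespeares.txt", encoding="utf-8") as f:
    ids = tokenizer.encode(f.read())
blocks = [ids[i:i + block_size] for i in range(0, len(ids) - block_size + 1, block_size)]
dataset = TensorDataset(torch.tensor(blocks, dtype=torch.long))
loader = DataLoader(dataset, batch_size=1, shuffle=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
accumulation_steps = 32

model.train()
for step, (batch,) in enumerate(loader):
    batch = batch.to(device)
    # For causal LM fine-tuning the labels are simply the inputs; the model shifts them internally.
    outputs = model(batch, labels=batch)
    loss = outputs[0]
    (loss / accumulation_steps).backward()
    if (step + 1) % accumulation_steps == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        optimizer.zero_grad()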

dkajtoch commented 4 years ago

Thanks @anandhperumal. That is a positive message. So it can work on a GPU, but it does not with the huggingface script. Maybe this needs further investigation and a fix could be pushed.

anandhperumal commented 4 years ago

@dkajtoch you can still use the huggingface library, just don't use run_lm_finetuning.py, or debug it yourself. It would be great to investigate this problem, but it is very subtle. Anyway, I think you can train your model with your own script.

dkajtoch commented 4 years ago

Right, @anandhperumal!

PyxAI commented 4 years ago

I am dealing with long sentences and found that setting block_size overcame the out-of-memory issue. I had batch size = 1 and gradient accumulation = 1 and still got out of memory on a Tesla P100 (16 GB), until I set block_size to truncate the input sentences. I'm not sure how it affects the quality of the results yet, though.

anandhperumal commented 4 years ago

If block_size is the problem for you, then rather than truncating the overall input sequence, you can change the code to handle a batch-wise max length; that should help you.
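
(A sketch of what batch-wise max length could look like: pad each batch only to its own longest example instead of a fixed global block_size. This collate function is illustrative and not part of the script; reusing the EOS id as padding is an assumption, since GPT-2 ships without a pad token.)

import torch
from torch.nn.utils.rnn import pad_sequence
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
pad_id = tokenizer.eos_token_id  # assumption: reuse EOS as padding, GPT-2 has no pad token

def collate_batchwise(examples):
    # examples: list of 1-D LongTensors of varying length
    input_ids = pad_sequence(examples, batch_first=True, padding_value=pad_id)
    # Padded label positions are ignored by the LM loss (-100 in current transformers; very old releases used -1).
    labels = pad_sequence(examples, batch_first=True, padding_value=-100)
    return input_ids, labels

# Usage: each batch is only as wide as its longest sentence.
sentences = ["To be, or not to be", "All the world's a stage, and all the men and women merely players"]
examples = [torch.tensor(tokenizer.encode(s), dtype=torch.long) for s in sentences]
batch_inputs, batch_labels = collate_batchwise(examples)

(Since GPT-2 is causal, right padding only affects predictions at the padded positions themselves, which the loss ignores.)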

PyxAI commented 4 years ago

@anandhperumal The code already handles the length per batch with args.block_size = min(args.block_size, tokenizer.max_len_single_sentence)

anandhperumal commented 4 years ago

@PyxAI You tried with even a batch size of 1, so what is your max sequence length? And what kind of dataset are you using?