Closed: stas00 closed this issue 3 years ago.
Well, I'm closing this right away, since it's not a bug, but feel free to comment or ask questions in the comments.
(I'm adding to this issue, even though it's closed, because it's directly related)
I am seeing OOM trying to get this to work: 1 GPU, SeqLength 128 (originally tried 256), buffers {2e8, 3e8, 5e8} (just changes the epoch of the OOM), BS=1.
@stas00 , I kept track of the GPU memory (as reported in nvidia-smi) to see if it's a progressive memory leak, but I don't think it is:
Runscript: (Note I am using unifiedqa-t5-11b, which is just a fine-tuned t5-11b -- I don't think that should change anything)
export DATADIR=/home/pajansen/11b-data/ \
export SEQLEN=128 \
export OUTPUTDIR=output_dir \
export BS=1; rm -rf $OUTPUTDIR; PYTHONPATH=../../src USE_TF=0 /usr/bin/time -v deepspeed --num_gpus=1 ./finetune_trainer.py --model_name_or_path allenai/unifiedqa-t5-11b --output_dir $OUTPUTDIR --adam_eps 1e-06 --data_dir $DATADIR \
--do_eval --do_predict --do_train --evaluation_strategy=steps --freeze_embeds --label_smoothing 0.1 --learning_rate 3e-5 \
--logging_first_step --logging_steps 1000 --max_source_length $SEQLEN --max_target_length $SEQLEN --num_train_epochs 2 \
--overwrite_output_dir --per_device_eval_batch_size $BS --per_device_train_batch_size $BS \
--predict_with_generate --sortish_sampler \
--test_max_target_length $SEQLEN --val_max_target_length $SEQLEN \
--warmup_steps 5 \
--deepspeed ds_config.json --fp16 \
Conda environment:
# Make new environment
conda create --name transformers-feb4-2020 python=3.8
conda activate transformers-feb4-2020
# Clone transformers
git clone https://github.com/huggingface/transformers.git
cd transformers
# Install nightly build of Pytorch
pip install --pre torch torchvision -f https://download.pytorch.org/whl/nightly/cu110/torch_nightly.html -U
# Install seq2seq transformers requirements
pip install -r examples/seq2seq/requirements.txt
# Install transformers
pip install -e .
# Install DeepSpeed from source for the A100 support
cd ..
git clone https://github.com/microsoft/DeepSpeed.git
cd DeepSpeed/
./install.sh
pip install .
The monster output: oom-feb4-t5-11b.txt
Just the last bit of the output: (the overflow errors are probably noteworthy?)
Using /home/pajansen/.cache/torch_extensions as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0005221366882324219 seconds
[INFO|trainer.py:837] 2021-02-04 15:05:54,964 >> ***** Running training *****
[INFO|trainer.py:838] 2021-02-04 15:05:54,964 >> Num examples = 592
[INFO|trainer.py:839] 2021-02-04 15:05:54,964 >> Num Epochs = 2
[INFO|trainer.py:840] 2021-02-04 15:05:54,964 >> Instantaneous batch size per device = 1
[INFO|trainer.py:841] 2021-02-04 15:05:54,964 >> Total train batch size (w. parallel, distributed & accumulation) = 1
[INFO|trainer.py:842] 2021-02-04 15:05:54,964 >> Gradient Accumulation steps = 1
[INFO|trainer.py:843] 2021-02-04 15:05:54,964 >> Total optimization steps = 1184
0%| | 0/1184 [00:00<?, ?it/s][2021-02-04 15:05:58,447] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4294967296, reducing to 4294967296
{'loss': inf, 'learning_rate': 0.0, 'epoch': 0.0}
0%|β | 1/1184 [00:03<1:08:20, 3.47s/it][2021-02-04 15:06:02,124] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4294967296, reducing to 2147483648.0
0%|β | 2/1184 [00:07<1:09:31, 3.53s/it][2021-02-04 15:06:05,853] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2147483648.0, reducing to 1073741824.0
0%|β | 3/1184 [00:10<1:10:38, 3.59s/it][2021-02-04 15:06:09,757] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1073741824.0, reducing to 536870912.0
0%|β | 4/1184 [00:14<1:12:26, 3.68s/it][2021-02-04 15:06:13,120] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 536870912.0, reducing to 268435456.0
0%|β | 5/1184 [00:18<1:10:29, 3.59s/it][2021-02-04 15:06:16,495] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 268435456.0, reducing to 134217728.0
1%|β | 6/1184 [00:21<1:09:10, 3.52s/it][2021-02-04 15:06:19,825] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 134217728.0, reducing to 67108864.0
1%|β | 7/1184 [00:24<1:07:59, 3.47s/it][2021-02-04 15:06:23,182] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 67108864.0, reducing to 33554432.0
1%|ββ | 8/1184 [00:28<1:07:17, 3.43s/it][2021-02-04 15:06:26,854] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 33554432.0, reducing to 16777216.0
1%|ββ | 9/1184 [00:31<1:08:37, 3.50s/it][2021-02-04 15:06:30,436] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 16777216.0, reducing to 8388608.0
1%|ββ | 10/1184 [00:35<1:09:01, 3.53s/it][2021-02-04 15:06:33,801] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 8388608.0, reducing to 4194304.0
1%|ββ | 11/1184 [00:38<1:08:00, 3.48s/it][2021-02-04 15:06:37,147] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4194304.0, reducing to 2097152.0
1%|ββ | 12/1184 [00:42<1:07:10, 3.44s/it][2021-02-04 15:06:40,510] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2097152.0, reducing to 1048576.0
1%|ββ | 13/1184 [00:45<1:06:40, 3.42s/it][2021-02-04 15:06:43,887] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1048576.0, reducing to 524288.0
1%|βββ | 14/1184 [00:48<1:06:23, 3.40s/it][2021-02-04 15:06:47,250] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 524288.0, reducing to 262144.0
1%|βββ | 15/1184 [00:52<1:06:05, 3.39s/it][2021-02-04 15:06:50,615] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144.0, reducing to 131072.0
1%|βββ | 16/1184 [00:55<1:05:52, 3.38s/it][2021-02-04 15:06:53,976] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072.0, reducing to 65536.0
1%|βββ | 17/1184 [00:58<1:05:41, 3.38s/it][2021-02-04 15:06:57,313] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536.0, reducing to 32768.0
2%|βββ | 18/1184 [01:02<1:05:23, 3.36s/it][2021-02-04 15:07:00,672] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 32768.0, reducing to 16384.0
2%|βββ | 19/1184 [01:05<1:05:18, 3.36s/it][2021-02-04 15:07:04,003] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 16384.0, reducing to 8192.0
2%|ββββ | 20/1184 [01:09<1:05:03, 3.35s/it][2021-02-04 15:07:07,382] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 8192.0, reducing to 4096.0
2%|ββββ | 21/1184 [01:12<1:05:08, 3.36s/it][2021-02-04 15:07:10,753] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4096.0, reducing to 2048.0
2%|ββββ | 22/1184 [01:15<1:05:09, 3.36s/it][2021-02-04 15:07:14,118] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2048.0, reducing to 1024.0
2%|ββββ | 23/1184 [01:19<1:05:06, 3.36s/it][2021-02-04 15:07:17,475] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1024.0, reducing to 512.0
2%|ββββ | 24/1184 [01:22<1:05:00, 3.36s/it][2021-02-04 15:07:20,816] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 512.0, reducing to 256.0
2%|ββββ | 25/1184 [01:25<1:04:49, 3.36s/it][2021-02-04 15:07:24,174] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 256.0, reducing to 128.0
2%|ββββ | 26/1184 [01:29<1:04:46, 3.36s/it]Killing subprocess 3319579
Traceback (most recent call last):
File "/home/pajansen/anaconda3/envs/transformers-feb4-2020/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/pajansen/anaconda3/envs/transformers-feb4-2020/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/pajansen/anaconda3/envs/transformers-feb4-2020/lib/python3.8/site-packages/deepspeed/launcher/launch.py", line 171, in <module>
main()
File "/home/pajansen/anaconda3/envs/transformers-feb4-2020/lib/python3.8/site-packages/deepspeed/launcher/launch.py", line 161, in main
sigkill_handler(signal.SIGTERM, None) # not coming back
File "/home/pajansen/anaconda3/envs/transformers-feb4-2020/lib/python3.8/site-packages/deepspeed/launcher/launch.py", line 139, in sigkill_handler
raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/pajansen/anaconda3/envs/transformers-feb4-2020/bin/python', '-u', './finetune_trainer.py', '--local_rank=0', '--model_name_or_path', 'allenai/unifiedqa-t5-11b', '--output_dir', 'output_dir_compexpl-feb4-epoch2-uqa-11b-wholetree-rev', '--adam_eps', '1e-06', '--data_dir', '/home/pajansen/github/compositional-expl/data/feb4-initialtest-q693/wholetree-rev/', '--do_eval', '--do_predict', '--do_train', '--evaluation_strategy=steps', '--freeze_embeds', '--label_smoothing', '0.1', '--learning_rate', '3e-5', '--logging_first_step', '--logging_steps', '1000', '--max_source_length', '128', '--max_target_length', '128', '--num_train_epochs', '2', '--overwrite_output_dir', '--per_device_eval_batch_size', '1', '--per_device_train_batch_size', '1', '--predict_with_generate', '--sortish_sampler', '--test_max_target_length', '128', '--val_max_target_length', '128', '--warmup_steps', '5', '--deepspeed', 'ds_config.json', '--fp16']' died with <Signals.SIGSEGV: 11>.
Command being timed: "deepspeed --num_gpus=1 ./finetune_trainer.py --model_name_or_path allenai/unifiedqa-t5-11b --output_dir output_dir_compexpl-feb4-epoch2-uqa-11b-wholetree-rev --adam_eps 1e-06 --data_dir /home/pajansen/github/compositional-expl/data/feb4-initialtest-q693/wholetree-rev/ --do_eval --do_predict --do_train --evaluation_strategy=steps --freeze_embeds --label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 --num_train_epochs 2 --overwrite_output_dir --per_device_eval_batch_size 1 --per_device_train_batch_size 1 --predict_with_generate --sortish_sampler --test_max_target_length 128 --val_max_target_length 128 --warmup_steps 5 --deepspeed ds_config.json --fp16"
User time (seconds): 1152.16
System time (seconds): 746.75
Percent of CPU this job got: 396%
Elapsed (wall clock) time (h:mm:ss or m:ss): 7:58.47
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 233292336
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 108071918
Voluntary context switches: 38621
Involuntary context switches: 588867
Swaps: 0
File system inputs: 0
File system outputs: 48
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
Thank you for the report and the details, @PeterAJansen
In the future, let's try to have a dedicated issue for each unique problem, but since the OP wasn't really an issue, it is now ;) so all is good.
Let me see if I can reproduce the problem with your changes, perhaps my data sample was too short.
The other difference I see is that you're not using --task, which then defaults to summarization - so we surely don't test the exact same thing.
The allenai/unifiedqa-t5-11b model looks to be of identical size to t5-11b, but let me download the former to make sure that I'm doing an exact reproduction. Let me see.
(the overflow errors are probably noteworthy?)
These are normal, not a problem.
OK, I'm able to reproduce it. The GPU memory usage grows slowly at times and jumps up by several GBs at other times.
I used buffers of 1e8 and cmd:
export BS=2; rm -rf output_dir; PYTHONPATH=../../src USE_TF=0 /usr/bin/time -v deepspeed --num_gpus=1 ./finetune_trainer.py --model_name_or_path allenai/unifiedqa-t5-11b --output_dir output_dir --adam_eps 1e-06 --data_dir wmt_en_ro --do_eval --do_predict --do_train --evaluation_strategy=steps --freeze_embeds --label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 --num_train_epochs 1 --overwrite_output_dir --per_device_eval_batch_size $BS --per_device_train_batch_size $BS --predict_with_generate --eval_steps 25000 --sortish_sampler --task translation_en_to_ro --test_max_target_length 128 --val_max_target_length 128 --warmup_steps 5 --deepspeed ds_config.json --fp16
Which means that either transformers (trainer or model) or DeepSpeed or both leak memory. I'm going to switch to a much smaller model size as with this model it takes ages for it to just start - can't develop like this and try to detect where the leak is coming from.
BTW, here is a tip. Currently transformers performs a silly thing - it inits the model, inits the weights, and overwrites all this work with pretrained weights. Which with this model takes like 10 minutes. You can shortcut it with:
--- a/src/transformers/modeling_utils.py
+++ b/src/transformers/modeling_utils.py
@@ -747,7 +747,7 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin, GenerationMixin):
Initializes and prunes weights if needed.
"""
# Initialize weights
- self.apply(self._init_weights)
+ #self.apply(self._init_weights)
# Prune heads if needed
if self.config.pruned_heads:
which skips 90% of the pointless weight inits.
I'm trying to advocate for this to be a feature here: https://github.com/huggingface/transformers/issues/9205
Heh, we were assuming it was OOM, but it got SIGSEGV - I didn't bother to look closer - so pytorch w/Deepspeed segfaults pretty much at step 22. Investigating...
No useful info in the core bt. Stripped binaries.
I eliminated the possibility that the issue could be with pytorch.
Most likely a regression in DS.
Downgrading with pip install deepspeed==0.3.10 solves the segfault.
I must have been using an old DS yesterday and that's why it was working for me.
Trying to locate the faulty commit in DS
And the reason it was always happening at step 22 is that AdamW wasn't running until that step - these are all the skipped-step overflow reports:
[2021-02-04 22:40:47,424] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2048.0, reducing to 1024.0
0%| | 23/60000 [01:18<55:05:44, 3.31s/it][2021-02-04 22:40:50,837] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1024.0, reducing to 512.0
0%| | 24/60000 [01:21<55:37:22, 3.34s/it][2021-02-04 22:40:54,255] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 512.0, reducing to 256.0
As soon as it ran, it segfaulted.
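To make the pattern in those reports concrete, here is a toy sketch (purely illustrative, not DeepSpeed's actual code) of the dynamic loss scaling behaviour: every overflow skips the optimizer step and halves the scale, so the optimizer only gets to run once the scale has dropped far enough.
# Toy illustration of the skipped-step / loss-scale-halving pattern in the logs above.
# The overflow test here is a stand-in for the real inf/nan gradient check.
loss_scale = 2.0 ** 32
for step in range(1, 30):
    overflow = loss_scale > 256.0
    if overflow:
        loss_scale /= 2
        print(f"step {step}: OVERFLOW! skipping step, reducing loss scale to {loss_scale}")
    else:
        print(f"step {step}: loss scale {loss_scale} is usable - optimizer.step() finally runs")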
Hopefully we will have a fix soon, but until then please use deepspeed==0.3.10
Thanks @stas00 !
I have downgraded to deepspeed 0.3.10 and I'm going to leave Transformers running overnight on a proper training job to see if it crashes (it's currently about 20% completed, so that's promising). Though it does appear that the GPU memory usage periodically moves from ~34GB up to nearly the entire 40GB minus a few hundred MB, so it's a real nail biter watching it:
Transformers+DeepSpeed really doesn't believe in wasting RAM... :)
Update: DeepSpeed yanked 0.3.11 from pypi, so a normal pip install should now result in the working 0.3.10 being installed, until this issue is fixed.
Update on my end: with DeepSpeed 0.3.10 it did run successfully through the night on a full job, successfully training and generating the predictions. Amazing work @stas00 et al.
@stas00 I'm not sure if this is a bug or if I'm just not doing it correctly given how fast most of this is moving, but I'm trying to evaluate/generate predictions post-training and getting not-on-device errors. I should note that it worked fine when I did the whole thing in one command (train/eval/predict) overnight, but now I'm trying to use the fine-tuned model to generate predictions on other data.
I have (a) just removed the --do_train flag from the call to finetune_trainer (and set the model path to the output path of the fine-tuned model), and this gives an error (below). I've also (b) tried CPU-based eval (--device cpu) with the official instructions in examples/seq2seq/, which gave a different error (but I've not done non-cuda eval before, so that might be my issue).
Here's the error from (A):
[2021-02-05 12:00:30,238] [WARNING] [runner.py:117:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2021-02-05 12:00:30,586] [INFO] [runner.py:355:main] cmd = /home/pajansen/anaconda3/envs/transformers-feb4-2020/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=29500 ./finetune_trainer.py --model_name_or_path output_dir_compexpl-feb4-epoch3-uqa-11b-wholetree-rev --output_dir output_dir_compexpl-feb4-epoch3-uqa-11b-wholetree-rev-unannotated --adam_eps 1e-06 --data_dir /home/pajansen/github/compexpl/data/feb4-initialtest-q693/unannotated/ --do_eval --do_predict --evaluation_strategy=steps --freeze_embeds --label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step --logging_steps 1000 --max_source_length 256 --max_target_length 256 --num_train_epochs 3 --overwrite_output_dir --per_device_eval_batch_size 1 --per_device_train_batch_size 1 --predict_with_generate --sortish_sampler --test_max_target_length 256 --val_max_target_length 256 --warmup_steps 5 --deepspeed ds_config.json --fp16
[2021-02-05 12:00:31,464] [INFO] [launch.py:78:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
[2021-02-05 12:00:31,464] [INFO] [launch.py:84:main] nnodes=1, num_local_procs=4, node_rank=0
[2021-02-05 12:00:31,464] [INFO] [launch.py:99:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
[2021-02-05 12:00:31,464] [INFO] [launch.py:100:main] dist_world_size=4
[2021-02-05 12:00:31,464] [INFO] [launch.py:102:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
[2021-02-05 12:00:33,681] [INFO] [distributed.py:39:init_distributed] Initializing torch distributed with backend: nccl
[2021-02-05 12:00:33,788] [INFO] [distributed.py:39:init_distributed] Initializing torch distributed with backend: nccl
[2021-02-05 12:00:33,908] [INFO] [distributed.py:39:init_distributed] Initializing torch distributed with backend: nccl
[2021-02-05 12:00:34,042] [INFO] [distributed.py:39:init_distributed] Initializing torch distributed with backend: nccl
WARNING:__main__:Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, 16-bits training: True
[INFO|configuration_utils.py:447] 2021-02-05 12:00:34,625 >> loading configuration file output_dir_compexpl-feb4-epoch3-uqa-11b-wholetree-rev/config.json
[INFO|configuration_utils.py:485] 2021-02-05 12:00:34,626 >> Model config T5Config {
"_name_or_path": "allenai/unifiedqa-t5-11b",
"architectures": [
"T5ForConditionalGeneration"
],
"d_ff": 65536,
"d_kv": 128,
"d_model": 1024,
"decoder_start_token_id": 0,
"dropout_rate": 0.1,
"early_stopping": true,
"eos_token_id": 1,
"feed_forward_proj": "relu",
"initializer_factor": 1.0,
"is_encoder_decoder": true,
"layer_norm_epsilon": 1e-06,
"length_penalty": 2.0,
"max_length": 200,
"min_length": 30,
"model_type": "t5",
"n_positions": 512,
"no_repeat_ngram_size": 3,
"num_beams": 4,
"num_decoder_layers": 24,
"num_heads": 128,
"num_layers": 24,
"output_past": true,
"pad_token_id": 0,
"prefix": "summarize: ",
"relative_attention_num_buckets": 32,
"task_specific_params": {
"summarization": {
"early_stopping": true,
"length_penalty": 2.0,
"max_length": 200,
"min_length": 30,
"no_repeat_ngram_size": 3,
"num_beams": 4,
"prefix": "summarize: "
},
"translation_en_to_de": {
"early_stopping": true,
"max_length": 300,
"num_beams": 4,
"prefix": "translate English to German: "
},
"translation_en_to_fr": {
"early_stopping": true,
"max_length": 300,
"num_beams": 4,
"prefix": "translate English to French: "
},
"translation_en_to_ro": {
"early_stopping": true,
"max_length": 300,
"num_beams": 4,
"prefix": "translate English to Romanian: "
}
},
"transformers_version": "4.3.0.dev0",
"use_cache": true,
"vocab_size": 32128
}
(the same configuration file load and T5Config dump is printed a second time by another process)
[INFO|tokenization_utils_base.py:1685] 2021-02-05 12:00:34,627 >> Model name 'output_dir_compexpl-feb4-epoch3-uqa-11b-wholetree-rev' not found in model shortcut name list (t5-small, t5-base, t5-large, t5-3b, t5-11b). Assuming 'output_dir_compexpl-feb4-epoch3-uqa-11b-wholetree-rev' is a path, a model identifier, or url to a directory containing tokenizer files.
[INFO|tokenization_utils_base.py:1721] 2021-02-05 12:00:34,627 >> Didn't find file output_dir_compexpl-feb4-epoch3-uqa-11b-wholetree-rev/tokenizer.json. We won't load it.
[INFO|tokenization_utils_base.py:1721] 2021-02-05 12:00:34,627 >> Didn't find file output_dir_compexpl-feb4-epoch3-uqa-11b-wholetree-rev/added_tokens.json. We won't load it.
[INFO|tokenization_utils_base.py:1784] 2021-02-05 12:00:34,627 >> loading file output_dir_compexpl-feb4-epoch3-uqa-11b-wholetree-rev/spiece.model
[INFO|tokenization_utils_base.py:1784] 2021-02-05 12:00:34,627 >> loading file None
[INFO|tokenization_utils_base.py:1784] 2021-02-05 12:00:34,627 >> loading file None
[INFO|tokenization_utils_base.py:1784] 2021-02-05 12:00:34,627 >> loading file output_dir_compexpl-feb4-epoch3-uqa-11b-wholetree-rev/special_tokens_map.json
[INFO|tokenization_utils_base.py:1784] 2021-02-05 12:00:34,627 >> loading file output_dir_compexpl-feb4-epoch3-uqa-11b-wholetree-rev/tokenizer_config.json
WARNING:__main__:Process rank: 1, device: cuda:1, n_gpu: 1, distributed training: True, 16-bits training: True
WARNING:__main__:Process rank: 3, device: cuda:3, n_gpu: 1, distributed training: True, 16-bits training: True
WARNING:__main__:Process rank: 2, device: cuda:2, n_gpu: 1, distributed training: True, 16-bits training: True
[INFO|modeling_utils.py:1025] 2021-02-05 12:00:34,753 >> loading weights file output_dir_compexpl-feb4-epoch3-uqa-11b-wholetree-rev/pytorch_model.bin
[INFO|modeling_utils.py:1143] 2021-02-05 12:04:48,021 >> All model checkpoint weights were used when initializing T5ForConditionalGeneration.
[INFO|modeling_utils.py:1151] 2021-02-05 12:04:48,034 >> All the weights of T5ForConditionalGeneration were initialized from the model checkpoint at output_dir_compexpl-feb4-epoch3-uqa-11b-wholetree-rev.
If your task is similar to the task the model of the checkpoint was trained on, you can already use T5ForConditionalGeneration for predictions without further training.
[INFO|trainer.py:348] 2021-02-05 12:04:48,080 >> Using amp fp16 backend
[INFO|trainer.py:1600] 2021-02-05 12:04:48,080 >> ***** Running Evaluation *****
[INFO|trainer.py:1601] 2021-02-05 12:04:48,080 >> Num examples = 1950
[INFO|trainer.py:1602] 2021-02-05 12:04:48,080 >> Batch size = 1
Traceback (most recent call last):
File "./finetune_trainer.py", line 367, in <module>
main()
File "./finetune_trainer.py", line 327, in main
metrics = trainer.evaluate(metric_key_prefix="val")
File "/home/pajansen/github/transformers-feb4-2021/transformers/src/transformers/trainer.py", line 1506, in evaluate
output = self.prediction_loop(
File "/home/pajansen/github/transformers-feb4-2021/transformers/src/transformers/trainer.py", line 1630, in prediction_loop
loss, logits, labels = self.prediction_step(model, inputs, prediction_loss_only, ignore_keys=ignore_keys)
File "/home/pajansen/github/transformers-feb4-2021/transformers/examples/seq2seq/seq2seq_trainer.py", line 220, in prediction_step
generated_tokens = self.model.generate(
File "/home/pajansen/anaconda3/envs/transformers-feb4-2020/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/pajansen/github/transformers-feb4-2021/transformers/src/transformers/generation_utils.py", line 847, in generate
model_kwargs = self._prepare_encoder_decoder_kwargs_for_generation(input_ids, model_kwargs)
File "/home/pajansen/github/transformers-feb4-2021/transformers/src/transformers/generation_utils.py", line 379, in _prepare_encoder_decoder_kwargs_for_generation
model_kwargs["encoder_outputs"]: ModelOutput = encoder(input_ids, return_dict=True, **encoder_kwargs)
File "/home/pajansen/anaconda3/envs/transformers-feb4-2020/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/pajansen/github/transformers-feb4-2021/transformers/src/transformers/models/t5/modeling_t5.py", line 878, in forward
inputs_embeds = self.embed_tokens(input_ids)
File "/home/pajansen/anaconda3/envs/transformers-feb4-2020/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/pajansen/anaconda3/envs/transformers-feb4-2020/lib/python3.8/site-packages/torch/nn/modules/sparse.py", line 145, in forward
return F.embedding(
File "/home/pajansen/anaconda3/envs/transformers-feb4-2020/lib/python3.8/site-packages/torch/nn/functional.py", line 1921, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Input, output and indices must be on the current device
(the same traceback is printed by each of the other three ranks)
Are you on master and not by chance on my experimental t5-pipeline branch? If it's the latter then it's very likely that you'd hit that "not on the current device" error. Please make sure you're using the master transformers.
Definitely on the master :)
Update: I did figure out the CPU eval error -- I had --fp16 set (as in the example script), which currently throws an esoteric pytorch error on CPU ("threshold_cpu" not implemented for 'Half'). Removing this lets it run on CPU, but with 64 cores T5-11B is evaluating at 150 seconds per generation, instead of less than 1 sec with the GPU, so I think I'll kill that.
@PeterAJansen want to confirm with you one detail, is your setup with Intel or AMD cpu?
It's AMD.
I'm using Peter's machine for debugging this, so you can ask me anything.
@PeterAJansen, glad you sorted it out - let me see if I can reproduce that and we could ensure that we prevent the erroneous fp16/cpu combination in first place.
Update on DeepSpeed: it looks like the CPU Adam segfault problem is specific to AMD, which is the case on your computer, so the DeepSpeed team are working on figuring that out and hopefully will have a new release some time soon that will do the right thing on AMD and be fast too.
@PeterAJansen,
I have fixed the first bug where you went for inference without training - please use this PR branch if it's not merged: https://github.com/huggingface/transformers/pull/10039
Well, basically we aren't using deepspeed at the moment at all if --do_train wasn't run - need to think how to benefit from Deepspeed for pure inference. I will experiment with that.
wrt --device cpu - could you please explain how you managed to use it? Since it's not a valid flag for finetune_trainer.py, if you could share the full cmd that would help to reproduce the problem.
Thank you!
@PeterAJansen, for the future let's do this:
Then:
;)
@PeterAJansen,
- I have fixed the first bug where you went for inference without training - please use this PR branch if it's not merged #10039. Well basically we aren't using deepspeed at the moment at all if --do_train wasn't run - need to think how to benefit from Deepspeed for pure inference. I will experiment with that.
Thanks!
- wrt --device cpu could you please explain how you managed to use it? Since it's not a valid flag for finetune_trainer.py, so if you could share the full cmd that would help to reproduce the problem. Thank you!
Apologies, I think in my exhilaration that it's running T5-11B on 40G cards I forgot proper issue submission procedures. The --fp16 error is submitted as issue #10040 :)
both issues have been fixed https://github.com/huggingface/transformers/pull/10039 and https://github.com/huggingface/transformers/pull/10041
@stas00 have you tried profiling Hugging Face models with DeepSpeed's FlopsProfiler? I'm curious to see what kind of stats you get, especially for decoder-only models such as GPT2LMHeadModel as you increase the model size.
I haven't tried yet, as I'm busy at the moment figuring out the pipeline, but I logged that idea here https://github.com/huggingface/transformers/issues/9606 for a later time or in case someone else is moved to do it before I get a chance to do so.
I appreciate the suggestion, @g-karthik. I'm like a kid in a candy store, so many things to try, so little time.
@stas00 not sure if this issue is closed and/or I should start a new thread. But my question is very much related. Here goes:
I followed the instructions mentioned here (same deepspeed version, t5-11b, everything the same). However, on 1x 40GB gpu w/ Deepspeed (A100-SXM4-40GB) it goes OOM. It does not train even with BS=1 using deepspeed.
Still wondering how you were able to train this on 1x A100-SXM4-40GB, since the t5-11b downloaded (automatically by huggingface) pytorch_model.bin file itself has a size of ~45GB (raw file size). Just loading the model will cause OOM on a single 40GB A100-SXM4-40GB.
Am I missing something? Or did the t5-11b model size change since this post?
Srikar
Hi @srikar2097,
deepspeed does model.half() by default, so you are only loading 22.5GB in weights. Though it did add support for fp32 since that post.
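As a rough sanity check on those figures (the parameter count below is an approximation, not an exact number):
# Back-of-the-envelope sizes for t5-11b, which has roughly 11.3e9 parameters:
params = 11.3e9
print(f"fp16 weights: ~{params * 2 / 1e9:.1f} GB")  # ~22.6 GB - the ~22.5GB figure above
print(f"fp32 weights: ~{params * 4 / 1e9:.1f} GB")  # ~45.2 GB - the ~45GB pytorch_model.bin on disk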
Most likely your seq_len is much larger than the test that I did. Does it work if you reduce it?
Also, this is really old now, and you have offload available, so if you have lots of RAM you shouldn't have a problem loading t5-11b on an A100-40GB.
If you are still struggling, then yes, by all means please open a new issue and full details on how to reproduce the problem. and tag me please.
FWIW, I remember having a specific commit that seemed to work for T5-11B in the 40gb A100s, and it not working after -- and me mostly using the T5-3B model for speed, so I haven't tried it recently to see if it still works (without the offloading).
@stas00 thanks for the tips. I did try with seq_len=512 with BS=1. Then with seq_len=128 with BS=1 (both times OOM).
For T5-11b on an A100-40GB, I guess sticking to fp16 is the way to go since fp32 will load the entire model into GPU mem? (which will surely cause OOM since the raw model file itself is 45GB).
My host has 1TB RAM, so do you suggest using offload? Do you have any comments on whether using offload would slow down training? (since optimizer states/gradients have to flow back and forth between GPU <-> CPU)...
@PeterAJansen I am using T5-3b for now since I haven't yet cracked the code with T5-11b.. appreciate re-affirming my comments that T5-11b is not working for you too...
@stas00 thanks for the tips. I did try with seq_len=512 with BS=1. Then with seq_len=128 with BS=1 (both times OOM).
Please file a new Issue with a full report with config file and command line and then I'd be happy to try to diagnose this with you.
Thank you for experimenting with shorter seq_len.
@PeterAJansen do you remember which commit, or perhaps it's logged somewhere in the Issue comments? Could probably git bisect to find it.
For T5-11b on a A100-40B, I guess sticking to fp16 is the way to go since fp32 will load entire model into GPU mem? (which will surely cause OOM since raw model file itself is 45GB).
correct!
my host has 1TB RAM, so you suggest to use offload? Do you have some comments on if using offload would slow down training? (since optimizer-states/gradients has to flow back-and-forth between GPU <-> CPU)...
I don't have numbers to share yet, but the offload protocol is written to pre-fetch data, so the overhead in theory should be minimal. So absolutely yes to offload.
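For reference, a minimal sketch of what enabling ZeRO-3 offload looks like, written as a Python dict mirroring the "zero_optimization" section of the DeepSpeed config file (the stock tests/deepspeed/ds_config_zero3.json in transformers carries an equivalent block; treat the exact keys as a sketch for this era of DeepSpeed):
# Offload optimizer states and fp16 params to host RAM under ZeRO-3
# (dict form of the "zero_optimization" section of ds_config.json; other keys omitted).
zero_optimization = {
    "stage": 3,
    "offload_optimizer": {"device": "cpu", "pin_memory": True},
    "offload_param": {"device": "cpu", "pin_memory": True},
}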
@stas00 I have a feeling it might be c130e67d, or failing that something on or around February 12th 2021.
OK, I'm able to train t5-11b on a single A100-SXM4-40GB with seq len 1024 with BS=4 at about 40GB gpu mem usage with deepspeed zero2:
export BS=4; rm -r output_dir; PYTHONPATH=src USE_TF=0 CUDA_VISIBLE_DEVICES=0 deepspeed --num_gpus=1 \
examples/pytorch/translation/run_translation.py --model_name_or_path t5-11b --output_dir output_dir \
--adam_eps 1e-06 --evaluation_strategy=steps --do_train --label_smoothing 0.1 --learning_rate 3e-5 \
--logging_first_step --logging_steps 500 --max_source_length 1024 --max_target_length 1024 \
--num_train_epochs 1 --overwrite_output_dir --per_device_train_batch_size $BS \
--predict_with_generate --sortish_sampler --source_lang en --target_lang ro --dataset_name wmt16 \
--dataset_config "ro-en" --source_prefix "translate English to Romanian: " --val_max_target_length \
128 --warmup_steps 50 --max_train_samples 2000 --max_eval_samples 50 --deepspeed \
tests/deepspeed/ds_config_zero2.json --fp16
let's log for posterity (both master HEAD as of this writing)
$ cd transformers
$ git rev-parse --short HEAD
61c506349
$ cd ../deepspeed
ccc522c
surprisingly zero3 with full offload OOMs! Need to figure that one out.
Thanks to @PeterAJansen for letting me use his rig.
OK, @samyam helped me to figure out ZeRO-3 - getting a 3.5x larger BS than with zero2. The key was to lower "sub_group_size" to 1e9 from 1e14.
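For illustration, a minimal sketch of applying that single change to the stock config (path as used in the command below; everything else in the file stays untouched):
import json

# Lower sub_group_size in the stock ZeRO-3 config as described above (1e14 -> 1e9).
path = "tests/deepspeed/ds_config_zero3.json"
with open(path) as f:
    ds_config = json.load(f)

ds_config["zero_optimization"]["sub_group_size"] = 1e9
with open(path, "w") as f:
    json.dump(ds_config, f, indent=4)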
So, I'm able to train t5-11b on a single A100-SXM4-40GB with seq len 1024 with BS=14 with deepspeed ZeRO-3:
export BS=14; rm -r output_dir; PYTHONPATH=src USE_TF=0 CUDA_VISIBLE_DEVICES=0 deepspeed --num_gpus=1 \
examples/pytorch/translation/run_translation.py --model_name_or_path t5-11b --output_dir output_dir \
--adam_eps 1e-06 --evaluation_strategy=steps --do_train --label_smoothing 0.1 --learning_rate 3e-5 \
--logging_first_step --logging_steps 500 --max_source_length 1024 --max_target_length 1024 \
--num_train_epochs 1 --overwrite_output_dir --per_device_train_batch_size $BS \
--predict_with_generate --sortish_sampler --source_lang en --target_lang ro --dataset_name wmt16 \
--dataset_config "ro-en" --source_prefix "translate English to Romanian: " --val_max_target_length \
128 --warmup_steps 50 --max_train_samples 2000 --max_eval_samples 50 --deepspeed \
tests/deepspeed/ds_config_zero3.json --fp16
Everything else is the same as in the zero-2 post above, and the config file too is from transformers @ 61c506349, but ds_config_zero3.json needs to be changed as shown above.
I'd like to mention that the code above uses dynamic padding, which doesn't pad to length 1024, so the input and output are not 1024. Turning on "--pad_to_max_length True" results in OOM, unfortunately, even with a batch size as low as 1. I tried length 512 as well with batch size 1 but also got out of memory.
Is there a way to use zero stage 3 for applications where long sequences are needed (512+)?
Thank you for this report, @benathi
First I just want to validate that you're referring to the setup from my most recent comment and not the OP.
So what you're suggesting is that being able to use a largish BS was nothing but a fluke since the dataset entries happened to be quite short, correct?
Have you tried using a smaller BS?
Also do you have access to a single card only?
Yes, I refer to your most recent comment. I tried 1 GPU (using an A100, same as you) and 2 and 8.
I tried using a batch size as small as 1 for length 512 (input 512, output 512) but ran into memory issues for 1, 2, and 8 GPUs.
I suspect it is due to the memory surge during attention computation, which can be quite large for long sequences. I'm not sure what is needed to overcome this. I tried changing the bucket size in the config to no avail.
If I don't use "--pad_to_max_length True", I can run your exact script (input 1024, output 1024) just fine with 1, 2, and 8 GPUs.
Best, Ben
@benathi if the issue is in fact the long sequence length (which is plausible), then the fix I would recommend is to use deepspeed activation checkpointing. That would significantly reduce the activation memory consumption. But before going to that route, please check with seq length 32, 64, 128, 256 as well to see if you are able to run with a smaller fixed sequence length with pad_to_max_length True, and you are running into OOM only after you increase the seq_length above a certain threshold. If you are still OOMing even with a small max seq length like 32 when pad_to_max_length is True, then the issue might be something else related to that flag.
Thank you for the feedback and great suggestions, @samyam! I keep forgetting about "activation checkpointing".
Thank you @samyam. Good to hear from you! I'll look further into activation checkpointing :)
I can confirm that it runs ok with lower context length. :)
@stas00 I looked through HF documentation and my impression is that activation checkpointing is not supported out of the box. Is this correct? If so, is there any suggestion you can provide regarding how to do activation checkpointing with HF models?
It's just named gradient_checkpointing in transformers, and most models support this feature.
To enable it you need to do:
model.config.gradient_checkpointing = True
before using the model anywhere. You can see an example of it being activated here:
For example scripts there is no direct cli arg. In language-modeling scripts you can cheat by passing: --config_overrides "gradient_checkpointing=True". More details are at https://github.com/huggingface/transformers/blob/master/examples/pytorch/language-modeling/README.md#creating-a-model-on-the-fly
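To illustrate the programmatic route, a minimal sketch (the model name is just a placeholder; it sets the same gradient_checkpointing flag that the config.json edit mentioned below toggles):
from transformers import AutoConfig, AutoModelForSeq2SeqLM

# Turn on gradient (activation) checkpointing via the config flag discussed above,
# before the model is used anywhere.
config = AutoConfig.from_pretrained("t5-11b")
config.gradient_checkpointing = True
model = AutoModelForSeq2SeqLM.from_pretrained("t5-11b", config=config)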
Perhaps it's about time we exposed this flag in HF Trainer.
Yet another way to cheat if none of the above is fitting:
- clone the model locally
- edit config.json to enable gradient_checkpointing
- pass the local path to the cloned model instead of the model name
This will work with any example script.
Please let me know if you were successful. And then we will sort out how to enable it easier and document the synonyms so it's easier to search and find.
Thank you for your prompt reply!! And for your hard work on the HF library which everybody loves. :) I'll take a look at this.
Best,
Thank you for your kind words, @benathi!
I am able to run with sequence length 2000 (I made sure to pad the data to that long) with 29GB per GPU with activation checkpointing. For 1024, the GPU consumption is 12.5GB. Without activation checkpointing, I can run up to sequence length 256 and got OOM at 512.
All of this uses batch size = 1 and 8 GPUs with Zero 3.
Thank you for reporting back, @benathi - so a partial success.
Have you seen the memory usage estimators at https://deepspeed.readthedocs.io/en/stable/memory.html? It'd be great to complete the existing set with an activations memory usage estimator - that would remove most of the guesswork/needing to try, as we would know the requirements instantly.
But maybe let's start with what is there already. Could you put in the numbers for your setup and let's see how much opt/grad/params should be consuming under z3? Thank you.
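As a pointer, the page linked above documents live estimators along these lines (a hedged sketch - the exact import path may differ between DeepSpeed versions, and loading t5-11b just to count parameters is itself RAM-hungry):
from transformers import AutoModel
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live

# Prints estimated per-GPU and CPU memory needs for params/grads/optimizer states
# under ZeRO-3, with and without offload (activations are not included).
model = AutoModel.from_pretrained("t5-11b")
estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=8, num_nodes=1)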
and @sgugger is working on making it easy to activate the gradient/activation checkpointing https://github.com/huggingface/transformers/pull/13657
I would say it is a success, not even partial, since I was able to run up to sequence length 2000! :)
I did try to use the memory estimator. I think the estimator doesn't take into account the activations or batch size? (not sure) so it's a bit hard to gauge from the estimator alone.
Anyways I'm happy I can train with a relatively large context length now.
Thank you for confirming that your needs have been met, @benathi
Yes, the estimators are missing the activation component, which is crucial. But since the latter component is the same regardless of DS setup, the existing estimators at least can show you where the memory can be saved.
Actually another question for you if you don't mind :)
If I want to use an even longer sequence length, in which case sparse attention is probably necessary, is switching to GPT-Neo all it takes to do sparse attention (plus turning on sparse attention in deepspeed)? Or is there another config that allows me to do sparse attention for gpt2 as well? Not sure if there's some field in the config I can just turn on to use sparse attention :)
I'd love to answer your question, @benathi, but I haven't had a chance to experiment with this feature yet. Perhaps asking at https://discuss.huggingface.co/?
HF arsenal has several models that implement sparse attention natively: https://huggingface.co/blog/long-range-transformers
Deepspeed implements sparse attention, but I am not sure how we would plug it into HF Transformers. That is, it has this section of the config file, but I think it only works with some of their internal features. I don't know. It might be a good idea to ask at https://github.com/microsoft/DeepSpeed - I'd love to know the answer myself - and whether we could integrate that into Transformers. If you'd like to take the lead on the research I'd be happy to help integrate it. If you ask please tag me as well.
Thank you!
@stas00 I see the ds_config.json uses "auto" casting. I cannot train a 13B multilingual mT5-xxl model on 8x 40GB A100s on an aws p4d.24xlarge. I am using this config with "fp16": {"enabled": false}, as t5 is trained in bfloat16 and fp16 usually produces NaN. My sequence lengths are src_input_length=1024, target_input_length=256.
Do you have any suggestion? Should I move to fairscale for the fp16 issue?
"auto" just allows converting --fp16
to "true" if it's passed in the trainer args. You can absolutely hardcode it to what you need.
I made a possible workaround for t5/mt5 overflows which worked some and not for others, you may want to try: https://github.com/huggingface/transformers/pull/10956
Ideally, especially since you're using A100, you should train in bf16 mixed precision, the work is being done on it here: https://github.com/huggingface/transformers/pull/13207
But deepspeed doesn't yet support bf16 - perhaps it'd be beneficial to ask Deepspeed about supporting bf16 by opening a feature request at https://github.com/microsoft/DeepSpeed/issues - If you feel inspired to do so?
Should I move to fairscale for fp16 issue?
If fairscale gives a working solution then by all means use it. Does it? I just don't know the answer.
Megatron-LM released a t5 model recently but it doesn't yet support pipeline, so if tensor parallelism is sufficient to your setup it might do a trick (transformers will have TP shortly as well). You can ping them asking when PP will be added. I doubt that if nobody asks it'll happen any time soon. Their bert/gpt2 have a full dp/tp/pp support, but not yet t5.
Finally, try activating Gradient Checkpointing which should help a lot to lower memory usage: https://huggingface.co/transformers/performance.html#gradient-checkpointing
Thanks a lot @stas00 for your reply.
I have been working with your PR https://github.com/huggingface/transformers/pull/10956 until now. Just to let you know, it works fine for me. Huge thanks to you for that PR.
But as far as I remember, Deepspeed doesn't support torch.cuda.amp.autocast(enabled=False), so the ffn layer weights remain fp16 in deepspeed.
I've already tried gradient-checkpointing with fp32 training (in deepspeed) for mT5-xxl-13B but got OOM.
Maybe in the coming days I will first try fairscale to be sure, since it supports torch.cuda.amp.autocast(enabled=False).
Thanks a lot @stas00 for your reply. I have been working with your PR #10956 until now. Just to let you know, it works fine for me. Huge thanks to you for that PR.
Glad to hear that!
But as far as I remember, Deepspeed doesn't support torch.cuda.amp.autocast(enabled=False), so the ffn layer weights remain fp16 in deepspeed. I've already tried gradient-checkpointing with fp32 training (in deepspeed) for mT5-xxl-13B but got OOM.
DS uses its own mixed precision which doesn't lend itself to users overriding it. But it should be possible to add a code branch so that if the code is running under deepspeed we manually upcast to fp32 and then downcast back to fp16 for deepspeed. Let me know if you need help with that - this would require no deepspeed understanding, I believe. And I haven't tried that, so it's possible that my idea may or may not work.
Maybe in the coming days I will first try fairscale to be sure, since it supports torch.cuda.amp.autocast(enabled=False).
Do you mean the sharded DDP (ZeRO@fairscale)? Do let us know, I have no idea what is the state of that project nowadays.
@stas00 any idea about this? I keep getting overflow. I'm using version 0.5.3 of deepspeed due to torch restrictions; I can't solve this even after several attempts.
[2021-11-13 19:22:08,401] [INFO] [stage3.py:2731:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 16.0, reducing to 8.0 0%| | 14/24128 [00:54<25:52:50, 3.86s/it] [2021-11-13 19:22:12,194] [INFO] [stage3.py:2731:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 8.0, reducing to 4.0 0%| | 15/24128 [00:58<25:44:14, 3.84s/it] [2021-11-13 19:22:15,963] [INFO] [stage3.py:2731:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4.0, reducing to 2.0 0%| | 16/24128 [01:02<25:35:10, 3.82s/it] [2021-11-13 19:22:19,775] [INFO] [stage3.py:2731:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2.0, reducing to 1.0 0%| | 17/24128 [01:06<25:34:08, 3.82s/it] [2021-11-13 19:22:23,570] [INFO] [stage3.py:2731:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1.0, reducing to 1 0%| | 18/24128 [01:10<25:31:20, 3.81s/it] [2021-11-13 19:22:27,338] [INFO] [stage3.py:2731:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1 0%| | 19/24128 [01:13<25:26:08, 3.80s/it] [2021-11-13 19:22:31,100] [INFO] [stage3.py:2731:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1 0%|β | 20/24128 [01:17<25:21:41, 3.79s/it] [2021-11-13 19:22:34,909] [INFO] [stage3.py:2731:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1 0%|β | 21/24128 [01:21<25:24:20, 3.79s/it] [2021-11-13 19:22:38,715] [INFO] [stage3.py:2731:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1 0%|β | 22/24128 [01:25<25:25:39, 3.80s/it] [2021-11-13 19:22:42,709] [INFO] [stage3.py:2731:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1 0%|β | 23/24128 [01:29<25:49:22, 3.86s/it] [2021-11-13 19:22:46,705] [INFO] [stage3.py:2731:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1 0%|β | 24/24128 [01:33<26:06:45, 3.90s/it] [2021-11-13 19:22:50,537] [INFO] [stage3.py:2731:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1 0%|β | 25/24128 [01:37<25:57:46, 3.88s/it] [2021-11-13 19:22:54,437] [INFO] [stage3.py:2731:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1 0%|β | 26/24128 [01:40<26:00:36, 3.89s/it] [2021-11-13 19:22:58,333] [INFO] [stage3.py:2731:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1 0%|β | 27/24128 [01:44<26:01:38, 3.89s/it] [2021-11-13 19:23:02,162] [INFO] [stage3.py:2731:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1 0%|β | 28/24128 [01:48<25:54:33, 3.87s/it] [2021-11-13 19:23:05,991] [INFO] [stage3.py:2731:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1 0%|β | 29/24128 [01:52<25:49:28, 3.86s/it] [2021-11-13 19:23:09,884] [INFO] [stage3.py:2731:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1 0%|β | 30/24128 [01:56<25:53:38, 3.87s/it] [2021-11-13 19:23:13,776] [INFO] [stage3.py:2731:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 1, reducing to 1 0%|β | 31/24128 [02:00<25:56:27, 3.88s/it] [2021-11-13 19:23:17,659] [INFO] [stage3.py:2731:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1
This looks like an issue to report on the deepspeed side, @tuhinjubcse. https://github.com/microsoft/DeepSpeed/issues
Managed to train t5-11b on 1x 40GB gpu w/ Deepspeed (A100-SXM4-40GB)
Thank you, @PeterAJansen for letting me use your hardware!
Thank you, @jeffra and @samyam, for not believing that it is not possible to train t5-11b on 1x 40GB gpu w/ Deepspeed, and for supporting me - that led me to find a few bugs in the integration.
Sharing details for those who need them.
If you want to try this at home please make sure you use transformers master, as some bug fixes were just merged in.
Well, it's similar to the t5-3b on 24GB success reported here and here. But this time it's t5-11b on 1x 40GB gpu (or 4x if you want things faster).
As someone asked me before, you need a huge amount of general RAM to use ZeRO-Offload for a huge model:
I was using /usr/bin/time -v program to get the peak memory measurement - it's the Maximum resident set size entry in the final report.
Question: I don't think /usr/bin/time does the right thing for multi-process - I think it only measures the parent process. e.g. with 4x gpus it reported only 102GB RAM, but I clearly saw in top that it was around 240GB. If you have an easy way to measure peak memory that takes into account forked processes I'm all ears.
Batch sizes on one gpu:
I'm referring to these batch sizes in ds_config.json:
And I tested for 2x and 4x DDP as well; BS=16 OOMed, BS=8 was good so I used that - but could probably squeeze some more.
edit1: later tests show that my test was too short and wasn't letting the CPU Adam optimizer kick in, as it skips the first 20 or so steps because of the overflow. Once it kicks in it takes more GPU memory, so the practical BS is much smaller - I think around 2 on this setup. So most likely you will need to use BS=2 for real work, until things get optimized even more.
edit2: things are getting re-shuffled in the tests, so the default ds_config.json file has moved in master to a new, hopefully permanent home. It's now at examples/tests/deepspeed/ds_config.json, so you will need to adjust the command line to reflect this new location or simply copy it over to where the old one used to be.
Here is the full benchmark:
Checkpointing should allow making even bigger batch sizes.