tuhinjubcse opened this issue 2 years ago
@tuhinjubcse, thanks for reporting this error. Can you please share how to repro on our side? Thanks!
My script, from the transformers repo (the env vars need to be on the same line as the command, or exported, to actually reach the launcher):

export BS=8
PYTHONPATH=../../src USE_TF=0 deepspeed --num_gpus=3 ./finetune_trainer.py \
--data_dir /home/tuhin.chakr/gpt3/poetrynew \
--output_dir /local/nlp/temp/poetryT5-11B_new \
--model_name_or_path t5-11b \
--do_train \
--task translation \
--max_source_length 128 \
--max_target_length 128 \
--save_strategy=epoch \
--num_train_epochs 1 \
--per_device_train_batch_size $BS \
--adafactor \
--learning_rate 1e-3 \
--deepspeed /home/tuhin.chakr/gpt3/transformers/tests/deepspeed/ds_config_zero2.json \
--fp16
My deepspeed config:

{
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "scheduler": {
    "type": "WarmupLR",
    "params": {
      "warmup_min_lr": "auto",
      "warmup_max_lr": "auto",
      "warmup_num_steps": "auto"
    }
  },
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "allgather_partitions": true,
    "allgather_bucket_size": 2e8,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 2e8,
    "contiguous_gradients": true
  },
  "train_batch_size": 24,
  "train_micro_batch_size_per_gpu": 8,
  "gradient_clipping": "auto",
  "steps_per_print": 2000,
  "wall_clock_breakdown": false
}
Name: deepspeed Version: 0.5.6
Name: torch Version: 1.10.0
Name: transformers Version: 4.12.2
I suspect this pending PR might fix this issue; can you give it a try? There's one fix that needs to be applied before we can merge, but I believe that should be unrelated to your issue.
@jeffra Do you know what exactly I should do? Do I need to make any changes in the deepspeed code?
To give it a try you should be able to reinstall deepspeed but specifically from this branch: https://github.com/microsoft/DeepSpeed/tree/zero-ckpt-cpu-issue
You shouldn’t need any code changes on your side.
You should also be able to pip install this version via: pip install git+https://github.com/microsoft/deepspeed.git@zero-ckpt-cpu-issue
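If it helps to confirm the reinstall took effect, a quick check (assuming a standard pip environment):

```python
import deepspeed

# After installing from the branch, this should no longer report the
# plain 0.5.6 release string.
print(deepspeed.__version__)
```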
Many thanks @jeffra. This worked.
I have one small question: my LR was set in my script as 1e-3.
json = {
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "scheduler": {
    "type": "WarmupLR",
    "params": {
      "warmup_min_lr": 0,
      "warmup_max_lr": 0.001,
      "warmup_num_steps": 0
    }
  },
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "allgather_partitions": true,
    "allgather_bucket_size": 2.000000e+08,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 2.000000e+08,
    "contiguous_gradients": true
  },
  "train_batch_size": 32,
  "train_micro_batch_size_per_gpu": 8,
  "gradient_clipping": 1.0,
  "steps_per_print": 2.000000e+03,
  "wall_clock_breakdown": false,
  "zero_allow_untested_optimizer": true
}
When my training loss is printed, it shows the learning_rate as 0.0. Do you know why? Is this because of WarmupLR?
{'loss': 6.0737, 'learning_rate': 0.0, 'epoch': 0.02}
{'loss': 0.1926, 'learning_rate': 0.0, 'epoch': 0.04}
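For what it's worth, here is a rough sketch of what WarmupLR does with the resolved values above (the function is my illustration, not DeepSpeed's code; a linear ramp is shown, while DeepSpeed also supports a log-space one):

```python
def warmup_lr(step, warmup_min_lr=0.0, warmup_max_lr=1e-3, warmup_num_steps=0):
    """Ramp from warmup_min_lr to warmup_max_lr over warmup_num_steps, then hold."""
    if step < warmup_num_steps:
        return warmup_min_lr + (warmup_max_lr - warmup_min_lr) * step / warmup_num_steps
    return warmup_max_lr

# With warmup_num_steps resolved to 0, the schedule is already at its max
# from the very first step:
print(warmup_lr(0))  # 0.001
```

So the schedule alone would report 0.001 from the first step; a constant 0.0 seems more consistent with the scheduler never advancing at all (the scheduler only steps when the optimizer actually steps; note the later logs where skipped= tracks step=).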
@tuhinjubcse, the same problem happened when I was fine-tuning the T5-3B model using huggingface. I tried adjusting the hyper-parameters, including max_lr, min_lr, and weight decay, but the trainer still reported that the learning_rate is 0.0.
Environment: transformers==4.12.3, deepspeed==0.5.7
warnings.warn(formatted_warning, FutureWarning)
{'loss': 6.0737, 'learning_rate': 0.0, 'epoch': 0.02}
{'loss': 0.1926, 'learning_rate': 0.0, 'epoch': 0.04}
{'loss': 0.0399, 'learning_rate': 0.0, 'epoch': 0.06}
8%|█████████████ | 1999/24128 [1:52:11<20:35:01, 3.35s/it]
[2021-11-22 19:51:55,198] [INFO] [logging.py:69:log_dist] [Rank 0] step=2000, skipped=1999, lr=[0.0, 0.0], mom=[0.0, 0.0]
[2021-11-22 19:51:55,199] [INFO] [timer.py:181:stop] 0/2000, SamplesPerSec=9.546767962244255
{'loss': 0.0749, 'learning_rate': 0.0, 'epoch': 0.08}
{'loss': 0.408, 'learning_rate': 0.0, 'epoch': 0.1}
{'loss': 0.0354, 'learning_rate': 0.0, 'epoch': 0.12}
{'loss': 0.0341, 'learning_rate': 0.0, 'epoch': 0.15}
17%|██████████████████████████ | 3999/24128 [3:43:57<18:47:06, 3.36s/it]
[2021-11-22 21:43:41,103] [INFO] [logging.py:69:log_dist] [Rank 0] step=4000, skipped=3999, lr=[0.0, 0.0], mom=[0.0, 0.0]
[2021-11-22 21:43:41,103] [INFO] [timer.py:181:stop] 0/4000, SamplesPerSec=9.564911481857864
{'loss': 0.0316, 'learning_rate': 0.0, 'epoch': 0.17}
{'loss': 0.0802, 'learning_rate': 0.0, 'epoch': 0.19}
{'loss': 0.035, 'learning_rate': 0.0, 'epoch': 0.21}
{'loss': 0.1423, 'learning_rate': 0.0, 'epoch': 0.23}
25%|███████████████████████████████████████ | 5999/24128 [5:35:43<16:52:01, 3.35s/it]
[2021-11-22 23:35:26,678] [INFO] [logging.py:69:log_dist] [Rank 0] step=6000, skipped=5999, lr=[0.0, 0.0], mom=[0.0, 0.0]
[2021-11-22 23:35:26,678] [INFO] [timer.py:181:stop] 0/6000, SamplesPerSec=9.571203445125207
{'loss': 0.1107, 'learning_rate': 0.0, 'epoch': 0.25}
{'loss': 0.0467, 'learning_rate': 0.0, 'epoch': 0.27}
{'loss': 0.0802, 'learning_rate': 0.0, 'epoch': 0.29}
{'loss': 0.0706, 'learning_rate': 0.0, 'epoch': 0.31}
33%|████████████████████████████████████████████████████ | 7999/24128 [7:27:26<15:00:20, 3.35s/it]
[2021-11-23 01:27:10,465] [INFO] [logging.py:69:log_dist] [Rank 0] step=8000, skipped=7999, lr=[0.0, 0.0], mom=[0.0, 0.0]
[2021-11-23 01:27:10,465] [INFO] [timer.py:181:stop] 0/8000, SamplesPerSec=9.574953735862689
{'loss': 0.22, 'learning_rate': 0.0, 'epoch': 0.33}
{'loss': 0.0967, 'learning_rate': 0.0, 'epoch': 0.35}
{'loss': 0.0716, 'learning_rate': 0.0, 'epoch': 0.37}
{'loss': 0.1111, 'learning_rate': 0.0, 'epoch': 0.39}
41%|█████████████████████████████████████████████████████████████████ | 9999/24128 [9:19:10<13:10:15, 3.36s/it]
[2021-11-23 03:18:53,863] [INFO] [logging.py:69:log_dist] [Rank 0] step=10000, skipped=9999, lr=[0.0, 0.0], mom=[0.0, 0.0]
[2021-11-23 03:18:53,863] [INFO] [timer.py:181:stop] 0/10000, SamplesPerSec=9.577305314814142
{'loss': 0.2233, 'learning_rate': 0.0, 'epoch': 0.41}
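An observation on the logs above: skipped=1999 at step=2000 (and likewise at 4000, 6000, ...) means essentially every optimizer step hit an fp16 gradient overflow and was skipped, which also keeps the lr scheduler from ever advancing past 0.0. A simplified sketch of that skip logic (my illustration, not DeepSpeed's actual implementation):

```python
import torch

def fp16_step(optimizer, lr_scheduler, grads, loss_scale, min_scale=1.0):
    """Simplified dynamic loss scaling: on inf/nan gradients, shrink the
    scale and skip the whole step, leaving weights and lr untouched."""
    overflow = any(not torch.isfinite(g).all() for g in grads)
    if overflow:
        return max(loss_scale / 2.0, min_scale), True  # step skipped
    for g in grads:
        g.div_(loss_scale)  # unscale gradients before the real update
    optimizer.step()
    lr_scheduler.step()
    return loss_scale, False
```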
43%|███████████████████████████████████████████████████████████████████▏ | 10397/24128 [9:41:24<12:47:24, 3.35s/it]
Traceback (most recent call last):
File "./finetune_trainer.py", line 368, in <module>
main()
File "./finetune_trainer.py", line 305, in main
train_result = trainer.train(
File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/transformers/trainer.py", line 1316, in train
tr_loss_step = self.training_step(model, inputs)
File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/transformers/trainer.py", line 1865, in training_step
loss = self.deepspeed.backward(loss)
File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1708, in backward
self.optimizer.backward(loss)
File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/deepspeed/runtime/zero/stage2.py", line 1880, in backward
buf_1 = torch.empty(int(self.reduce_bucket_size),
RuntimeError: CUDA out of memory. Tried to allocate 382.00 MiB (GPU 1; 39.59 GiB total capacity; 36.01 GiB already allocated; 164.94 MiB free; 36.22 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
I'm also receiving an OOM; any idea what I can do?
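Not a verified fix, but since the allocation that fails is the ZeRO reduce bucket (buf_1 is sized by reduce_bucket_size), one common mitigation is to shrink the bucket sizes in the config, trading some communication efficiency for a smaller peak allocation, and/or to lower train_micro_batch_size_per_gpu. Illustrative values only:

```python
import json

# Hypothetical overrides for the ds_config above; smaller buckets mean
# smaller transient buffers at the cost of more communication rounds.
overrides = {
    "zero_optimization": {
        "allgather_bucket_size": 5e7,  # was 2e8
        "reduce_bucket_size": 5e7,     # was 2e8 (the failing buffer above)
    },
}
print(json.dumps(overrides, indent=2))
```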
Hi @tuhinjubcse, I see you've been working with the excellent @stas00 on some of these issues. I've finished reading up on the latest from you two in this issue: https://github.com/huggingface/transformers/issues/14531.
As Stas mentioned, once this DeepSpeed PR https://github.com/microsoft/DeepSpeed/pull/1453 is merged, you should be able to run ZeRO stage 3 with BF16 support, which should help reduce memory and potentially improve throughput. If you want to give it a try before it's merged, you can check out and install the branch via this command: pip install git+https://github.com/jfc4050/DeepSpeed.git@s3-pr
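For reference, a sketch of what such a ZeRO-3 + BF16 config might look like once that support lands (the bf16 section is my assumption based on current DeepSpeed docs; values are illustrative):

```python
import json

# Hypothetical ZeRO-3 + BF16 config in the direction described above.
ds_config = {
    # bf16 has fp32's dynamic range, so no loss scaling and no skipped steps:
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,  # additionally partitions the parameters themselves
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
    "train_batch_size": 24,
    "train_micro_batch_size_per_gpu": 8,
    "gradient_clipping": 1.0,
}
with open("ds_config_zero3_bf16.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```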
Getting this error, which I honestly don't understand.