microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

Error while saving T5-11B checkpoint #1574

Open tuhinjubcse opened 2 years ago

tuhinjubcse commented 2 years ago

Getting this error, which I honestly don't understand (the same traceback is printed by each of the three ranks; deduplicated below):

[INFO|trainer.py:1995] 2021-11-19 01:06:24,979 >> Saving model checkpoint to /local/nlp/temp/poetryT5-11B_new/checkpoint-21
[INFO|configuration_utils.py:417] 2021-11-19 01:06:24,980 >> Configuration saved in /local/nlp/temp/poetryT5-11B_new/checkpoint-21/config.json
[INFO|modeling_utils.py:1058] 2021-11-19 01:07:05,343 >> Model weights saved in /local/nlp/temp/poetryT5-11B_new/checkpoint-21/pytorch_model.bin
[INFO|tokenization_utils_base.py:2034] 2021-11-19 01:07:05,345 >> tokenizer config file saved in /local/nlp/temp/poetryT5-11B_new/checkpoint-21/tokenizer_config.json
[INFO|tokenization_utils_base.py:2040] 2021-11-19 01:07:05,345 >> Special tokens file saved in /local/nlp/temp/poetryT5-11B_new/checkpoint-21/special_tokens_map.json
[INFO|tokenization_t5_fast.py:159] 2021-11-19 01:07:05,380 >> Copy vocab file to /local/nlp/temp/poetryT5-11B_new/checkpoint-21/spiece.model
[2021-11-19 01:07:05,399] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: /local/nlp/temp/poetryT5-11B_new/checkpoint-21/global_step21/mp_rank_00_model_states.pt
Traceback (most recent call last):
  File "./finetune_trainer.py", line 368, in <module>
    main()
  File "./finetune_trainer.py", line 305, in main
    train_result = trainer.train(
  File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/transformers/trainer.py", line 1391, in train
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/transformers/trainer.py", line 1495, in _maybe_log_save_evaluate
    self._save_checkpoint(model, trial, metrics=metrics)
  File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/transformers/trainer.py", line 1561, in _save_checkpoint
    self.deepspeed.save_checkpoint(output_dir)
  File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 2304, in save_checkpoint
    self._save_zero_checkpoint(save_dir, tag)
  File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 2556, in _save_zero_checkpoint
    zero_sd = dict(optimizer_state_dict=self.optimizer.state_dict(),
  File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/deepspeed/runtime/zero/stage2.py", line 1962, in state_dict
    state_dict['base_optimizer_state'] = self._get_base_optimizer_state()
  File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/deepspeed/runtime/zero/stage2.py", line 1940, in _get_base_optimizer_state
    lean_optimizer_state = self._get_state_without_padding(
  File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/deepspeed/runtime/zero/stage2.py", line 1928, in _get_state_without_padding
    lean_state[key] = value[:lean_length]
IndexError: slice() cannot be applied to a 0-dim tensor.
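For context on the exception itself: ZeRO stage 2's _get_state_without_padding slices every optimizer-state value, and PyTorch refuses to slice a scalar (0-dim) tensor, so presumably one of the wrapped optimizer's state entries is a scalar rather than a padded 1-D tensor (possibly related to --adafactor in the repro script below). A minimal, self-contained reproduction of just that error, for illustration only:

import torch

# Slicing a 0-dim tensor raises the same IndexError seen in the traceback above.
scalar_state = torch.tensor(1.0)  # stands in for a scalar optimizer-state entry
try:
    _ = scalar_state[:10]
except IndexError as err:
    print(err)  # slice() cannot be applied to a 0-dim tensor.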
tjruwase commented 2 years ago

@tuhinjubcse, thanks for reporting this error. Can you please share how to repro on our side? Thanks!

tuhinjubcse commented 2 years ago

My script from the transformers repo:

export BS=8
PYTHONPATH=../../src USE_TF=0 deepspeed --num_gpus=3 ./finetune_trainer.py \
 --data_dir /home/tuhin.chakr/gpt3/poetrynew \
 --output_dir /local/nlp/temp/poetryT5-11B_new \
 --model_name_or_path t5-11b \
 --do_train \
 --task translation \
 --max_source_length 128 \
 --max_target_length 128 \
 --save_strategy=epoch \
 --num_train_epochs 1 \
 --per_device_train_batch_size $BS \
 --adafactor \
 --learning_rate 1e-3 \
 --deepspeed /home/tuhin.chakr/gpt3/transformers/tests/deepspeed/ds_config_zero2.json \
 --fp16

My DeepSpeed config:

{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },

    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },

    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "allgather_partitions": true,
        "allgather_bucket_size": 2e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 2e8,
        "contiguous_gradients": true
    },
    "train_batch_size": 24,
    "train_micro_batch_size_per_gpu": 8,
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "wall_clock_breakdown": false
}
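As a side note on the fixed batch-size fields above: DeepSpeed checks that train_batch_size equals train_micro_batch_size_per_gpu times gradient accumulation times world size. A quick sketch of that arithmetic for the command above, assuming gradient_accumulation_steps of 1 since the script does not set it:

# DeepSpeed's consistency requirement:
#   train_batch_size == micro_batch_per_gpu * grad_accum_steps * world_size
micro_batch_per_gpu = 8  # --per_device_train_batch_size $BS / "train_micro_batch_size_per_gpu"
grad_accum_steps = 1     # assumed; not set explicitly in the script
world_size = 3           # deepspeed --num_gpus=3
print(micro_batch_per_gpu * grad_accum_steps * world_size)  # 24, matching "train_batch_size": 24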

Name: deepspeed Version: 0.5.6

Name: torch Version: 1.10.0

Name: transformers Version: 4.12.2

jeffra commented 2 years ago

I suspect this pending PR might fix this issue; can you give it a try? There's one fix that needs to be applied before we can merge, but I believe that should be unrelated to your issue.

https://github.com/microsoft/DeepSpeed/pull/1525

tuhinjubcse commented 2 years ago

@jeffra Do you know what I should do exactly? Do I need to make any changes in the DeepSpeed code?

jeffra commented 2 years ago

To give it a try, you should be able to reinstall DeepSpeed, but specifically from this branch: https://github.com/microsoft/DeepSpeed/tree/zero-ckpt-cpu-issue

You shouldn’t need any code changes on your side.

You should also be able to pip install this version via: pip install git+https://github.com/microsoft/deepspeed.git@zero-ckpt-cpu-issue
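(To confirm the branch build is the one actually being imported, a quick check, assuming a standard pip environment:)

import deepspeed

# Git installs typically report a dev/commit-suffixed version rather than the released 0.5.6.
print(deepspeed.__version__)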

tuhinjubcse commented 2 years ago

Many thanks @jeffra. This worked.

I have one small question: my LR was set in my script as 1e-3.

json = {
    "fp16": {
        "enabled": true, 
        "loss_scale": 0, 
        "loss_scale_window": 1000, 
        "initial_scale_power": 16, 
        "hysteresis": 2, 
        "min_loss_scale": 1
    }, 
    "scheduler": {
        "type": "WarmupLR", 
        "params": {
            "warmup_min_lr": 0, 
            "warmup_max_lr": 0.001, 
            "warmup_num_steps": 0
        }
    }, 
    "zero_optimization": {
        "stage": 2, 
        "offload_optimizer": {
            "device": "cpu", 
            "pin_memory": true
        }, 
        "allgather_partitions": true, 
        "allgather_bucket_size": 2.000000e+08, 
        "overlap_comm": true, 
        "reduce_scatter": true, 
        "reduce_bucket_size": 2.000000e+08, 
        "contiguous_gradients": true
    }, 
    "train_batch_size": 32, 
    "train_micro_batch_size_per_gpu": 8, 
    "gradient_clipping": 1.0, 
    "steps_per_print": 2.000000e+03, 
    "wall_clock_breakdown": false, 
    "zero_allow_untested_optimizer": true
}

When my training loss is printed, it shows learning_rate as 0.0. Do you know why? Is this because of WarmupLR?

{'loss': 6.0737, 'learning_rate': 0.0, 'epoch': 0.02}
{'loss': 0.1926, 'learning_rate': 0.0, 'epoch': 0.04}

Ririkoo commented 2 years ago

@tuhinjubcse, the same problem happened when I was fine-tuning the T5-3B model using Hugging Face. I tried adjusting the hyper-parameters, including max_lr, min_lr, and weight decay, but the trainer still reported that the learning_rate is 0.0.

Environment: transformers==4.12.3, deepspeed==0.5.7

tuhinjubcse commented 2 years ago

  warnings.warn(formatted_warning, FutureWarning)
{'loss': 6.0737, 'learning_rate': 0.0, 'epoch': 0.02}
{'loss': 0.1926, 'learning_rate': 0.0, 'epoch': 0.04}
{'loss': 0.0399, 'learning_rate': 0.0, 'epoch': 0.06}
  8% 1999/24128 [1:52:11<20:35:01, 3.35s/it]
[2021-11-22 19:51:55,198] [INFO] [logging.py:69:log_dist] [Rank 0] step=2000, skipped=1999, lr=[0.0, 0.0], mom=[0.0, 0.0]
[2021-11-22 19:51:55,199] [INFO] [timer.py:181:stop] 0/2000, SamplesPerSec=9.546767962244255
{'loss': 0.0749, 'learning_rate': 0.0, 'epoch': 0.08}
{'loss': 0.408, 'learning_rate': 0.0, 'epoch': 0.1}
{'loss': 0.0354, 'learning_rate': 0.0, 'epoch': 0.12}
{'loss': 0.0341, 'learning_rate': 0.0, 'epoch': 0.15}
 17% 3999/24128 [3:43:57<18:47:06, 3.36s/it]
[2021-11-22 21:43:41,103] [INFO] [logging.py:69:log_dist] [Rank 0] step=4000, skipped=3999, lr=[0.0, 0.0], mom=[0.0, 0.0]
[2021-11-22 21:43:41,103] [INFO] [timer.py:181:stop] 0/4000, SamplesPerSec=9.564911481857864
{'loss': 0.0316, 'learning_rate': 0.0, 'epoch': 0.17}
{'loss': 0.0802, 'learning_rate': 0.0, 'epoch': 0.19}
{'loss': 0.035, 'learning_rate': 0.0, 'epoch': 0.21}
{'loss': 0.1423, 'learning_rate': 0.0, 'epoch': 0.23}
 25% 5999/24128 [5:35:43<16:52:01, 3.35s/it]
[2021-11-22 23:35:26,678] [INFO] [logging.py:69:log_dist] [Rank 0] step=6000, skipped=5999, lr=[0.0, 0.0], mom=[0.0, 0.0]
[2021-11-22 23:35:26,678] [INFO] [timer.py:181:stop] 0/6000, SamplesPerSec=9.571203445125207
{'loss': 0.1107, 'learning_rate': 0.0, 'epoch': 0.25}
{'loss': 0.0467, 'learning_rate': 0.0, 'epoch': 0.27}
{'loss': 0.0802, 'learning_rate': 0.0, 'epoch': 0.29}
{'loss': 0.0706, 'learning_rate': 0.0, 'epoch': 0.31}
 33% 7999/24128 [7:27:26<15:00:20, 3.35s/it]
[2021-11-23 01:27:10,465] [INFO] [logging.py:69:log_dist] [Rank 0] step=8000, skipped=7999, lr=[0.0, 0.0], mom=[0.0, 0.0]
[2021-11-23 01:27:10,465] [INFO] [timer.py:181:stop] 0/8000, SamplesPerSec=9.574953735862689
{'loss': 0.22, 'learning_rate': 0.0, 'epoch': 0.33}
{'loss': 0.0967, 'learning_rate': 0.0, 'epoch': 0.35}
{'loss': 0.0716, 'learning_rate': 0.0, 'epoch': 0.37}
{'loss': 0.1111, 'learning_rate': 0.0, 'epoch': 0.39}
 41% 9999/24128 [9:19:10<13:10:15, 3.36s/it]
[2021-11-23 03:18:53,863] [INFO] [logging.py:69:log_dist] [Rank 0] step=10000, skipped=9999, lr=[0.0, 0.0], mom=[0.0, 0.0]
[2021-11-23 03:18:53,863] [INFO] [timer.py:181:stop] 0/10000, SamplesPerSec=9.577305314814142
{'loss': 0.2233, 'learning_rate': 0.0, 'epoch': 0.41}
 43% 10397/24128 [9:41:24<12:47:24, 3.35s/it]
Traceback (most recent call last):
  File "./finetune_trainer.py", line 368, in <module>
    main()
  File "./finetune_trainer.py", line 305, in main
    train_result = trainer.train(
  File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/transformers/trainer.py", line 1316, in train
    tr_loss_step = self.training_step(model, inputs)
  File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/transformers/trainer.py", line 1865, in training_step
    loss = self.deepspeed.backward(loss)
  File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1708, in backward
    self.optimizer.backward(loss)
  File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/deepspeed/runtime/zero/stage2.py", line 1880, in backward
    buf_1 = torch.empty(int(self.reduce_bucket_size),
RuntimeError: CUDA out of memory. Tried to allocate 382.00 MiB (GPU 1; 39.59 GiB total capacity; 36.01 GiB already allocated; 164.94 MiB free; 36.22 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Also receiving OOM; I wonder what I can do?

jeffra commented 2 years ago

Hi @tuhinjubcse, I see you've been working with the excellent @stas00 on some of these issues. I finished reading up on the latest with you two in this issue https://github.com/huggingface/transformers/issues/14531.

As Stas mentioned, once this DeepSpeed PR https://github.com/microsoft/DeepSpeed/pull/1453 is merged you should be able to run ZeRO stage 3 with BF16 support, which should help reduce memory and potentially improve throughput. If you want to give it a try before it's merged, you can check out and install the branch via this command: pip install git+https://github.com/jfc4050/DeepSpeed.git@s3-pr
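A side note on the learning_rate: 0.0 lines above: the DeepSpeed log reports skipped=1999 at step=2000 (and similar at every interval), which with the fp16 engine usually means the dynamic loss scaler saw overflowing gradients and skipped the optimizer step; that would also be consistent with the reported learning rate sitting at 0.0, since the schedule never effectively advances. BF16, as suggested above, sidesteps loss scaling entirely. A rough sketch of dynamic loss scaling, for illustration only (not DeepSpeed's actual implementation):

import math

def loss_scale_update(grads, scale, good_steps, scale_window=1000,
                      scale_factor=2.0, min_scale=1.0):
    """Illustrative dynamic loss scaling (not DeepSpeed's code).

    Gradients were computed on loss * scale. If any are inf/NaN, the step is
    skipped and the scale is reduced; after scale_window clean steps in a row,
    the scale is raised again. Returns (take_step, new_scale, new_good_steps).
    """
    overflow = any(math.isinf(g) or math.isnan(g) for g in grads)
    if overflow:
        return False, max(scale / scale_factor, min_scale), 0
    good_steps += 1
    if good_steps % scale_window == 0:
        scale *= scale_factor
    return True, scale, good_steps

# Example: an overflowing gradient forces a skipped step and a smaller scale.
print(loss_scale_update([float("inf"), 0.5], scale=65536.0, good_steps=10))
# (False, 32768.0, 0)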