microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] model.load_checkpoint out of memory #3938

Open · jiangix-paper opened this issue 1 year ago

jiangix-paper commented 1 year ago

System Info

accelerate 0.20.3
python 3.10
numpy 1.24.3
torch 2.0.1
accelerate config:
compute_environment: LOCAL_MACHINE
deepspeed_config:
  deepspeed_multinode_launcher: standard
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
dynamo_config: {}
fsdp_config: {}
main_training_function: main
megatron_lm_config: {}
mixed_precision: bf16
num_machines: 1
num_processes: 8
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Reproduction

I pretrain a 27B model from scratch with DeepSpeed stage 3 (no CPU offload) on 8x 80GB A100s, with a per-GPU batch size of 2. I use `accelerator.save_state()` to save the optimizer/LR scheduler. The saved files are shown in the attached screenshot.

When I want to resume from the above saved checkpoint, I use the following code:

model, train_dataloader, optimizer, lr_scheduler = accelerator.prepare(
    model, train_dataloader, optimizer, lr_scheduler
)
accelerator.load_state('/pretrained_model/xxx')

But I get a CUDA out-of-memory error. Can you help me? Thanks a lot.
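
For completeness, the saving side that produces the checkpoint above is roughly the following (a minimal sketch assuming the Accelerate + DeepSpeed ZeRO-3 setup from the config; model, data, and the checkpoint directory are placeholders):

from accelerate import Accelerator

accelerator = Accelerator()  # DeepSpeed ZeRO-3 settings come from `accelerate config`
model, train_dataloader, optimizer, lr_scheduler = accelerator.prepare(
    model, train_dataloader, optimizer, lr_scheduler
)

# ... training steps ...

# Writes the sharded ZeRO checkpoint (model, optimizer, LR scheduler, RNG states)
accelerator.save_state('/pretrained_model/xxx')

The resume run then calls accelerator.prepare() on freshly built objects before accelerator.load_state(), as shown above.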

Expected behavior

I expected resuming from the saved checkpoint not to cause CUDA out of memory.

jiangix-paper commented 1 year ago

Hello, the related link is as follows: https://github.com/huggingface/accelerate/issues/1707

tjruwase commented 1 year ago

@jiangix-paper, can you please share your stack trace?

pacman100 commented 1 year ago

Hello @tjruwase, another user has also experienced this. It is related to CPU RAM going OOM when loading the checkpoint. Their stack trace is shared here: https://github.com/huggingface/transformers/issues/25027#issuecomment-1648802886

Neo9061 commented 1 year ago

Thanks @pacman100 for pointing to the source. Hi @tjruwase and team, to provide more details: I am able to train Llama-2 7B on an EC2 g5.12xlarge instance successfully up until the load-best-model stage.

For my purposes, I only need the checkpoint for evaluation/inference rather than for resuming a stopped training run. In that sense, can I skip loading the optimizer states, since that is where I see it go OOM?
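
What I have in mind is roughly the following (a rough sketch using DeepSpeed's zero_to_fp32 utilities; the checkpoint path is a placeholder and `model` is a freshly constructed, un-sharded model used only for evaluation):

from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

# Rebuild a full fp32 state dict on CPU from the sharded ZeRO-3 checkpoint
# (the directory that contains the global_stepXX folder). No optimizer state
# is placed on the GPU.
state_dict = get_fp32_state_dict_from_zero_checkpoint("/opt/ml/model/checkpoint-10")

# Load the consolidated weights into the evaluation model.
model.load_state_dict(state_dict)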

I observe it going OOM, with the logs below.

[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   bfloat16_enabled ............. True
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   checkpoint_parallel_write_pipeline  False
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   checkpoint_tag_validation_enabled  True
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   checkpoint_tag_validation_fail  False
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f9090172bf0>
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   communication_data_type ...... None
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   curriculum_enabled_legacy .... False
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   curriculum_params_legacy ..... False
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   data_efficiency_enabled ...... False
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   dataloader_drop_last ......... False
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   disable_allgather ............ False
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   dump_state ................... False
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   dynamic_loss_scale_args ...... None
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   eigenvalue_enabled ........... False
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   eigenvalue_gas_boundary_resolution  1
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   eigenvalue_layer_name ........ bert.encoder.layer
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   eigenvalue_layer_num ......... 0
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   eigenvalue_max_iter .......... 100
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   eigenvalue_stability ......... 1e-06
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   eigenvalue_tol ............... 0.01
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   eigenvalue_verbose ........... False
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   elasticity_enabled ........... False
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   flops_profiler_config ........ {
    "enabled": false, 
    "recompute_fwd_factor": 0.0, 
    "profile_step": 1, 
    "module_depth": -1, 
    "top_modules": 1, 
    "detailed": true, 
    "output_file": null
}
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   fp16_auto_cast ............... None
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   fp16_enabled ................. False
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   fp16_master_weights_and_gradients  False
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   global_rank .................. 0
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   grad_accum_dtype ............. None
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   gradient_accumulation_steps .. 2
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   gradient_clipping ............ 1.0
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   gradient_predivide_factor .... 1.0
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   initial_dynamic_scale ........ 1
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   load_universal_checkpoint .... False
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   loss_scale ................... 1.0
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   memory_breakdown ............. False
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   mics_hierarchial_params_gather  False
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   mics_shard_size .............. -1
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print]   nebula_config ................ {
    "enabled": false, 
    "persistent_storage_path": null, 
    "persistent_time_interval": 100, 
    "num_of_version_in_retention": 2, 
    "enable_nebula_load": true, 
    "load_path": null
}
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print]   optimizer_legacy_fusion ...... False
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print]   optimizer_name ............... adamw
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print]   optimizer_params ............. {'lr': 6e-06, 'betas': [0.9, 0.999], 'eps': 1e-08, 'weight_decay': 0.2}
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print]   pld_enabled .................. False
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print]   pld_params ................... False
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print]   prescale_gradients ........... False
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print]   scheduler_name ............... WarmupLR
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print]   scheduler_params ............. {'warmup_min_lr': 0, 'warmup_max_lr': 6e-06, 'warmup_num_steps': 2}
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print]   sparse_attention ............. None
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print]   sparse_gradients_enabled ..... False
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print]   steps_per_print .............. inf
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print]   train_batch_size ............. 16
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print]   train_micro_batch_size_per_gpu  2
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print]   use_node_local_storage ....... False
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print]   wall_clock_breakdown ......... False
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print]   world_size ................... 4
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print]   zero_allow_untested_optimizer  False
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print]   zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=16777216 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='cpu', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=False) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='cpu', nvme_path=None, buffer_count=4, pin_memory=False, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False) sub_group_size=1000000000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=15099494 param_persistence_threshold=40960 model_persistence_threshold=sys.maxsize max_live_parameters=1000000000 max_reuse_distance=1000000000 gather_16bit_weights_on_model_save=True ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print]   zero_enabled ................. True
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print]   zero_force_ds_cpu_optimizer .. True
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print]   zero_optimization_stage ...... 3
[2023-07-25 00:30:36,503] [INFO] [config.py:950:print_user_config]   json = {
    "fp16": {
        "enabled": false, 
        "loss_scale": 0, 
        "loss_scale_window": 1000, 
        "initial_scale_power": 12, 
        "hysteresis": 2, 
        "min_loss_scale": 1
    }, 
    "bf16": {
        "enabled": true
    }, 
    "optimizer": {
        "type": "AdamW", 
        "params": {
            "lr": 6e-06, 
            "betas": [0.9, 0.999], 
            "eps": 1e-08, 
            "weight_decay": 0.2
        }
    }, 
    "scheduler": {
        "type": "WarmupLR", 
        "params": {
            "warmup_min_lr": 0, 
            "warmup_max_lr": 6e-06, 
            "warmup_num_steps": 2
        }
    }, 
    "zero_optimization": {
        "stage": 3, 
        "offload_optimizer": {
            "device": "cpu", 
            "pin_memory": false
        }, 
        "offload_param": {
            "device": "cpu", 
            "pin_memory": false
        }, 
        "overlap_comm": true, 
        "contiguous_gradients": true, 
        "sub_group_size": 1.000000e+09, 
        "reduce_bucket_size": 1.677722e+07, 
        "stage3_prefetch_bucket_size": 1.509949e+07, 
        "stage3_param_persistence_threshold": 4.096000e+04, 
        "stage3_max_live_parameters": 1.000000e+09, 
        "stage3_max_reuse_distance": 1.000000e+09, 
        "stage3_gather_fp16_weights_on_model_save": true
    }, 
    "gradient_accumulation_steps": 2, 
    "gradient_clipping": 1.0, 
    "steps_per_print": inf, 
    "train_batch_size": 16, 
    "train_micro_batch_size_per_gpu": 2, 
    "wall_clock_breakdown": false
}
[INFO|trainer.py:1682] 2023-07-25 00:30:36,503 >> ***** Running training *****
[INFO|trainer.py:1683] 2023-07-25 00:30:36,503 >>   Num examples = 180
[INFO|trainer.py:1684] 2023-07-25 00:30:36,503 >>   Num Epochs = 1
[INFO|trainer.py:1685] 2023-07-25 00:30:36,504 >>   Instantaneous batch size per device = 2
[INFO|trainer.py:1688] 2023-07-25 00:30:36,504 >>   Total train batch size (w. parallel, distributed & accumulation) = 16
[INFO|trainer.py:1689] 2023-07-25 00:30:36,504 >>   Gradient Accumulation steps = 2
[INFO|trainer.py:1690] 2023-07-25 00:30:36,504 >>   Total optimization steps = 11
[INFO|trainer.py:1682] 2023-07-25 00:30:36,503 >> ***** Running training *****
[INFO|trainer.py:1683] 2023-07-25 00:30:36,503 >>   Num examples = 180
[INFO|trainer.py:1684] 2023-07-25 00:30:36,503 >>   Num Epochs = 1
[INFO|trainer.py:1685] 2023-07-25 00:30:36,504 >>   Instantaneous batch size per device = 2
[INFO|trainer.py:1688] 2023-07-25 00:30:36,504 >>   Total train batch size (w. parallel, distributed & accumulation) = 16
[INFO|trainer.py:1689] 2023-07-25 00:30:36,504 >>   Gradient Accumulation steps = 2
[INFO|trainer.py:1690] 2023-07-25 00:30:36,504 >>   Total optimization steps = 11
[INFO|trainer.py:1691] 2023-07-25 00:30:36,505 >>   Number of trainable parameters = 6,738,448,384
[INFO|trainer.py:1691] 2023-07-25 00:30:36,505 >>   Number of trainable parameters = 6,738,448,384
0%|          | 0/11 [00:00<?, ?it/s]
You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
[WARNING|logging.py:280] 2023-07-25 00:30:36,510 >> You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
[WARNING|logging.py:280] 2023-07-25 00:30:36,510 >> You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
07/25/2023 00:31:11 - INFO - __main__ -   !!!!!!At this step throughput is 0.45318892143877243
9%|▉         | 1/11 [00:35<05:53, 35.31s/it]
07/25/2023 00:31:42 - INFO - __main__ -   !!!!!!At this step throughput is 0.47042510136622717
18%|█▊        | 2/11 [01:05<04:51, 32.37s/it]
07/25/2023 00:32:13 - INFO - __main__ -   !!!!!!At this step throughput is 0.47886025282245415
27%|██▋       | 3/11 [01:36<04:14, 31.84s/it]
07/25/2023 00:32:44 - INFO - __main__ -   !!!!!!At this step throughput is 0.4844130442539049
36%|███▋      | 4/11 [02:07<03:40, 31.47s/it]
07/25/2023 00:33:15 - INFO - __main__ -   !!!!!!At this step throughput is 0.4884299545826904
45%|████▌     | 5/11 [02:38<03:07, 31.24s/it]
07/25/2023 00:33:45 - INFO - __main__ -   !!!!!!At this step throughput is 0.4916091094101314
55%|█████▍    | 6/11 [03:09<02:35, 31.02s/it]
07/25/2023 00:34:17 - INFO - __main__ -   !!!!!!At this step throughput is 0.49364129923765976
64%|██████▎   | 7/11 [03:41<02:05, 31.42s/it]
07/25/2023 00:34:48 - INFO - __main__ -   !!!!!!At this step throughput is 0.4954246781847558
73%|███████▎  | 8/11 [04:12<01:33, 31.16s/it]
07/25/2023 00:35:18 - INFO - __main__ -   !!!!!!At this step throughput is 0.4971914292369494
82%|████████▏ | 9/11 [04:41<01:01, 30.68s/it]
07/25/2023 00:35:48 - INFO - __main__ -   !!!!!!At this step throughput is 0.49877618579058647
91%|█████████ | 10/11 [05:11<00:30, 30.55s/it]
{'loss': 1.7188, 'learning_rate': 6e-06, 'epoch': 0.87}
91%|█████████ | 10/11 [05:11<00:30, 30.55s/it]
[INFO|trainer.py:3080] 2023-07-25 00:35:48,400 >> ***** Running Evaluation *****
[INFO|trainer.py:3080] 2023-07-25 00:35:48,400 >> ***** Running Evaluation *****
[INFO|trainer.py:3082] 2023-07-25 00:35:48,400 >>   Num examples = 20
[INFO|trainer.py:3085] 2023-07-25 00:35:48,400 >>   Batch size = 8
[INFO|trainer.py:3082] 2023-07-25 00:35:48,400 >>   Num examples = 20
[INFO|trainer.py:3085] 2023-07-25 00:35:48,400 >>   Batch size = 8
0%|          | 0/1 [00:00<?, ?it/s]
{'eval_loss': 1.104188323020935, 'eval_runtime': 3.1127, 'eval_samples_per_second': 6.425, 'eval_steps_per_second': 0.321, 'epoch': 0.87}
91%|█████████ | 10/11 [05:15<00:30, 30.55s/it]
100%|██████████| 1/1 [00:00<00:00, 1080.45it/s]
[INFO|trainer.py:2806] 2023-07-25 00:36:03,394 >> Saving model checkpoint to /opt/ml/model/checkpoint-10
[INFO|trainer.py:2806] 2023-07-25 00:36:03,394 >> Saving model checkpoint to /opt/ml/model/checkpoint-10
[INFO|configuration_utils.py:458] 2023-07-25 00:36:03,394 >> Configuration saved in /opt/ml/model/checkpoint-10/config.json
[INFO|configuration_utils.py:458] 2023-07-25 00:36:03,394 >> Configuration saved in /opt/ml/model/checkpoint-10/config.json
[INFO|configuration_utils.py:379] 2023-07-25 00:36:03,395 >> Configuration saved in /opt/ml/model/checkpoint-10/generation_config.json
[INFO|configuration_utils.py:379] 2023-07-25 00:36:03,395 >> Configuration saved in /opt/ml/model/checkpoint-10/generation_config.json
[INFO|modeling_utils.py:1863] 2023-07-25 00:36:15,055 >> The model is bigger than the maximum size per checkpoint (10GB) and is going to be split in 2 checkpoint shards. You can find where each parameters has been saved in the index located at /opt/ml/model/checkpoint-10/pytorch_model.bin.index.json.
[INFO|modeling_utils.py:1863] 2023-07-25 00:36:15,055 >> The model is bigger than the maximum size per checkpoint (10GB) and is going to be split in 2 checkpoint shards. You can find where each parameters has been saved in the index located at /opt/ml/model/checkpoint-10/pytorch_model.bin.index.json.
[INFO|tokenization_utils_base.py:2210] 2023-07-25 00:36:15,055 >> tokenizer config file saved in /opt/ml/model/checkpoint-10/tokenizer_config.json
[INFO|tokenization_utils_base.py:2210] 2023-07-25 00:36:15,055 >> tokenizer config file saved in /opt/ml/model/checkpoint-10/tokenizer_config.json
[INFO|tokenization_utils_base.py:2217] 2023-07-25 00:36:15,055 >> Special tokens file saved in /opt/ml/model/checkpoint-10/special_tokens_map.json
[INFO|tokenization_utils_base.py:2217] 2023-07-25 00:36:15,055 >> Special tokens file saved in /opt/ml/model/checkpoint-10/special_tokens_map.json
[2023-07-25 00:36:15,659] [INFO] [logging.py:96:log_dist] [Rank 0] [Torch] Checkpoint global_step10 is about to be saved!
/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1802: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
  warnings.warn(
/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1802: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
  warnings.warn(
/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1802: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
  warnings.warn(
/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1802: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
  warnings.warn(
[2023-07-25 00:36:15,675] [INFO] [logging.py:96:log_dist] [Rank 0] Saving model checkpoint: /opt/ml/model/checkpoint-10/global_step10/zero_pp_rank_0_mp_rank_00_model_states.pt
[2023-07-25 00:36:15,675] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /opt/ml/model/checkpoint-10/global_step10/zero_pp_rank_0_mp_rank_00_model_states.pt...
[2023-07-25 00:36:15,689] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /opt/ml/model/checkpoint-10/global_step10/zero_pp_rank_0_mp_rank_00_model_states.pt.
[2023-07-25 00:36:15,689] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /opt/ml/model/checkpoint-10/global_step10/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt...
[2023-07-25 00:37:16,991] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /opt/ml/model/checkpoint-10/global_step10/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt.
[2023-07-25 00:37:16,992] [INFO] [engine.py:3285:_save_zero_checkpoint] zero checkpoint saved /opt/ml/model/checkpoint-10/global_step10/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
[2023-07-25 00:37:17,699] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step10 is ready now!
07/25/2023 00:37:49 - INFO - __main__ -   !!!!!!At this step throughput is 0.49004957528181253
100%|██████████| 11/11 [07:12<00:00, 58.13s/it]
[INFO|trainer.py:1930] 2023-07-25 00:37:49,056 >> 
Training completed. Do not forget to share your model on huggingface.co/models =)
[INFO|trainer.py:1930] 2023-07-25 00:37:49,056 >> 
Training completed. Do not forget to share your model on huggingface.co/models =)
[INFO|trainer.py:2089] 2023-07-25 00:37:49,058 >> Loading best model from /opt/ml/model/checkpoint-10 (score: 1.104188323020935).
[INFO|trainer.py:2089] 2023-07-25 00:37:49,058 >> Loading best model from /opt/ml/model/checkpoint-10 (score: 1.104188323020935).
[INFO|deepspeed.py:381] 2023-07-25 00:37:49,060 >> Attempting to resume from /opt/ml/model/checkpoint-10
[INFO|deepspeed.py:381] 2023-07-25 00:37:49,060 >> Attempting to resume from /opt/ml/model/checkpoint-10
[2023-07-25 00:37:49,109] [INFO] [torch_checkpoint_engine.py:27:load] [Torch] Loading checkpoint from /opt/ml/model/checkpoint-10/global_step10/zero_pp_rank_0_mp_rank_00_model_states.pt...
[2023-07-25 00:37:49,143] [INFO] [torch_checkpoint_engine.py:29:load] [Torch] Loaded checkpoint from /opt/ml/model/checkpoint-10/global_step10/zero_pp_rank_0_mp_rank_00_model_states.pt.
[2023-07-25 00:37:49,151] [INFO] [torch_checkpoint_engine.py:27:load] [Torch] Loading checkpoint from /opt/ml/model/checkpoint-10/global_step10/zero_pp_rank_0_mp_rank_00_model_states.pt...
[2023-07-25 00:37:49,161] [INFO] [torch_checkpoint_engine.py:29:load] [Torch] Loaded checkpoint from /opt/ml/model/checkpoint-10/global_step10/zero_pp_rank_0_mp_rank_00_model_states.pt.
[2023-07-25 00:37:49,180] [INFO] [torch_checkpoint_engine.py:27:load] [Torch] Loading checkpoint from /opt/ml/model/checkpoint-10/global_step10/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt...
[2023-07-25 00:38:05,103] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 230
[2023-07-25 00:38:08,243] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 231
[2023-07-25 00:38:08,243] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 232
[2023-07-25 00:38:11,500] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 233
tjruwase commented 1 year ago

@jiangix-paper, what happens if you set load_optimizer_states=False in the load_checkpoint() call?
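
For reference, that call would look roughly like this (a minimal sketch; `engine` stands for the DeepSpeedEngine returned by deepspeed.initialize, and the checkpoint directory/tag are placeholders taken from the logs above):

# Skipping the optimizer states avoids materializing them at load time; this is
# only suitable when you do not need to resume optimization (there is a similar
# load_lr_scheduler_states flag if the scheduler state is not needed either).
load_path, client_state = engine.load_checkpoint(
    "/opt/ml/model/checkpoint-10",   # placeholder: checkpoint directory
    tag="global_step10",             # placeholder: checkpoint tag
    load_optimizer_states=False,
)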

Neo9061 commented 1 year ago

@jiangix-paper, kindly let me know how it goes with @tjruwase's suggestion; I also urgently need to get this working with DeepSpeed.

RicardoDominguez commented 5 months ago

I am experiencing the same issue. Fine-tuning Qwen 14B on 4x A100 80GB works fine, but I hit OOM when loading the model from a checkpoint, in particular at the end of training with load_best_model=True. The machine has 2 TB of RAM.

Accelerate config

compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_config_file: zero3.json
  zero3_init_flag: true
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Deepspeed config

{
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 0,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 0,
    "stage3_max_reuse_distance": 0,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "bf16": {
    "enabled": "auto"
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": "auto",
      "eps": "auto",
      "weight_decay": "auto"
    }
  },
  "scheduler": {
    "type": "WarmupCosineLR",
    "params": {
      "warmup_num_steps": "auto",
      "warmup_type": "linear",
      "cos_min_ratio": 0.1,
      "total_num_steps": "auto"
    }
  },
  "gradient_accumulation_steps": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}

Error trace

Traceback (most recent call last):
  File "/lustre/home/rolmedo/training-test-task/train_sft_trainer.py", line 324, in <module>
    train_result = trainer.train(resume_from_checkpoint=resume_from_checkpoint)
  File "/lustre/home/rolmedo/axo121/lib/python3.10/site-packages/transformers/trainer.py", line 1837, in train
    return inner_training_loop(
  File "/lustre/home/rolmedo/axo121/lib/python3.10/site-packages/transformers/trainer.py", line 2011, in _inner_training_loop
    deepspeed_load_checkpoint(
  File "/lustre/home/rolmedo/axo121/lib/python3.10/site-packages/transformers/integrations/deepspeed.py", line 432, in deepspeed_load_checkpoint
    load_path, _ = deepspeed_engine.load_checkpoint(
  File "/lustre/home/rolmedo/axo121/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2751, in load_checkpoint
    success = self._load_zero_checkpoint(load_dir, tag, load_optimizer_states=load_optimizer_states)
  File "/lustre/home/rolmedo/axo121/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2942, in _load_zero_checkpoint
    self.optimizer.load_state_dict(state_dict_list=zero_sd_list,
  File "/lustre/home/rolmedo/axo121/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 2564, in load_state_dict
    self._rigid_load_state_dict(state_dict_list[dist.get_rank(group=self.dp_process_group)],
  File "/lustre/home/rolmedo/axo121/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 2491, in _rigid_load_state_dict
    self.optimizer.load_state_dict(state_dict[OPTIMIZER_STATE_DICT])
  File "/lustre/home/rolmedo/axo121/lib/python3.10/site-packages/torch/_compile.py", line 24, in inner
    return torch._dynamo.disable(fn, recursive)(*args, **kwargs)
  File "/lustre/home/rolmedo/axo121/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 328, in _fn
    return fn(*args, **kwargs)
  File "/lustre/home/rolmedo/axo121/lib/python3.10/site-packages/torch/optim/optimizer.py", line 760, in load_state_dict
    state[param] = _cast(param, v, param_id=k, param_groups=state_dict['param_groups'])
  File "/lustre/home/rolmedo/axo121/lib/python3.10/site-packages/torch/optim/optimizer.py", line 747, in _cast
    return {k: _cast(param, v, param_id=param_id, param_groups=param_groups, key=k) for k, v in value.items()}
  File "/lustre/home/rolmedo/axo121/lib/python3.10/site-packages/torch/optim/optimizer.py", line 747, in <dictcomp>
    return {k: _cast(param, v, param_id=param_id, param_groups=param_groups, key=k) for k, v in value.items()}
  File "/lustre/home/rolmedo/axo121/lib/python3.10/site-packages/torch/optim/optimizer.py", line 745, in _cast
    return Optimizer._process_value_according_to_param_policy(param, value, param_id, param_groups, key)
  File "/lustre/home/rolmedo/axo121/lib/python3.10/site-packages/torch/optim/optimizer.py", line 627, in _process_value_according_to_param_policy
    return value.to(dtype=param.dtype, device=param.device)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 68.00 MiB. GPU 0 has a total capacty of 79.33 GiB of which 61.69 MiB is free. Including non-PyTorch memory, this process has 79.26 GiB memory in use. Of the allocated memory 76.42 GiB is allocated by PyTorch, and 1.57 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF