jiangix-paper opened 1 year ago
Hello, the related link is as follows: https://github.com/huggingface/accelerate/issues/1707
@jiangix-paper, can you please share your stack trace?
Hello @tjruwase, another user has also experienced this. This is related to CPU RAM going OOM when loading the ckpt. The stack trace is also shared by them: https://github.com/huggingface/transformers/issues/25027#issuecomment-1648802886
Thanks @pacman100 for pointing to the source. Hi @tjruwase and team, to provide more details: I am able to train Llama-2 7B on a g5.12xlarge EC2 instance successfully until the load-best-model stage.
For my purpose, I only need the checkpoint for evaluation / inference rather than resuming previously stopped training. In that sense, can I skip loading the optimizer states? I see it going OOM while loading them.
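For evaluation only, a rough sketch of what I have in mind, assuming the HF-format shards that the Trainer writes into the same checkpoint folder (see the save logs further down):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Rough sketch (evaluation / inference only): load the HF-format weights that the
# Trainer saves into the checkpoint folder, and skip the DeepSpeed engine /
# optimizer-state resume entirely. Path taken from the save logs below.
ckpt_dir = "/opt/ml/model/checkpoint-10"
tokenizer = AutoTokenizer.from_pretrained(ckpt_dir)
model = AutoModelForCausalLM.from_pretrained(ckpt_dir, torch_dtype="auto")
model.eval()
```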
g5.12xlarge EC2 instance (4 GPUs, 96 GB total GPU memory, 48 vCPUs with 192 GB RAM).
cmd = /opt/conda/bin/python3.10 -u -m deepspeed.launcher.launch --world_info=<OMIT_AS_NON_IMPORTANT> --master_addr=<OMIT_AS_NON_IMPORTANT> --master_port=<OMIT_AS_NON_IMPORTANT> --enable_each_rank_log=None run_clm.py --deepspeed ds_config.json --model_name_or_path /tmp --train_file /opt/ml/input/data/train --do_train --output_dir /opt/ml/model --num_train_epochs 1 --gradient_accumulation_steps 4 --per_device_train_batch_size 8 --per_device_eval_batch_size 8 --logging_steps 10 --warmup_ratio 0.1 --learning_rate 6e-06 --weight_decay 0.2 --seed 10 --max_input_length -1 --validation_split_ratio 0.1 --train_data_split_seed 0 --max_steps -1 --early_stopping_patience 3 --early_stopping_threshold 0.0 --adam_beta1 0.9 --adam_beta2 0.999 --max_grad_norm 1.0 --label_smoothing_factor 0.0 --logging_strategy steps --save_strategy steps --save_steps 10 --dataloader_num_workers 0 --lr_scheduler_type constant_with_warmup --warmup_steps 0 --evaluation_strategy steps --eval_steps 10 --bf16 --instruction_tuned --gradient_checkpointing --save_total_limit 1
I observe it going OOM; logs below.
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print] bfloat16_enabled ............. True
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print] checkpoint_parallel_write_pipeline False
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print] checkpoint_tag_validation_enabled True
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print] checkpoint_tag_validation_fail False
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f9090172bf0>
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print] communication_data_type ...... None
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print] curriculum_enabled_legacy .... False
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print] curriculum_params_legacy ..... False
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print] data_efficiency_enabled ...... False
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print] dataloader_drop_last ......... False
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print] disable_allgather ............ False
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print] dump_state ................... False
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print] dynamic_loss_scale_args ...... None
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print] eigenvalue_enabled ........... False
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print] eigenvalue_gas_boundary_resolution 1
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print] eigenvalue_layer_name ........ bert.encoder.layer
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print] eigenvalue_layer_num ......... 0
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print] eigenvalue_max_iter .......... 100
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print] eigenvalue_stability ......... 1e-06
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print] eigenvalue_tol ............... 0.01
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print] eigenvalue_verbose ........... False
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print] elasticity_enabled ........... False
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print] flops_profiler_config ........ {
"enabled": false,
"recompute_fwd_factor": 0.0,
"profile_step": 1,
"module_depth": -1,
"top_modules": 1,
"detailed": true,
"output_file": null
}
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print] fp16_auto_cast ............... None
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print] fp16_enabled ................. False
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print] fp16_master_weights_and_gradients False
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print] global_rank .................. 0
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print] grad_accum_dtype ............. None
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print] gradient_accumulation_steps .. 2
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print] gradient_clipping ............ 1.0
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print] gradient_predivide_factor .... 1.0
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print] initial_dynamic_scale ........ 1
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print] load_universal_checkpoint .... False
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print] loss_scale ................... 1.0
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print] memory_breakdown ............. False
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print] mics_hierarchial_params_gather False
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print] mics_shard_size .............. -1
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print] nebula_config ................ {
"enabled": false,
"persistent_storage_path": null,
"persistent_time_interval": 100,
"num_of_version_in_retention": 2,
"enable_nebula_load": true,
"load_path": null
}
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print] optimizer_legacy_fusion ...... False
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print] optimizer_name ............... adamw
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print] optimizer_params ............. {'lr': 6e-06, 'betas': [0.9, 0.999], 'eps': 1e-08, 'weight_decay': 0.2}
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print] pld_enabled .................. False
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print] pld_params ................... False
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print] prescale_gradients ........... False
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print] scheduler_name ............... WarmupLR
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print] scheduler_params ............. {'warmup_min_lr': 0, 'warmup_max_lr': 6e-06, 'warmup_num_steps': 2}
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print] sparse_attention ............. None
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print] sparse_gradients_enabled ..... False
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print] steps_per_print .............. inf
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print] train_batch_size ............. 16
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print] train_micro_batch_size_per_gpu 2
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print] use_node_local_storage ....... False
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print] wall_clock_breakdown ......... False
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print] world_size ................... 4
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print] zero_allow_untested_optimizer False
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print] zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=16777216 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='cpu', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=False) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='cpu', nvme_path=None, buffer_count=4, pin_memory=False, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False) sub_group_size=1000000000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=15099494 param_persistence_threshold=40960 model_persistence_threshold=sys.maxsize max_live_parameters=1000000000 max_reuse_distance=1000000000 gather_16bit_weights_on_model_save=True ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print] zero_enabled ................. True
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print] zero_force_ds_cpu_optimizer .. True
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print] zero_optimization_stage ...... 3
[2023-07-25 00:30:36,503] [INFO] [config.py:950:print_user_config] json = {
"fp16": {
"enabled": false,
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 12,
"hysteresis": 2,
"min_loss_scale": 1
},
"bf16": {
"enabled": true
},
"optimizer": {
"type": "AdamW",
"params": {
"lr": 6e-06,
"betas": [0.9, 0.999],
"eps": 1e-08,
"weight_decay": 0.2
}
},
"scheduler": {
"type": "WarmupLR",
"params": {
"warmup_min_lr": 0,
"warmup_max_lr": 6e-06,
"warmup_num_steps": 2
}
},
"zero_optimization": {
"stage": 3,
"offload_optimizer": {
"device": "cpu",
"pin_memory": false
},
"offload_param": {
"device": "cpu",
"pin_memory": false
},
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 1.000000e+09,
"reduce_bucket_size": 1.677722e+07,
"stage3_prefetch_bucket_size": 1.509949e+07,
"stage3_param_persistence_threshold": 4.096000e+04,
"stage3_max_live_parameters": 1.000000e+09,
"stage3_max_reuse_distance": 1.000000e+09,
"stage3_gather_fp16_weights_on_model_save": true
},
"gradient_accumulation_steps": 2,
"gradient_clipping": 1.0,
"steps_per_print": inf,
"train_batch_size": 16,
"train_micro_batch_size_per_gpu": 2,
"wall_clock_breakdown": false
}
[INFO|trainer.py:1682] 2023-07-25 00:30:36,503 >> ***** Running training *****
[INFO|trainer.py:1683] 2023-07-25 00:30:36,503 >> Num examples = 180
[INFO|trainer.py:1684] 2023-07-25 00:30:36,503 >> Num Epochs = 1
[INFO|trainer.py:1685] 2023-07-25 00:30:36,504 >> Instantaneous batch size per device = 2
[INFO|trainer.py:1688] 2023-07-25 00:30:36,504 >> Total train batch size (w. parallel, distributed & accumulation) = 16
[INFO|trainer.py:1689] 2023-07-25 00:30:36,504 >> Gradient Accumulation steps = 2
[INFO|trainer.py:1690] 2023-07-25 00:30:36,504 >> Total optimization steps = 11
[INFO|trainer.py:1682] 2023-07-25 00:30:36,503 >> ***** Running training *****
[INFO|trainer.py:1683] 2023-07-25 00:30:36,503 >> Num examples = 180
[INFO|trainer.py:1684] 2023-07-25 00:30:36,503 >> Num Epochs = 1
[INFO|trainer.py:1685] 2023-07-25 00:30:36,504 >> Instantaneous batch size per device = 2
[INFO|trainer.py:1688] 2023-07-25 00:30:36,504 >> Total train batch size (w. parallel, distributed & accumulation) = 16
[INFO|trainer.py:1689] 2023-07-25 00:30:36,504 >> Gradient Accumulation steps = 2
[INFO|trainer.py:1690] 2023-07-25 00:30:36,504 >> Total optimization steps = 11
[INFO|trainer.py:1691] 2023-07-25 00:30:36,505 >> Number of trainable parameters = 6,738,448,384
[INFO|trainer.py:1691] 2023-07-25 00:30:36,505 >> Number of trainable parameters = 6,738,448,384
0%| | 0/11 [00:00<?, ?it/s]
You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
[WARNING|logging.py:280] 2023-07-25 00:30:36,510 >> You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
[WARNING|logging.py:280] 2023-07-25 00:30:36,510 >> You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
07/25/2023 00:31:11 - INFO - __main__ - !!!!!!At this step throughput is 0.45318892143877243
9%|▉ | 1/11 [00:35<05:53, 35.31s/it]
07/25/2023 00:31:42 - INFO - __main__ - !!!!!!At this step throughput is 0.47042510136622717
18%|█▊ | 2/11 [01:05<04:51, 32.37s/it]
07/25/2023 00:32:13 - INFO - __main__ - !!!!!!At this step throughput is 0.47886025282245415
27%|██▋ | 3/11 [01:36<04:14, 31.84s/it]
07/25/2023 00:32:44 - INFO - __main__ - !!!!!!At this step throughput is 0.4844130442539049
36%|███▋ | 4/11 [02:07<03:40, 31.47s/it]
07/25/2023 00:33:15 - INFO - __main__ - !!!!!!At this step throughput is 0.4884299545826904
45%|████▌ | 5/11 [02:38<03:07, 31.24s/it]
07/25/2023 00:33:45 - INFO - __main__ - !!!!!!At this step throughput is 0.4916091094101314
55%|█████▍ | 6/11 [03:09<02:35, 31.02s/it]
07/25/2023 00:34:17 - INFO - __main__ - !!!!!!At this step throughput is 0.49364129923765976
64%|██████▎ | 7/11 [03:41<02:05, 31.42s/it]
07/25/2023 00:34:48 - INFO - __main__ - !!!!!!At this step throughput is 0.4954246781847558
73%|███████▎ | 8/11 [04:12<01:33, 31.16s/it]
07/25/2023 00:35:18 - INFO - __main__ - !!!!!!At this step throughput is 0.4971914292369494
82%|████████▏ | 9/11 [04:41<01:01, 30.68s/it]
07/25/2023 00:35:48 - INFO - __main__ - !!!!!!At this step throughput is 0.49877618579058647
91%|█████████ | 10/11 [05:11<00:30, 30.55s/it]
{'loss': 1.7188, 'learning_rate': 6e-06, 'epoch': 0.87}
91%|█████████ | 10/11 [05:11<00:30, 30.55s/it]
[INFO|trainer.py:3080] 2023-07-25 00:35:48,400 >> ***** Running Evaluation *****
[INFO|trainer.py:3080] 2023-07-25 00:35:48,400 >> ***** Running Evaluation *****
[INFO|trainer.py:3082] 2023-07-25 00:35:48,400 >> Num examples = 20
[INFO|trainer.py:3085] 2023-07-25 00:35:48,400 >> Batch size = 8
[INFO|trainer.py:3082] 2023-07-25 00:35:48,400 >> Num examples = 20
[INFO|trainer.py:3085] 2023-07-25 00:35:48,400 >> Batch size = 8
0%| | 0/1 [00:00<?, ?it/s]
{'eval_loss': 1.104188323020935, 'eval_runtime': 3.1127, 'eval_samples_per_second': 6.425, 'eval_steps_per_second': 0.321, 'epoch': 0.87}
91%|█████████ | 10/11 [05:15<00:30, 30.55s/it]
100%|██████████| 1/1 [00:00<00:00, 1080.45it/s]
[INFO|trainer.py:2806] 2023-07-25 00:36:03,394 >> Saving model checkpoint to /opt/ml/model/checkpoint-10
[INFO|trainer.py:2806] 2023-07-25 00:36:03,394 >> Saving model checkpoint to /opt/ml/model/checkpoint-10
[INFO|configuration_utils.py:458] 2023-07-25 00:36:03,394 >> Configuration saved in /opt/ml/model/checkpoint-10/config.json
[INFO|configuration_utils.py:458] 2023-07-25 00:36:03,394 >> Configuration saved in /opt/ml/model/checkpoint-10/config.json
[INFO|configuration_utils.py:379] 2023-07-25 00:36:03,395 >> Configuration saved in /opt/ml/model/checkpoint-10/generation_config.json
[INFO|configuration_utils.py:379] 2023-07-25 00:36:03,395 >> Configuration saved in /opt/ml/model/checkpoint-10/generation_config.json
[INFO|modeling_utils.py:1863] 2023-07-25 00:36:15,055 >> The model is bigger than the maximum size per checkpoint (10GB) and is going to be split in 2 checkpoint shards. You can find where each parameters has been saved in the index located at /opt/ml/model/checkpoint-10/pytorch_model.bin.index.json.
[INFO|modeling_utils.py:1863] 2023-07-25 00:36:15,055 >> The model is bigger than the maximum size per checkpoint (10GB) and is going to be split in 2 checkpoint shards. You can find where each parameters has been saved in the index located at /opt/ml/model/checkpoint-10/pytorch_model.bin.index.json.
[INFO|tokenization_utils_base.py:2210] 2023-07-25 00:36:15,055 >> tokenizer config file saved in /opt/ml/model/checkpoint-10/tokenizer_config.json
[INFO|tokenization_utils_base.py:2210] 2023-07-25 00:36:15,055 >> tokenizer config file saved in /opt/ml/model/checkpoint-10/tokenizer_config.json
[INFO|tokenization_utils_base.py:2217] 2023-07-25 00:36:15,055 >> Special tokens file saved in /opt/ml/model/checkpoint-10/special_tokens_map.json
[INFO|tokenization_utils_base.py:2217] 2023-07-25 00:36:15,055 >> Special tokens file saved in /opt/ml/model/checkpoint-10/special_tokens_map.json
[2023-07-25 00:36:15,659] [INFO] [logging.py:96:log_dist] [Rank 0] [Torch] Checkpoint global_step10 is about to be saved!
/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1802: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
warnings.warn(
/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1802: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
warnings.warn(
/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1802: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
warnings.warn(
/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1802: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
warnings.warn(
[2023-07-25 00:36:15,675] [INFO] [logging.py:96:log_dist] [Rank 0] Saving model checkpoint: /opt/ml/model/checkpoint-10/global_step10/zero_pp_rank_0_mp_rank_00_model_states.pt
[2023-07-25 00:36:15,675] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /opt/ml/model/checkpoint-10/global_step10/zero_pp_rank_0_mp_rank_00_model_states.pt...
[2023-07-25 00:36:15,689] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /opt/ml/model/checkpoint-10/global_step10/zero_pp_rank_0_mp_rank_00_model_states.pt.
[2023-07-25 00:36:15,689] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /opt/ml/model/checkpoint-10/global_step10/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt...
[2023-07-25 00:37:16,991] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /opt/ml/model/checkpoint-10/global_step10/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt.
[2023-07-25 00:37:16,992] [INFO] [engine.py:3285:_save_zero_checkpoint] zero checkpoint saved /opt/ml/model/checkpoint-10/global_step10/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
[2023-07-25 00:37:17,699] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step10 is ready now!
07/25/2023 00:37:49 - INFO - __main__ - !!!!!!At this step throughput is 0.49004957528181253
100%|██████████| 11/11 [07:12<00:00, 58.13s/it]
[INFO|trainer.py:1930] 2023-07-25 00:37:49,056 >>
Training completed. Do not forget to share your model on huggingface.co/models =)
[INFO|trainer.py:1930] 2023-07-25 00:37:49,056 >>
Training completed. Do not forget to share your model on huggingface.co/models =)
[INFO|trainer.py:2089] 2023-07-25 00:37:49,058 >> Loading best model from /opt/ml/model/checkpoint-10 (score: 1.104188323020935).
[INFO|trainer.py:2089] 2023-07-25 00:37:49,058 >> Loading best model from /opt/ml/model/checkpoint-10 (score: 1.104188323020935).
[INFO|deepspeed.py:381] 2023-07-25 00:37:49,060 >> Attempting to resume from /opt/ml/model/checkpoint-10
[INFO|deepspeed.py:381] 2023-07-25 00:37:49,060 >> Attempting to resume from /opt/ml/model/checkpoint-10
[2023-07-25 00:37:49,109] [INFO] [torch_checkpoint_engine.py:27:load] [Torch] Loading checkpoint from /opt/ml/model/checkpoint-10/global_step10/zero_pp_rank_0_mp_rank_00_model_states.pt...
[2023-07-25 00:37:49,143] [INFO] [torch_checkpoint_engine.py:29:load] [Torch] Loaded checkpoint from /opt/ml/model/checkpoint-10/global_step10/zero_pp_rank_0_mp_rank_00_model_states.pt.
[2023-07-25 00:37:49,151] [INFO] [torch_checkpoint_engine.py:27:load] [Torch] Loading checkpoint from /opt/ml/model/checkpoint-10/global_step10/zero_pp_rank_0_mp_rank_00_model_states.pt...
[2023-07-25 00:37:49,161] [INFO] [torch_checkpoint_engine.py:29:load] [Torch] Loaded checkpoint from /opt/ml/model/checkpoint-10/global_step10/zero_pp_rank_0_mp_rank_00_model_states.pt.
[2023-07-25 00:37:49,180] [INFO] [torch_checkpoint_engine.py:27:load] [Torch] Loading checkpoint from /opt/ml/model/checkpoint-10/global_step10/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt...
[2023-07-25 00:38:05,103] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 230
[2023-07-25 00:38:08,243] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 231
[2023-07-25 00:38:08,243] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 232
[2023-07-25 00:38:11,500] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 233
@jiangix-paper, what happens if you set load_optimizer_states=False in the load_checkpoint() call?
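For concreteness, a minimal sketch of that call, assuming engine is the deepspeed.DeepSpeedEngine returned by deepspeed.initialize() (or the engine held by the HF Trainer integration) and using the checkpoint directory from the logs above:

```python
# Minimal sketch: `engine` is assumed to be the deepspeed.DeepSpeedEngine returned by
# deepspeed.initialize(); /opt/ml/model/checkpoint-10 is the checkpoint directory
# shown in the logs above (the tag is picked up from its `latest` file).
load_path, client_state = engine.load_checkpoint(
    "/opt/ml/model/checkpoint-10",
    load_optimizer_states=False,     # skip the large bf16 ZeRO optimizer shards
    load_lr_scheduler_states=False,  # also unnecessary when only evaluating
)
print("loaded from:", load_path)
```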
@jiangix-paper, kindly let me know how it goes with @tjruwase's suggestion; I am also under time pressure to make this work with DeepSpeed.
I am experiencing the same issue. Fine-tuning Qwen 14B on 4x A100 80GB works fine, but it goes OOM when trying to load the model from a checkpoint, in particular at the end of training with load_best_model=True. The machine has 2 TB of RAM.
Accelerate config
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_config_file: zero3.json
  zero3_init_flag: true
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
Deepspeed config
{
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 0,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 0,
    "stage3_max_reuse_distance": 0,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "bf16": {
    "enabled": "auto"
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": "auto",
      "eps": "auto",
      "weight_decay": "auto"
    }
  },
  "scheduler": {
    "type": "WarmupCosineLR",
    "params": {
      "warmup_num_steps": "auto",
      "warmup_type": "linear",
      "cos_min_ratio": 0.1,
      "total_num_steps": "auto"
    }
  },
  "gradient_accumulation_steps": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}
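The "auto" entries above are filled in from my TrainingArguments by the transformers DeepSpeed integration; roughly like this (a simplified sketch with placeholder values, not my exact training script):

```python
from transformers import TrainingArguments

# Simplified sketch (placeholder values): the "auto" fields in zero3.json are resolved
# from these arguments when the Trainer is initialized. Requires the JSON above to be
# saved as zero3.json next to the script.
training_args = TrainingArguments(
    output_dir="out",                 # placeholder
    bf16=True,                        # fills bf16.enabled
    learning_rate=2e-5,               # placeholder; fills optimizer.params.lr
    per_device_train_batch_size=2,    # fills train_micro_batch_size_per_gpu
    gradient_accumulation_steps=2,    # fills gradient_accumulation_steps
    deepspeed="zero3.json",           # the config shown above
    evaluation_strategy="steps",
    save_strategy="steps",
    load_best_model_at_end=True,      # the point where the checkpoint reload (and OOM) happens
)
```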
Error trace
Traceback (most recent call last):
  File "/lustre/home/rolmedo/training-test-task/train_sft_trainer.py", line 324, in <module>
    train_result = trainer.train(resume_from_checkpoint=resume_from_checkpoint)
  File "/lustre/home/rolmedo/axo121/lib/python3.10/site-packages/transformers/trainer.py", line 1837, in train
    return inner_training_loop(
  File "/lustre/home/rolmedo/axo121/lib/python3.10/site-packages/transformers/trainer.py", line 2011, in _inner_training_loop
    deepspeed_load_checkpoint(
  File "/lustre/home/rolmedo/axo121/lib/python3.10/site-packages/transformers/integrations/deepspeed.py", line 432, in deepspeed_load_checkpoint
    load_path, _ = deepspeed_engine.load_checkpoint(
  File "/lustre/home/rolmedo/axo121/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2751, in load_checkpoint
    success = self._load_zero_checkpoint(load_dir, tag, load_optimizer_states=load_optimizer_states)
  File "/lustre/home/rolmedo/axo121/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2942, in _load_zero_checkpoint
    self.optimizer.load_state_dict(state_dict_list=zero_sd_list,
  File "/lustre/home/rolmedo/axo121/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 2564, in load_state_dict
    self._rigid_load_state_dict(state_dict_list[dist.get_rank(group=self.dp_process_group)],
  File "/lustre/home/rolmedo/axo121/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 2491, in _rigid_load_state_dict
    self.optimizer.load_state_dict(state_dict[OPTIMIZER_STATE_DICT])
  File "/lustre/home/rolmedo/axo121/lib/python3.10/site-packages/torch/_compile.py", line 24, in inner
    return torch._dynamo.disable(fn, recursive)(*args, **kwargs)
  File "/lustre/home/rolmedo/axo121/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 328, in _fn
    return fn(*args, **kwargs)
  File "/lustre/home/rolmedo/axo121/lib/python3.10/site-packages/torch/optim/optimizer.py", line 760, in load_state_dict
    state[param] = _cast(param, v, param_id=k, param_groups=state_dict['param_groups'])
  File "/lustre/home/rolmedo/axo121/lib/python3.10/site-packages/torch/optim/optimizer.py", line 747, in _cast
    return {k: _cast(param, v, param_id=param_id, param_groups=param_groups, key=k) for k, v in value.items()}
  File "/lustre/home/rolmedo/axo121/lib/python3.10/site-packages/torch/optim/optimizer.py", line 747, in <dictcomp>
    return {k: _cast(param, v, param_id=param_id, param_groups=param_groups, key=k) for k, v in value.items()}
  File "/lustre/home/rolmedo/axo121/lib/python3.10/site-packages/torch/optim/optimizer.py", line 745, in _cast
    return Optimizer._process_value_according_to_param_policy(param, value, param_id, param_groups, key)
  File "/lustre/home/rolmedo/axo121/lib/python3.10/site-packages/torch/optim/optimizer.py", line 627, in _process_value_according_to_param_policy
    return value.to(dtype=param.dtype, device=param.device)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 68.00 MiB. GPU 0 has a total capacty of 79.33 GiB of which 61.69 MiB is free. Including non-PyTorch memory, this process has 79.26 GiB memory in use. Of the allocated memory 76.42 GiB is allocated by PyTorch, and 1.57 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
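The failing frame is torch's Optimizer.load_state_dict: each saved state tensor is cast onto the corresponding parameter's device (the GPU here), so the rank's optimizer state gets materialized on the GPU during the load, which is why this shows up as a CUDA OOM despite the 2 TB of CPU RAM. A purely illustrative sketch of that cast (hypothetical tensors, not the library code):

```python
import torch

# Illustration only: mimics the cast in the last frame of the trace,
# value.to(dtype=param.dtype, device=param.device). The parameter lives on the GPU,
# so the saved state tensor (loaded onto CPU by torch.load) is copied onto that
# same GPU during Optimizer.load_state_dict.
device = "cuda" if torch.cuda.is_available() else "cpu"
param = torch.zeros(1_000_000, dtype=torch.float32, device=device)  # stand-in for a ZeRO-3 fp32 partition
saved_exp_avg = torch.zeros_like(param, device="cpu")               # stand-in for a checkpointed Adam state

restored = saved_exp_avg.to(dtype=param.dtype, device=param.device)  # GPU allocation happens here
print(restored.device)
```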
System Info
Information
Tasks
no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
Reproduction
I pretrain a 27B model from scratch with DeepSpeed stage 3 (no CPU offload) on 8x A100 80GB; the batch size per GPU is 2. I use accelerator.save_state() to save the optimizer / LR scheduler. The saved files are as follows:
When I want to resume from the above saved checkpoint, I use the following code:
model, train_dataloader, optimizer, lr_scheduler = accelerator.prepare(
    model, train_dataloader, optimizer, lr_scheduler
)
accelerator.load_state('/pretrained_model/xxx')
But I get a CUDA out of memory error. Can you help me? Thanks a lot.
Expected behavior
I expected it not to cause CUDA out of memory.