huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Llama-2 7B fine-tuning with DeepSpeed: OOM error while loading the best model at the end when `load_best_model_at_end` is True. #25027

Closed: Neo9061 closed this issue 1 year ago

Neo9061 commented 1 year ago

System Info

Hi Community!

I am using run_clm.py with DeepSpeed to fine-tune Llama-2 7B on a g5.12xlarge EC2 instance (4 GPUs with 96 GB total GPU memory, and 48 vCPUs with 192 GB of RAM).

The model trains successfully until the very last step of loading the best model. Because the load_best_model_at_end argument is True, trainer.py uses the DeepSpeed engine to load the model, and that load goes OOM.
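
For reference, my Trainer setup looks roughly like the sketch below. The paths and step counts are illustrative rather than my exact values (those come from my run_clm.py invocation), and it assumes a local ZeRO-3 DeepSpeed config file named ds_config_zero3.json:

from transformers import TrainingArguments

# Minimal sketch of the relevant arguments; values are illustrative.
args = TrainingArguments(
    output_dir="/opt/ml/model",
    deepspeed="ds_config_zero3.json",  # assumed file name: ZeRO-3 with CPU offload
    evaluation_strategy="steps",
    eval_steps=10,
    save_strategy="steps",
    save_steps=10,
    save_total_limit=1,
    load_best_model_at_end=True,  # the flag that triggers the failing load at the end
)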

Who can help?

@sgugger @pacman100 @ArthurZucker and @younesbelkada

Reproduction

During the load-best-model stage, I saw an odd printout of the DeepSpeed initialization log that normally only appears at the beginning of training. I then verified this and found this line in trainer.py, which made me think the OOM comes from an unnecessary call to the DeepSpeed init function (as the comment above that code also suggests).
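
Paraphrasing from memory what that code path in 4.28.x looks like (the exact code may differ; this is only to illustrate the suspicion):

# Hedged paraphrase of the DeepSpeed branch in Trainer._load_best_model (4.28.x),
# not the exact source: loading the best model goes back through deepspeed_init,
# which rebuilds the whole engine instead of only reloading the weights.
if self.deepspeed:
    deepspeed_engine, optimizer, lr_scheduler = deepspeed_init(
        self,
        num_training_steps=self.args.max_steps,
        resume_from_checkpoint=self.state.best_model_checkpoint,
    )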

Next, I used the latest version of Transformers, 4.31.0, since I saw it no longer uses DeepSpeed init to load the best model (line and DeepSpeed loading function). Then I hit a Llama 2 configuration bug; see below. I don't know why, during loading of the best model, this DeepSpeed line is not triggered but this line is.

[INFO|trainer.py:1934] 2023-07-24 02:37:09,873 >> 
Training completed. Do not forget to share your model on huggingface.co/models =)
[INFO|trainer.py:2093] 2023-07-24 02:37:09,889 >> Loading best model from /opt/ml/model/checkpoint-10 (score: 1.4037604331970215).
Traceback (most recent call last):
  File "/opt/ml/code/run_clm.py", line 229, in <module>
    main()
  File "/opt/ml/code/run_clm.py", line 178, in main
    train_result = trainer.train()  # load model/optimizer/scheduler states
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1944, in _inner_training_loop
    self._load_best_model()
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2168, in _load_best_model
    load_result = load_sharded_checkpoint(
  File "/opt/conda/lib/python3.10/site-packages/transformers/modeling_utils.py", line 431, in load_sharded_checkpoint
    model.load_state_dict(state_dict, strict=False)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for LlamaForCausalLM:
    size mismatch for model.layers.24.self_attn.q_proj.weight: copying a param with shape torch.Size([4096, 4096]) from checkpoint, the shape in current model is torch.Size([0]).
    [... identical size-mismatch lines repeated for every q/k/v/o_proj and gate/up/down_proj weight of layers 24-31 ...]
    size mismatch for lm_head.weight: copying a param with shape torch.Size([32004, 4096]) from checkpoint, the shape in current model is torch.Size([0]).
[2023-07-24 02:37:16,892] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 164
[2023-07-24 02:37:20,147] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 165
Traceback (most recent call last):
  File "/opt/ml/code/run_clm.py", line 229, in <module>
    main()
  File "/opt/ml/code/run_clm.py", line 178, in main
    train_result = trainer.train()  # load model/optimizer/scheduler states
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1944, in _inner_training_loop
    self._load_best_model()
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2168, in _load_best_model
    load_result = load_sharded_checkpoint(
  File "/opt/conda/lib/python3.10/site-packages/transformers/modeling_utils.py", line 431, in load_sharded_checkpoint
    model.load_state_dict(state_dict, strict=False)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for LlamaForCausalLM:
    size mismatch for model.embed_tokens.weight: copying a param with shape torch.Size([32004, 4096]) from checkpoint, the shape in current model is torch.Size([0]).
    size mismatch for model.layers.0.self_attn.q_proj.weight: copying a param with shape torch.Size([4096, 4096]) from checkpoint, the shape in current model is torch.Size([0]).
    [... identical size-mismatch lines repeated for every q/k/v/o_proj and gate/up/down_proj weight of layers 0-23 ...]
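
If I understand ZeRO stage 3 correctly, those torch.Size([0]) shapes are expected rather than corruption: each rank holds only a shard of every parameter, so the live module reports empty tensors unless the parameters are gathered first. A hypothetical standalone sketch of what I mean (run under a deepspeed launcher with a ZeRO-3 training config; the config file name is assumed, and this is not the Trainer code path):

import deepspeed
import torch.nn as nn

# Hypothetical sketch: under ZeRO-3, each rank keeps only a shard of every
# parameter, so .shape reports torch.Size([0]) outside a gather context --
# which matches the size-mismatch errors above.
model = nn.Linear(4096, 4096)
engine, _, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config="ds_config_zero3.json",  # assumed ZeRO-3 config with an optimizer section
)

weight = engine.module.weight
print(weight.shape)  # torch.Size([0]) while partitioned

with deepspeed.zero.GatheredParameters([weight], modifier_rank=0):
    print(weight.shape)  # the full torch.Size([4096, 4096]) inside the gather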

Then I thought that, while waiting for HF to fix the Llama 2 configuration issue, I could take the latest best-model-loading code from transformers 4.31.0 and apply it to my setup with transformers 4.28.1.

Thus I disabled load_best_model_at_end and tried to load the best checkpoint after Trainer.train() with the following code.

import glob

train_result = trainer.train()

# Only one checkpoint directory exists because save_total_limit is set to 1.
checkpoint_dirs = sorted(glob.glob("/opt/ml/model/checkpoint-*"))
checkpoint_path = checkpoint_dirs[0]
load_path, _ = trainer.model_wrapped.load_checkpoint(
    checkpoint_path, load_optimizer_states=False, load_lr_scheduler_states=False
)

trainer.save_model()

I hit OOM when load_optimizer_states and load_lr_scheduler_states were True. Since the saved model is only used for evaluation/inference rather than for resuming training from the checkpoint, I figured I don't need the optimizer and LR scheduler states. However, even with both set to False, I still hit the error.
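
One alternative I have not fully verified would be to consolidate the ZeRO shards into a full fp32 state dict on CPU with DeepSpeed's zero_to_fp32 utilities, which should avoid gathering on GPU. Roughly (trainer and checkpoint_path as in the snippet above):

from deepspeed.utils.zero_to_fp32 import load_state_dict_from_zero_checkpoint

# Untested sketch: reconstruct the full fp32 weights on CPU from the ZeRO
# checkpoint shards, then save the consolidated model.
model = load_state_dict_from_zero_checkpoint(trainer.model, checkpoint_path)
trainer.save_model()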

Please advise what you think about this issue. Thanks!

Expected behavior

I expect the best model to load without an OOM error, since the model trains successfully right up to this final loading step.

pacman100 commented 1 year ago

Hello, I'm able to run the following minimal example without any issues:

export WANDB_DISABLED="true"
export CUDA_VISIBLE_DEVICES="0,1"
cd transformers
deepspeed --num_nodes 1 --num_gpus 2 --master_port 10999 \
    /home/sourab/transformers/examples/pytorch/language-modeling/run_clm.py \
    --model_name_or_path gpt2 \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --do_train \
    --do_eval \
    --max_train_samples 30 \
    --max_eval_samples 10 \
    --block_size 512 \
    --overwrite_output_dir \
    --gradient_checkpointing \
    --save_strategy "steps" \
    --evaluation_strategy "steps" \
    --eval_steps 10 \
    --save_steps 10 \
    --load_best_model_at_end \
    --output_dir /tmp/test-clm \
    --deepspeed /home/sourab/transformers/tests/deepspeed/ds_config_zero3.json

output:

[2023-07-24 10:39:47,947] [INFO] [config.py:950:print_user_config]   json = {
    "fp16": {
        "enabled": false, 
        "loss_scale": 0, 
        "loss_scale_window": 1000, 
        "initial_scale_power": 16, 
        "hysteresis": 2, 
        "min_loss_scale": 1
    }, 
    "bf16": {
        "enabled": false
    }, 
    "optimizer": {
        "type": "AdamW", 
        "params": {
            "lr": 5e-05, 
            "betas": [0.9, 0.999], 
            "eps": 1e-08, 
            "weight_decay": 0.0
        }
    }, 
    "scheduler": {
        "type": "WarmupLR", 
        "params": {
            "warmup_min_lr": 0, 
            "warmup_max_lr": 5e-05, 
            "warmup_num_steps": 0
        }
    }, 
    "zero_optimization": {
        "stage": 3, 
        "offload_optimizer": {
            "device": "cpu", 
            "pin_memory": true
        }, 
        "offload_param": {
            "device": "cpu", 
            "pin_memory": true
        }, 
        "overlap_comm": true, 
        "contiguous_gradients": true, 
        "sub_group_size": 1.000000e+09, 
        "reduce_bucket_size": 5.898240e+05, 
        "stage3_prefetch_bucket_size": 5.308416e+05, 
        "stage3_param_persistence_threshold": 7.680000e+03, 
        "stage3_max_live_parameters": 1.000000e+09, 
        "stage3_max_reuse_distance": 1.000000e+09, 
        "stage3_gather_16bit_weights_on_model_save": true
    }, 
    "gradient_accumulation_steps": 1, 
    "gradient_clipping": 1.0, 
    "steps_per_print": inf, 
    "train_batch_size": 2, 
    "train_micro_batch_size_per_gpu": 1, 
    "wall_clock_breakdown": false
}
[INFO|trainer.py:1682] 2023-07-24 10:39:47,947 >> ***** Running training *****
[INFO|trainer.py:1683] 2023-07-24 10:39:47,947 >>   Num examples = 30
[INFO|trainer.py:1684] 2023-07-24 10:39:47,947 >>   Num Epochs = 3
[INFO|trainer.py:1685] 2023-07-24 10:39:47,947 >>   Instantaneous batch size per device = 1
[INFO|trainer.py:1688] 2023-07-24 10:39:47,947 >>   Total train batch size (w. parallel, distributed & accumulation) = 2
[INFO|trainer.py:1689] 2023-07-24 10:39:47,947 >>   Gradient Accumulation steps = 1
[INFO|trainer.py:1690] 2023-07-24 10:39:47,947 >>   Total optimization steps = 45
[INFO|trainer.py:1691] 2023-07-24 10:39:47,947 >>   Number of trainable parameters = 124,439,808
  0%|                                                                                                   | 0/45 [00:00<?, ?it/s][WARNING|logging.py:295] 2023-07-24 10:39:48,027 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|logging.py:295] 2023-07-24 10:39:48,027 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
 22%|████████████████████                                                                      | 10/45 [00:05<00:15,  2.27it/s][INFO|trainer.py:3081] 2023-07-24 10:39:53,150 >> ***** Running Evaluation *****
[INFO|trainer.py:3083] 2023-07-24 10:39:53,150 >>   Num examples = 10
[INFO|trainer.py:3086] 2023-07-24 10:39:53,151 >>   Batch size = 1
{'eval_loss': 3.356262683868408, 'eval_accuracy': 0.3947162426614481, 'eval_runtime': 0.5527, 'eval_samples_per_second': 18.092, 'eval_steps_per_second': 9.046, 'epoch': 0.67}                                                                               
 22%|████████████████████                                                                      | 10/45 [00:05<00:15,  2.27it/s[INFO|trainer.py:2807] 2023-07-24 10:39:53,991 >> Saving model checkpoint to /tmp/test-clm/checkpoint-10                        
[INFO|configuration_utils.py:458] 2023-07-24 10:39:53,991 >> Configuration saved in /tmp/test-clm/checkpoint-10/config.json
[INFO|configuration_utils.py:379] 2023-07-24 10:39:53,992 >> Configuration saved in /tmp/test-clm/checkpoint-10/generation_config.json
[INFO|modeling_utils.py:1855] 2023-07-24 10:39:54,649 >> Model weights saved in /tmp/test-clm/checkpoint-10/pytorch_model.bin
[INFO|tokenization_utils_base.py:2210] 2023-07-24 10:39:54,650 >> tokenizer config file saved in /tmp/test-clm/checkpoint-10/tokenizer_config.json
[INFO|tokenization_utils_base.py:2217] 2023-07-24 10:39:54,650 >> Special tokens file saved in /tmp/test-clm/checkpoint-10/special_tokens_map.json
[2023-07-24 10:39:54,735] [INFO] [logging.py:96:log_dist] [Rank 0] [Torch] Checkpoint global_step10 is about to be saved!
/home/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/torch/nn/modules/module.py:1802: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
  warnings.warn(
/home/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/torch/nn/modules/module.py:1802: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
  warnings.warn(
[2023-07-24 10:39:54,738] [INFO] [logging.py:96:log_dist] [Rank 0] Saving model checkpoint: /tmp/test-clm/checkpoint-10/global_step10/zero_pp_rank_0_mp_rank_00_model_states.pt
[2023-07-24 10:39:54,738] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /tmp/test-clm/checkpoint-10/global_step10/zero_pp_rank_0_mp_rank_00_model_states.pt...
[2023-07-24 10:39:54,744] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /tmp/test-clm/checkpoint-10/global_step10/zero_pp_rank_0_mp_rank_00_model_states.pt.
[2023-07-24 10:39:54,744] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /tmp/test-clm/checkpoint-10/global_step10/zero_pp_rank_0_mp_rank_00_optim_states.pt...
[2023-07-24 10:39:57,379] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /tmp/test-clm/checkpoint-10/global_step10/zero_pp_rank_0_mp_rank_00_optim_states.pt.
[2023-07-24 10:39:57,379] [INFO] [engine.py:3285:_save_zero_checkpoint] zero checkpoint saved /tmp/test-clm/checkpoint-10/global_step10/zero_pp_rank_0_mp_rank_00_optim_states.pt
[2023-07-24 10:39:57,386] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step10 is ready now!
 44%|████████████████████████████████████████                                                  | 20/45 [00:13<00:12,  2.07it/s][INFO|trainer.py:3081] 2023-07-24 10:40:01,597 >> ***** Running Evaluation *****
[INFO|trainer.py:3083] 2023-07-24 10:40:01,598 >>   Num examples = 10
[INFO|trainer.py:3086] 2023-07-24 10:40:01,598 >>   Batch size = 1
{'eval_loss': 3.3019282817840576, 'eval_accuracy': 0.40371819960861055, 'eval_runtime': 0.3621, 'eval_samples_per_second': 27.618, 'eval_steps_per_second': 13.809, 'epoch': 1.33}                                                                            
 44%|████████████████████████████████████████                                                  | 20/45 [00:14<00:12,  2.07it/s[INFO|trainer.py:2807] 2023-07-24 10:40:02,302 >> Saving model checkpoint to /tmp/test-clm/checkpoint-20                        
[INFO|configuration_utils.py:458] 2023-07-24 10:40:02,303 >> Configuration saved in /tmp/test-clm/checkpoint-20/config.json
[INFO|configuration_utils.py:379] 2023-07-24 10:40:02,303 >> Configuration saved in /tmp/test-clm/checkpoint-20/generation_config.json
[INFO|modeling_utils.py:1855] 2023-07-24 10:40:02,971 >> Model weights saved in /tmp/test-clm/checkpoint-20/pytorch_model.bin
[INFO|tokenization_utils_base.py:2210] 2023-07-24 10:40:02,971 >> tokenizer config file saved in /tmp/test-clm/checkpoint-20/tokenizer_config.json
[INFO|tokenization_utils_base.py:2217] 2023-07-24 10:40:02,972 >> Special tokens file saved in /tmp/test-clm/checkpoint-20/special_tokens_map.json
[2023-07-24 10:40:03,063] [INFO] [logging.py:96:log_dist] [Rank 0] [Torch] Checkpoint global_step20 is about to be saved!
/home/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/torch/nn/modules/module.py:1802: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
  warnings.warn(
/home/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/torch/nn/modules/module.py:1802: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
  warnings.warn(
[2023-07-24 10:40:03,066] [INFO] [logging.py:96:log_dist] [Rank 0] Saving model checkpoint: /tmp/test-clm/checkpoint-20/global_step20/zero_pp_rank_0_mp_rank_00_model_states.pt
[2023-07-24 10:40:03,066] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /tmp/test-clm/checkpoint-20/global_step20/zero_pp_rank_0_mp_rank_00_model_states.pt...
[2023-07-24 10:40:03,080] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /tmp/test-clm/checkpoint-20/global_step20/zero_pp_rank_0_mp_rank_00_model_states.pt.
[2023-07-24 10:40:03,081] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /tmp/test-clm/checkpoint-20/global_step20/zero_pp_rank_0_mp_rank_00_optim_states.pt...
[2023-07-24 10:40:06,196] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /tmp/test-clm/checkpoint-20/global_step20/zero_pp_rank_0_mp_rank_00_optim_states.pt.
[2023-07-24 10:40:06,197] [INFO] [engine.py:3285:_save_zero_checkpoint] zero checkpoint saved /tmp/test-clm/checkpoint-20/global_step20/zero_pp_rank_0_mp_rank_00_optim_states.pt
[2023-07-24 10:40:06,204] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step20 is ready now!
 67%|████████████████████████████████████████████████████████████                              | 30/45 [00:22<00:07,  2.01it/s][INFO|trainer.py:3081] 2023-07-24 10:40:10,531 >> ***** Running Evaluation *****
[INFO|trainer.py:3083] 2023-07-24 10:40:10,531 >>   Num examples = 10
[INFO|trainer.py:3086] 2023-07-24 10:40:10,531 >>   Batch size = 1
{'eval_loss': 3.2902770042419434, 'eval_accuracy': 0.40332681017612526, 'eval_runtime': 0.4135, 'eval_samples_per_second': 24.186, 'eval_steps_per_second': 12.093, 'epoch': 2.0}                                                                             
 67%|████████████████████████████████████████████████████████████                              | 30/45 [00:22<00:07,  2.01it/s[INFO|trainer.py:2807] 2023-07-24 10:40:11,199 >> Saving model checkpoint to /tmp/test-clm/checkpoint-30                        
[INFO|configuration_utils.py:458] 2023-07-24 10:40:11,200 >> Configuration saved in /tmp/test-clm/checkpoint-30/config.json
[INFO|configuration_utils.py:379] 2023-07-24 10:40:11,200 >> Configuration saved in /tmp/test-clm/checkpoint-30/generation_config.json
[INFO|modeling_utils.py:1855] 2023-07-24 10:40:12,098 >> Model weights saved in /tmp/test-clm/checkpoint-30/pytorch_model.bin
[INFO|tokenization_utils_base.py:2210] 2023-07-24 10:40:12,098 >> tokenizer config file saved in /tmp/test-clm/checkpoint-30/tokenizer_config.json
[INFO|tokenization_utils_base.py:2217] 2023-07-24 10:40:12,098 >> Special tokens file saved in /tmp/test-clm/checkpoint-30/special_tokens_map.json
[2023-07-24 10:40:12,188] [INFO] [logging.py:96:log_dist] [Rank 0] [Torch] Checkpoint global_step30 is about to be saved!
/home/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/torch/nn/modules/module.py:1802: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
  warnings.warn(
[2023-07-24 10:40:12,191] [INFO] [logging.py:96:log_dist] [Rank 0] Saving model checkpoint: /tmp/test-clm/checkpoint-30/global_step30/zero_pp_rank_0_mp_rank_00_model_states.pt
[2023-07-24 10:40:12,191] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /tmp/test-clm/checkpoint-30/global_step30/zero_pp_rank_0_mp_rank_00_model_states.pt...
[2023-07-24 10:40:12,197] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /tmp/test-clm/checkpoint-30/global_step30/zero_pp_rank_0_mp_rank_00_model_states.pt.
[2023-07-24 10:40:12,198] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /tmp/test-clm/checkpoint-30/global_step30/zero_pp_rank_0_mp_rank_00_optim_states.pt...
[2023-07-24 10:40:15,492] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /tmp/test-clm/checkpoint-30/global_step30/zero_pp_rank_0_mp_rank_00_optim_states.pt.
[2023-07-24 10:40:15,492] [INFO] [engine.py:3285:_save_zero_checkpoint] zero checkpoint saved /tmp/test-clm/checkpoint-30/global_step30/zero_pp_rank_0_mp_rank_00_optim_states.pt
[2023-07-24 10:40:15,499] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step30 is ready now!
 89%|████████████████████████████████████████████████████████████████████████████████          | 40/45 [00:31<00:02,  2.02it/s]
[INFO|trainer.py:3081] 2023-07-24 10:40:19,832 >> ***** Running Evaluation *****
[INFO|trainer.py:3083] 2023-07-24 10:40:19,832 >>   Num examples = 10
[INFO|trainer.py:3086] 2023-07-24 10:40:19,832 >>   Batch size = 1
{'eval_loss': 3.3038055896759033, 'eval_accuracy': 0.40136986301369865, 'eval_runtime': 0.4144, 'eval_samples_per_second': 24.13, 'eval_steps_per_second': 12.065, 'epoch': 2.67}                                                                             
 89%|████████████████████████████████████████████████████████████████████████████████          | 40/45 [00:32<00:02,  2.02it/s]
[INFO|trainer.py:2807] 2023-07-24 10:40:20,497 >> Saving model checkpoint to /tmp/test-clm/checkpoint-40
[INFO|configuration_utils.py:458] 2023-07-24 10:40:20,497 >> Configuration saved in /tmp/test-clm/checkpoint-40/config.json
[INFO|configuration_utils.py:379] 2023-07-24 10:40:20,498 >> Configuration saved in /tmp/test-clm/checkpoint-40/generation_config.json
[INFO|modeling_utils.py:1855] 2023-07-24 10:40:21,169 >> Model weights saved in /tmp/test-clm/checkpoint-40/pytorch_model.bin
[INFO|tokenization_utils_base.py:2210] 2023-07-24 10:40:21,169 >> tokenizer config file saved in /tmp/test-clm/checkpoint-40/tokenizer_config.json
[INFO|tokenization_utils_base.py:2217] 2023-07-24 10:40:21,169 >> Special tokens file saved in /tmp/test-clm/checkpoint-40/special_tokens_map.json
[2023-07-24 10:40:21,259] [INFO] [logging.py:96:log_dist] [Rank 0] [Torch] Checkpoint global_step40 is about to be saved!
/home/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/torch/nn/modules/module.py:1802: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
  warnings.warn(
[2023-07-24 10:40:21,262] [INFO] [logging.py:96:log_dist] [Rank 0] Saving model checkpoint: /tmp/test-clm/checkpoint-40/global_step40/zero_pp_rank_0_mp_rank_00_model_states.pt
[2023-07-24 10:40:21,262] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /tmp/test-clm/checkpoint-40/global_step40/zero_pp_rank_0_mp_rank_00_model_states.pt...
[2023-07-24 10:40:21,268] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /tmp/test-clm/checkpoint-40/global_step40/zero_pp_rank_0_mp_rank_00_model_states.pt.
[2023-07-24 10:40:21,268] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /tmp/test-clm/checkpoint-40/global_step40/zero_pp_rank_0_mp_rank_00_optim_states.pt...
[2023-07-24 10:40:23,964] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /tmp/test-clm/checkpoint-40/global_step40/zero_pp_rank_0_mp_rank_00_optim_states.pt.
[2023-07-24 10:40:23,964] [INFO] [engine.py:3285:_save_zero_checkpoint] zero checkpoint saved /tmp/test-clm/checkpoint-40/global_step40/zero_pp_rank_0_mp_rank_00_optim_states.pt
[2023-07-24 10:40:23,971] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step40 is ready now!
100%|██████████████████████████████████████████████████████████████████████████████████████████| 45/45 [00:38<00:00,  1.37it/s]
[INFO|trainer.py:1930] 2023-07-24 10:40:26,063 >> 

Training completed. Do not forget to share your model on huggingface.co/models =)

[INFO|trainer.py:2089] 2023-07-24 10:40:26,063 >> Loading best model from /tmp/test-clm/checkpoint-30 (score: 3.2902770042419434).
[INFO|deepspeed.py:381] 2023-07-24 10:40:26,063 >> Attempting to resume from /tmp/test-clm/checkpoint-30
[2023-07-24 10:40:26,073] [INFO] [torch_checkpoint_engine.py:27:load] [Torch] Loading checkpoint from /tmp/test-clm/checkpoint-30/global_step30/zero_pp_rank_0_mp_rank_00_model_states.pt...
[2023-07-24 10:40:26,077] [INFO] [torch_checkpoint_engine.py:29:load] [Torch] Loaded checkpoint from /tmp/test-clm/checkpoint-30/global_step30/zero_pp_rank_0_mp_rank_00_model_states.pt.
[2023-07-24 10:40:26,078] [INFO] [torch_checkpoint_engine.py:27:load] [Torch] Loading checkpoint from /tmp/test-clm/checkpoint-30/global_step30/zero_pp_rank_0_mp_rank_00_model_states.pt...
[2023-07-24 10:40:26,082] [INFO] [torch_checkpoint_engine.py:29:load] [Torch] Loaded checkpoint from /tmp/test-clm/checkpoint-30/global_step30/zero_pp_rank_0_mp_rank_00_model_states.pt.
[2023-07-24 10:40:26,086] [INFO] [torch_checkpoint_engine.py:27:load] [Torch] Loading checkpoint from /tmp/test-clm/checkpoint-30/global_step30/zero_pp_rank_0_mp_rank_00_optim_states.pt...
[2023-07-24 10:40:26,479] [INFO] [torch_checkpoint_engine.py:29:load] [Torch] Loaded checkpoint from /tmp/test-clm/checkpoint-30/global_step30/zero_pp_rank_0_mp_rank_00_optim_states.pt.
[2023-07-24 10:40:26,479] [INFO] [engine.py:2865:_get_all_zero_checkpoint_state_dicts] successfully read 2 ZeRO state_dicts for rank 0
[2023-07-24 10:40:26,605] [INFO] [engine.py:2815:_load_zero_checkpoint] loading 2 zero partition checkpoints for rank 0
{'train_runtime': 38.7307, 'train_samples_per_second': 2.324, 'train_steps_per_second': 1.162, 'train_loss': 3.3458041720920138, 'epoch': 3.0}
100%|██████████████████████████████████████████████████████████████████████████████████████████| 45/45 [00:38<00:00,  1.16it/s]
[INFO|trainer.py:2807] 2023-07-24 10:40:26,966 >> Saving model checkpoint to /tmp/test-clm
[INFO|configuration_utils.py:458] 2023-07-24 10:40:26,967 >> Configuration saved in /tmp/test-clm/config.json
[INFO|configuration_utils.py:379] 2023-07-24 10:40:26,967 >> Configuration saved in /tmp/test-clm/generation_config.json
[INFO|modeling_utils.py:1855] 2023-07-24 10:40:28,333 >> Model weights saved in /tmp/test-clm/pytorch_model.bin
[INFO|tokenization_utils_base.py:2210] 2023-07-24 10:40:28,333 >> tokenizer config file saved in /tmp/test-clm/tokenizer_config.json
[INFO|tokenization_utils_base.py:2217] 2023-07-24 10:40:28,333 >> Special tokens file saved in /tmp/test-clm/special_tokens_map.json
***** train metrics *****
  epoch                    =        3.0
  train_loss               =     3.3458
  train_runtime            = 0:00:38.73
  train_samples            =         30
  train_samples_per_second =      2.324
  train_steps_per_second   =      1.162
07/24/2023 10:40:28 - INFO - __main__ - *** Evaluate ***
[INFO|trainer.py:3081] 2023-07-24 10:40:28,418 >> ***** Running Evaluation *****
[INFO|trainer.py:3083] 2023-07-24 10:40:28,418 >>   Num examples = 10
[INFO|trainer.py:3086] 2023-07-24 10:40:28,418 >>   Batch size = 1
100%|████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 15.77it/s]
***** eval metrics *****
  epoch                   =        3.0
  eval_accuracy           =     0.4033
  eval_loss               =     3.2903
  eval_runtime            = 0:00:00.38
  eval_samples            =         10
  eval_samples_per_second =     26.017
  eval_steps_per_second   =     13.009
  perplexity              =    26.8503
[2023-07-24 10:40:30,989] [INFO] [launch.py:347:main] Process 1140775 exits successfully.
[2023-07-24 10:40:31,991] [INFO] [launch.py:347:main] Process 1140774 exits successfully.
Neo9061 commented 1 year ago

Thanks @pacman100! Which model are you using in the example above? Previously I was also able to run GPT-Neo models (relatively small) successfully, but I hit this issue with larger models like Falcon 7B and Llama 2 7B on g5.12xlarge.

pacman100 commented 1 year ago

Hello @Neo9061, the PR above, https://github.com/huggingface/transformers/pull/25057, should fix this; please confirm.

Neo9061 commented 1 year ago

Thanks @pacman100 for the quick fix! Just for my understanding: any insight into why I still hit the OOM error in my earlier investigation, when I used the code below from transformers 4.31.0? (For context, please see my post above. Thanks!)

In the meantime, I am testing your fix and will update this thread.

import glob

train_result = trainer.train()

# save_total_limit is set to 1, so only a single checkpoint directory remains
checkpoint_dirs = sorted(glob.glob("/opt/ml/model/checkpoint-*"))
checkpoint_path = checkpoint_dirs[0]

# reload the best checkpoint into the DeepSpeed-wrapped model,
# skipping the optimizer and LR scheduler states
load_path, _ = trainer.model_wrapped.load_checkpoint(
    checkpoint_path, load_optimizer_states=False, load_lr_scheduler_states=False
)

trainer.save_model()
pacman100 commented 1 year ago

Hello, see this issue: https://github.com/huggingface/accelerate/issues/1707

lzy37ld commented 1 year ago

Hi, sorry for a probably unrelated question here. If I want to save the model in fp16, what should I do? I know fp16 (AMP) accelerates training and saves memory in some cases, but the saved parameters are still fp32.

I just want to do something similar to the Llama model, whose parameters are in fp16, so that inference is faster.
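Something like the following is what I have in mind (a minimal sketch, assuming a plain from_pretrained/save_pretrained workflow outside of DeepSpeed; the paths are hypothetical):

import torch
from transformers import AutoModelForCausalLM

# Load the trained fp32 checkpoint in half precision and save it back,
# so that the weights on disk are fp16.
model = AutoModelForCausalLM.from_pretrained("/tmp/test-clm", torch_dtype=torch.float16)
model.save_pretrained("/tmp/test-clm-fp16")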

Neo9061 commented 1 year ago

Hi @pacman100, I still see the error using your branch of transformers. See the log below. Please let me know if there is anything else you'd like me to provide. Thanks!

A second thought: for evaluation/inference purposes I don't need the optimizer and LR scheduler states. Is there a way to skip saving them to save some memory? A sketch of what I mean is below.
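For illustration, something like this (a sketch, assuming intermediate checkpoints can be skipped entirely; with save_strategy="no", no optimizer or LR-scheduler states are ever written during training, and a final trainer.save_model() stores only the model weights, config, and tokenizer, though this of course forfeits load_best_model_at_end):

from transformers import TrainingArguments

# Hypothetical: disable mid-training checkpointing so no optimizer/scheduler
# states hit the disk; only the final save_model() output remains.
training_args = TrainingArguments(
    output_dir="/opt/ml/model",
    save_strategy="no",
    evaluation_strategy="epoch",
)
# ...build the Trainer with these args as usual, then:
# trainer.train()
# trainer.save_model()

Here is the log from the run with your branch: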

[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   bfloat16_enabled ............. True
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   checkpoint_parallel_write_pipeline  False
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   checkpoint_tag_validation_enabled  True
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   checkpoint_tag_validation_fail  False
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f9090172bf0>
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   communication_data_type ...... None
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   curriculum_enabled_legacy .... False
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   curriculum_params_legacy ..... False
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   data_efficiency_enabled ...... False
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   dataloader_drop_last ......... False
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   disable_allgather ............ False
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   dump_state ................... False
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   dynamic_loss_scale_args ...... None
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   eigenvalue_enabled ........... False
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   eigenvalue_gas_boundary_resolution  1
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   eigenvalue_layer_name ........ bert.encoder.layer
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   eigenvalue_layer_num ......... 0
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   eigenvalue_max_iter .......... 100
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   eigenvalue_stability ......... 1e-06
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   eigenvalue_tol ............... 0.01
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   eigenvalue_verbose ........... False
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   elasticity_enabled ........... False
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   flops_profiler_config ........ {
    "enabled": false, 
    "recompute_fwd_factor": 0.0, 
    "profile_step": 1, 
    "module_depth": -1, 
    "top_modules": 1, 
    "detailed": true, 
    "output_file": null
}
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   fp16_auto_cast ............... None
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   fp16_enabled ................. False
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   fp16_master_weights_and_gradients  False
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   global_rank .................. 0
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   grad_accum_dtype ............. None
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   gradient_accumulation_steps .. 2
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   gradient_clipping ............ 1.0
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   gradient_predivide_factor .... 1.0
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   initial_dynamic_scale ........ 1
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   load_universal_checkpoint .... False
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   loss_scale ................... 1.0
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   memory_breakdown ............. False
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   mics_hierarchial_params_gather  False
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   mics_shard_size .............. -1
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print]   nebula_config ................ {
    "enabled": false, 
    "persistent_storage_path": null, 
    "persistent_time_interval": 100, 
    "num_of_version_in_retention": 2, 
    "enable_nebula_load": true, 
    "load_path": null
}
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print]   optimizer_legacy_fusion ...... False
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print]   optimizer_name ............... adamw
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print]   optimizer_params ............. {'lr': 6e-06, 'betas': [0.9, 0.999], 'eps': 1e-08, 'weight_decay': 0.2}
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print]   pld_enabled .................. False
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print]   pld_params ................... False
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print]   prescale_gradients ........... False
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print]   scheduler_name ............... WarmupLR
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print]   scheduler_params ............. {'warmup_min_lr': 0, 'warmup_max_lr': 6e-06, 'warmup_num_steps': 2}
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print]   sparse_attention ............. None
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print]   sparse_gradients_enabled ..... False
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print]   steps_per_print .............. inf
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print]   train_batch_size ............. 16
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print]   train_micro_batch_size_per_gpu  2
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print]   use_node_local_storage ....... False
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print]   wall_clock_breakdown ......... False
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print]   world_size ................... 4
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print]   zero_allow_untested_optimizer  False
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print]   zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=16777216 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='cpu', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=False) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='cpu', nvme_path=None, buffer_count=4, pin_memory=False, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False) sub_group_size=1000000000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=15099494 param_persistence_threshold=40960 model_persistence_threshold=sys.maxsize max_live_parameters=1000000000 max_reuse_distance=1000000000 gather_16bit_weights_on_model_save=True ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print]   zero_enabled ................. True
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print]   zero_force_ds_cpu_optimizer .. True
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print]   zero_optimization_stage ...... 3
[2023-07-25 00:30:36,503] [INFO] [config.py:950:print_user_config]   json = {
    "fp16": {
        "enabled": false, 
        "loss_scale": 0, 
        "loss_scale_window": 1000, 
        "initial_scale_power": 12, 
        "hysteresis": 2, 
        "min_loss_scale": 1
    }, 
    "bf16": {
        "enabled": true
    }, 
    "optimizer": {
        "type": "AdamW", 
        "params": {
            "lr": 6e-06, 
            "betas": [0.9, 0.999], 
            "eps": 1e-08, 
            "weight_decay": 0.2
        }
    }, 
    "scheduler": {
        "type": "WarmupLR", 
        "params": {
            "warmup_min_lr": 0, 
            "warmup_max_lr": 6e-06, 
            "warmup_num_steps": 2
        }
    }, 
    "zero_optimization": {
        "stage": 3, 
        "offload_optimizer": {
            "device": "cpu", 
            "pin_memory": false
        }, 
        "offload_param": {
            "device": "cpu", 
            "pin_memory": false
        }, 
        "overlap_comm": true, 
        "contiguous_gradients": true, 
        "sub_group_size": 1.000000e+09, 
        "reduce_bucket_size": 1.677722e+07, 
        "stage3_prefetch_bucket_size": 1.509949e+07, 
        "stage3_param_persistence_threshold": 4.096000e+04, 
        "stage3_max_live_parameters": 1.000000e+09, 
        "stage3_max_reuse_distance": 1.000000e+09, 
        "stage3_gather_fp16_weights_on_model_save": true
    }, 
    "gradient_accumulation_steps": 2, 
    "gradient_clipping": 1.0, 
    "steps_per_print": inf, 
    "train_batch_size": 16, 
    "train_micro_batch_size_per_gpu": 2, 
    "wall_clock_breakdown": false
}
[INFO|trainer.py:1682] 2023-07-25 00:30:36,503 >> ***** Running training *****
[INFO|trainer.py:1683] 2023-07-25 00:30:36,503 >>   Num examples = 180
[INFO|trainer.py:1684] 2023-07-25 00:30:36,503 >>   Num Epochs = 1
[INFO|trainer.py:1685] 2023-07-25 00:30:36,504 >>   Instantaneous batch size per device = 2
[INFO|trainer.py:1688] 2023-07-25 00:30:36,504 >>   Total train batch size (w. parallel, distributed & accumulation) = 16
[INFO|trainer.py:1689] 2023-07-25 00:30:36,504 >>   Gradient Accumulation steps = 2
[INFO|trainer.py:1690] 2023-07-25 00:30:36,504 >>   Total optimization steps = 11
[INFO|trainer.py:1691] 2023-07-25 00:30:36,505 >>   Number of trainable parameters = 6,738,448,384
0%|          | 0/11 [00:00<?, ?it/s]
[WARNING|logging.py:280] 2023-07-25 00:30:36,510 >> You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
07/25/2023 00:31:11 - INFO - __main__ -   !!!!!!At this step throughput is 0.45318892143877243
9%|▉         | 1/11 [00:35<05:53, 35.31s/it]
07/25/2023 00:31:42 - INFO - __main__ -   !!!!!!At this step throughput is 0.47042510136622717
18%|█▊        | 2/11 [01:05<04:51, 32.37s/it]
07/25/2023 00:32:13 - INFO - __main__ -   !!!!!!At this step throughput is 0.47886025282245415
27%|██▋       | 3/11 [01:36<04:14, 31.84s/it]
07/25/2023 00:32:44 - INFO - __main__ -   !!!!!!At this step throughput is 0.4844130442539049
36%|███▋      | 4/11 [02:07<03:40, 31.47s/it]
07/25/2023 00:33:15 - INFO - __main__ -   !!!!!!At this step throughput is 0.4884299545826904
45%|████▌     | 5/11 [02:38<03:07, 31.24s/it]
07/25/2023 00:33:45 - INFO - __main__ -   !!!!!!At this step throughput is 0.4916091094101314
55%|█████▍    | 6/11 [03:09<02:35, 31.02s/it]
07/25/2023 00:34:17 - INFO - __main__ -   !!!!!!At this step throughput is 0.49364129923765976
64%|██████▎   | 7/11 [03:41<02:05, 31.42s/it]
07/25/2023 00:34:48 - INFO - __main__ -   !!!!!!At this step throughput is 0.4954246781847558
73%|███████▎  | 8/11 [04:12<01:33, 31.16s/it]
07/25/2023 00:35:18 - INFO - __main__ -   !!!!!!At this step throughput is 0.4971914292369494
82%|████████▏ | 9/11 [04:41<01:01, 30.68s/it]
07/25/2023 00:35:48 - INFO - __main__ -   !!!!!!At this step throughput is 0.49877618579058647
91%|█████████ | 10/11 [05:11<00:30, 30.55s/it]
{'loss': 1.7188, 'learning_rate': 6e-06, 'epoch': 0.87}
91%|█████████ | 10/11 [05:11<00:30, 30.55s/it]
[INFO|trainer.py:3080] 2023-07-25 00:35:48,400 >> ***** Running Evaluation *****
[INFO|trainer.py:3082] 2023-07-25 00:35:48,400 >>   Num examples = 20
[INFO|trainer.py:3085] 2023-07-25 00:35:48,400 >>   Batch size = 8
  0%|          | 0/1 [00:00<?, ?it/s]
{'eval_loss': 1.104188323020935, 'eval_runtime': 3.1127, 'eval_samples_per_second': 6.425, 'eval_steps_per_second': 0.321, 'epoch': 0.87}
91%|█████████ | 10/11 [05:15<00:30, 30.55s/it]
100%|██████████| 1/1 [00:00<00:00, 1080.45it/s]
[INFO|trainer.py:2806] 2023-07-25 00:36:03,394 >> Saving model checkpoint to /opt/ml/model/checkpoint-10
[INFO|configuration_utils.py:458] 2023-07-25 00:36:03,394 >> Configuration saved in /opt/ml/model/checkpoint-10/config.json
[INFO|configuration_utils.py:379] 2023-07-25 00:36:03,395 >> Configuration saved in /opt/ml/model/checkpoint-10/generation_config.json
[INFO|modeling_utils.py:1863] 2023-07-25 00:36:15,055 >> The model is bigger than the maximum size per checkpoint (10GB) and is going to be split in 2 checkpoint shards. You can find where each parameters has been saved in the index located at /opt/ml/model/checkpoint-10/pytorch_model.bin.index.json.
[INFO|tokenization_utils_base.py:2210] 2023-07-25 00:36:15,055 >> tokenizer config file saved in /opt/ml/model/checkpoint-10/tokenizer_config.json
[INFO|tokenization_utils_base.py:2217] 2023-07-25 00:36:15,055 >> Special tokens file saved in /opt/ml/model/checkpoint-10/special_tokens_map.json
[2023-07-25 00:36:15,659] [INFO] [logging.py:96:log_dist] [Rank 0] [Torch] Checkpoint global_step10 is about to be saved!
/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1802: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
  warnings.warn(
[2023-07-25 00:36:15,675] [INFO] [logging.py:96:log_dist] [Rank 0] Saving model checkpoint: /opt/ml/model/checkpoint-10/global_step10/zero_pp_rank_0_mp_rank_00_model_states.pt
[2023-07-25 00:36:15,675] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /opt/ml/model/checkpoint-10/global_step10/zero_pp_rank_0_mp_rank_00_model_states.pt...
[2023-07-25 00:36:15,689] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /opt/ml/model/checkpoint-10/global_step10/zero_pp_rank_0_mp_rank_00_model_states.pt.
[2023-07-25 00:36:15,689] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /opt/ml/model/checkpoint-10/global_step10/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt...
[2023-07-25 00:37:16,991] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /opt/ml/model/checkpoint-10/global_step10/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt.
[2023-07-25 00:37:16,992] [INFO] [engine.py:3285:_save_zero_checkpoint] zero checkpoint saved /opt/ml/model/checkpoint-10/global_step10/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
[2023-07-25 00:37:17,699] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step10 is ready now!
07/25/2023 00:37:49 - INFO - __main__ -   !!!!!!At this step throughput is 0.49004957528181253
100%|██████████| 11/11 [07:12<00:00, 58.13s/it]
[INFO|trainer.py:1930] 2023-07-25 00:37:49,056 >> 
Training completed. Do not forget to share your model on huggingface.co/models =)
[INFO|trainer.py:2089] 2023-07-25 00:37:49,058 >> Loading best model from /opt/ml/model/checkpoint-10 (score: 1.104188323020935).
[INFO|deepspeed.py:381] 2023-07-25 00:37:49,060 >> Attempting to resume from /opt/ml/model/checkpoint-10
[2023-07-25 00:37:49,109] [INFO] [torch_checkpoint_engine.py:27:load] [Torch] Loading checkpoint from /opt/ml/model/checkpoint-10/global_step10/zero_pp_rank_0_mp_rank_00_model_states.pt...
[2023-07-25 00:37:49,143] [INFO] [torch_checkpoint_engine.py:29:load] [Torch] Loaded checkpoint from /opt/ml/model/checkpoint-10/global_step10/zero_pp_rank_0_mp_rank_00_model_states.pt.
[2023-07-25 00:37:49,151] [INFO] [torch_checkpoint_engine.py:27:load] [Torch] Loading checkpoint from /opt/ml/model/checkpoint-10/global_step10/zero_pp_rank_0_mp_rank_00_model_states.pt...
[2023-07-25 00:37:49,161] [INFO] [torch_checkpoint_engine.py:29:load] [Torch] Loaded checkpoint from /opt/ml/model/checkpoint-10/global_step10/zero_pp_rank_0_mp_rank_00_model_states.pt.
[2023-07-25 00:37:49,180] [INFO] [torch_checkpoint_engine.py:27:load] [Torch] Loading checkpoint from /opt/ml/model/checkpoint-10/global_step10/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt...
[2023-07-25 00:38:05,103] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 230
[2023-07-25 00:38:08,243] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 231
[2023-07-25 00:38:08,243] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 232
[2023-07-25 00:38:11,500] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 233
Neo9061 commented 1 year ago

Second thought: how can I avoid loading the best model inside the Trainer and implement it outside the Trainer instead, like this line in run_clm.py: https://github.com/philschmid/huggingface-llama-2-samples/blob/18838c203285e7eefa2169e5413db4b8e8013a02/training/scripts/run_clm.py#L238
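Something along these lines is what I mean (a rough sketch, assuming `trainer` is the Trainer built earlier in run_clm.py with load_best_model_at_end=False, and that stage3_gather_16bit_weights_on_model_save stays enabled so trainer.save_model() writes consolidated weights; paths are hypothetical):

from transformers import AutoModelForCausalLM

# Train without load_best_model_at_end so the Trainer never tries to
# reload the checkpoint through DeepSpeed, then consolidate and save manually.
trainer.train()
trainer.save_model("/opt/ml/model")  # gathers the ZeRO-3 shards into a regular checkpoint

# Outside the Trainer: load the consolidated weights for evaluation/inference.
model = AutoModelForCausalLM.from_pretrained("/opt/ml/model")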

Neo9061 commented 1 year ago

Hi @pacman100, a gentle bump on the issue above; please let me know if there is anything more I can provide to help you root-cause it. Thanks a lot!

pacman100 commented 1 year ago

Hello, see this issue: https://github.com/huggingface/accelerate/issues/1707

As mentioned there, this is the underlying issue, and it isn't related to the transformers DeepSpeed integration. Please follow up with the DeepSpeed team.