huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Llama-2 7B fine-tuning with DeepSpeed: OOM error while loading the best model at the end when `load_best_model_at_end` is True. #25027

Closed: Neo9061 closed this issue 1 year ago

Neo9061 commented 1 year ago

System Info

Hi Community!

I am using run_clm.py with DeepSpeed to fine-tune Llama-2 7B on a g5.12xlarge EC2 instance (4 GPUs with 96 GB total GPU memory, and 48 vCPUs with 192 GB of RAM).

The model trains successfully until the very last step of loading the best model. Because the load_best_model_at_end argument is True, trainer.py uses the DeepSpeed engine to load the model, and that load goes OOM.
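
For reference, my Trainer setup looks roughly like the sketch below. The paths and step counts are illustrative rather than my exact values (those come from my run_clm.py invocation), and it assumes a local ZeRO-3 DeepSpeed config file named ds_config_zero3.json:

from transformers import TrainingArguments

# Minimal sketch of the relevant arguments; values are illustrative.
args = TrainingArguments(
    output_dir="/opt/ml/model",
    deepspeed="ds_config_zero3.json",  # assumed file name: ZeRO-3 with CPU offload
    evaluation_strategy="steps",
    eval_steps=10,
    save_strategy="steps",
    save_steps=10,
    save_total_limit=1,
    load_best_model_at_end=True,  # the flag that triggers the failing load at the end
)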

Who can help?

@sgugger @pacman100 @ArthurZucker and @younesbelkada

Reproduction

During the load-best-model stage, I saw an odd printout of the DeepSpeed initialization log that normally only appears at the beginning of training. I then verified this and found this line in trainer.py, which made me think the OOM comes from an unnecessary call to the DeepSpeed init function (as the comment above that code also suggests).
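
Paraphrasing from memory what that code path in 4.28.x looks like (the exact code may differ; this is only to illustrate the suspicion):

# Hedged paraphrase of the DeepSpeed branch in Trainer._load_best_model (4.28.x),
# not the exact source: loading the best model goes back through deepspeed_init,
# which rebuilds the whole engine instead of only reloading the weights.
if self.deepspeed:
    deepspeed_engine, optimizer, lr_scheduler = deepspeed_init(
        self,
        num_training_steps=self.args.max_steps,
        resume_from_checkpoint=self.state.best_model_checkpoint,
    )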

Next, I used the latest version of Transformers, 4.31.0, since I saw it no longer uses DeepSpeed init to load the best model (line and DeepSpeed loading function). Then I hit a Llama 2 configuration bug; see below. I don't know why, during loading of the best model, this DeepSpeed line is not triggered but this line is.

[INFO|trainer.py:1934] 2023-07-24 02:37:09,873 >> 
Training completed. Do not forget to share your model on huggingface.co/models =)
[INFO|trainer.py:2093] 2023-07-24 02:37:09,889 >> Loading best model from /opt/ml/model/checkpoint-10 (score: 1.4037604331970215).
Traceback (most recent call last):
  File "/opt/ml/code/run_clm.py", line 229, in <module>
    main()
  File "/opt/ml/code/run_clm.py", line 178, in main
    train_result = trainer.train()  # load model/optimizer/scheduler states
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1944, in _inner_training_loop
    self._load_best_model()
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2168, in _load_best_model
    load_result = load_sharded_checkpoint(
  File "/opt/conda/lib/python3.10/site-packages/transformers/modeling_utils.py", line 431, in load_sharded_checkpoint
    model.load_state_dict(state_dict, strict=False)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for LlamaForCausalLM:
    size mismatch for model.layers.24.self_attn.q_proj.weight: copying a param with shape torch.Size([4096, 4096]) from checkpoint, the shape in current model is torch.Size([0]).
    [... identical size-mismatch lines repeated for every q/k/v/o_proj and gate/up/down_proj weight of layers 24-31 ...]
    size mismatch for lm_head.weight: copying a param with shape torch.Size([32004, 4096]) from checkpoint, the shape in current model is torch.Size([0]).
[2023-07-24 02:37:16,892] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 164
[2023-07-24 02:37:20,147] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 165
Traceback (most recent call last):
  File "/opt/ml/code/run_clm.py", line 229, in <module>
    main()
  File "/opt/ml/code/run_clm.py", line 178, in main
    train_result = trainer.train()  # load model/optimizer/scheduler states
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1944, in _inner_training_loop
    self._load_best_model()
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2168, in _load_best_model
    load_result = load_sharded_checkpoint(
  File "/opt/conda/lib/python3.10/site-packages/transformers/modeling_utils.py", line 431, in load_sharded_checkpoint
    model.load_state_dict(state_dict, strict=False)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for LlamaForCausalLM:
    size mismatch for model.embed_tokens.weight: copying a param with shape torch.Size([32004, 4096]) from checkpoint, the shape in current model is torch.Size([0]).
    size mismatch for model.layers.0.self_attn.q_proj.weight: copying a param with shape torch.Size([4096, 4096]) from checkpoint, the shape in current model is torch.Size([0]).
    [... identical size-mismatch lines repeated for every q/k/v/o_proj and gate/up/down_proj weight of layers 0-23 ...]
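
If I understand ZeRO stage 3 correctly, those torch.Size([0]) shapes are expected rather than corruption: each rank holds only a shard of every parameter, so the live module reports empty tensors unless the parameters are gathered first. A hypothetical standalone sketch of what I mean (run under a deepspeed launcher with a ZeRO-3 training config; the config file name is assumed, and this is not the Trainer code path):

import deepspeed
import torch.nn as nn

# Hypothetical sketch: under ZeRO-3, each rank keeps only a shard of every
# parameter, so .shape reports torch.Size([0]) outside a gather context --
# which matches the size-mismatch errors above.
model = nn.Linear(4096, 4096)
engine, _, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config="ds_config_zero3.json",  # assumed ZeRO-3 config with an optimizer section
)

weight = engine.module.weight
print(weight.shape)  # torch.Size([0]) while partitioned

with deepspeed.zero.GatheredParameters([weight], modifier_rank=0):
    print(weight.shape)  # the full torch.Size([4096, 4096]) inside the gather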

Then I thought that, while waiting for HF to fix the Llama 2 configuration issue, I could take the latest best-model-loading code from transformers 4.31.0 and apply it to my setup with transformers 4.28.1.

Thus I disabled load_best_model_at_end and tried to load the best checkpoint after Trainer.train() with the following code.

import glob

train_result = trainer.train()

# Only one checkpoint directory exists because save_total_limit is set to 1.
checkpoint_dirs = sorted(glob.glob("/opt/ml/model/checkpoint-*"))
checkpoint_path = checkpoint_dirs[0]
load_path, _ = trainer.model_wrapped.load_checkpoint(
    checkpoint_path, load_optimizer_states=False, load_lr_scheduler_states=False
)

trainer.save_model()

I hit OOM when load_optimizer_states and load_lr_scheduler_states were True. Since the saved model is only used for evaluation/inference rather than for resuming training from the checkpoint, I figured I don't need the optimizer and LR scheduler states. However, even with both set to False, I still hit the error.
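
One alternative I have not fully verified would be to consolidate the ZeRO shards into a full fp32 state dict on CPU with DeepSpeed's zero_to_fp32 utilities, which should avoid gathering on GPU. Roughly (trainer and checkpoint_path as in the snippet above):

from deepspeed.utils.zero_to_fp32 import load_state_dict_from_zero_checkpoint

# Untested sketch: reconstruct the full fp32 weights on CPU from the ZeRO
# checkpoint shards, then save the consolidated model.
model = load_state_dict_from_zero_checkpoint(trainer.model, checkpoint_path)
trainer.save_model()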

Please advise what you think about this issue. Thanks!

Expected behavior

I expect the best model to load without an OOM error, since the model trains successfully right up to this final loading step.

pacman100 commented 1 year ago

Hello, I'm able to run the following minimal example without any issues:

export WANDB_DISABLED="true"
export CUDA_VISIBLE_DEVICES="0,1"
cd transformers
deepspeed --num_nodes 1 --num_gpus 2 --master_port 10999 \
    /home/sourab/transformers/examples/pytorch/language-modeling/run_clm.py \
    --model_name_or_path gpt2 \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --do_train \
    --do_eval \
    --max_train_samples 30 \
    --max_eval_samples 10 \
    --block_size 512 \
    --overwrite_output_dir \
    --gradient_checkpointing \
    --save_strategy "steps" \
    --evaluation_strategy "steps" \
    --eval_steps 10 \
    --save_steps 10 \
    --load_best_model_at_end \
    --output_dir /tmp/test-clm \
    --deepspeed /home/sourab/transformers/tests/deepspeed/ds_config_zero3.json

output:

[2023-07-24 10:39:47,947] [INFO] [config.py:950:print_user_config]   json = {
    "fp16": {
        "enabled": false, 
        "loss_scale": 0, 
        "loss_scale_window": 1000, 
        "initial_scale_power": 16, 
        "hysteresis": 2, 
        "min_loss_scale": 1
    }, 
    "bf16": {
        "enabled": false
    }, 
    "optimizer": {
        "type": "AdamW", 
        "params": {
            "lr": 5e-05, 
            "betas": [0.9, 0.999], 
            "eps": 1e-08, 
            "weight_decay": 0.0
        }
    }, 
    "scheduler": {
        "type": "WarmupLR", 
        "params": {
            "warmup_min_lr": 0, 
            "warmup_max_lr": 5e-05, 
            "warmup_num_steps": 0
        }
    }, 
    "zero_optimization": {
        "stage": 3, 
        "offload_optimizer": {
            "device": "cpu", 
            "pin_memory": true
        }, 
        "offload_param": {
            "device": "cpu", 
            "pin_memory": true
        }, 
        "overlap_comm": true, 
        "contiguous_gradients": true, 
        "sub_group_size": 1.000000e+09, 
        "reduce_bucket_size": 5.898240e+05, 
        "stage3_prefetch_bucket_size": 5.308416e+05, 
        "stage3_param_persistence_threshold": 7.680000e+03, 
        "stage3_max_live_parameters": 1.000000e+09, 
        "stage3_max_reuse_distance": 1.000000e+09, 
        "stage3_gather_16bit_weights_on_model_save": true
    }, 
    "gradient_accumulation_steps": 1, 
    "gradient_clipping": 1.0, 
    "steps_per_print": inf, 
    "train_batch_size": 2, 
    "train_micro_batch_size_per_gpu": 1, 
    "wall_clock_breakdown": false
}
[INFO|trainer.py:1682] 2023-07-24 10:39:47,947 >> ***** Running training *****
[INFO|trainer.py:1683] 2023-07-24 10:39:47,947 >>   Num examples = 30
[INFO|trainer.py:1684] 2023-07-24 10:39:47,947 >>   Num Epochs = 3
[INFO|trainer.py:1685] 2023-07-24 10:39:47,947 >>   Instantaneous batch size per device = 1
[INFO|trainer.py:1688] 2023-07-24 10:39:47,947 >>   Total train batch size (w. parallel, distributed & accumulation) = 2
[INFO|trainer.py:1689] 2023-07-24 10:39:47,947 >>   Gradient Accumulation steps = 1
[INFO|trainer.py:1690] 2023-07-24 10:39:47,947 >>   Total optimization steps = 45
[INFO|trainer.py:1691] 2023-07-24 10:39:47,947 >>   Number of trainable parameters = 124,439,808
  0%|                                                                                                   | 0/45 [00:00<?, ?it/s][WARNING|logging.py:295] 2023-07-24 10:39:48,027 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|logging.py:295] 2023-07-24 10:39:48,027 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
 22%|████████████████████                                                                      | 10/45 [00:05<00:15,  2.27it/s][INFO|trainer.py:3081] 2023-07-24 10:39:53,150 >> ***** Running Evaluation *****
[INFO|trainer.py:3083] 2023-07-24 10:39:53,150 >>   Num examples = 10
[INFO|trainer.py:3086] 2023-07-24 10:39:53,151 >>   Batch size = 1
{'eval_loss': 3.356262683868408, 'eval_accuracy': 0.3947162426614481, 'eval_runtime': 0.5527, 'eval_samples_per_second': 18.092, 'eval_steps_per_second': 9.046, 'epoch': 0.67}                                                                               
 22%|████████████████████                                                                      | 10/45 [00:05<00:15,  2.27it/s[INFO|trainer.py:2807] 2023-07-24 10:39:53,991 >> Saving model checkpoint to /tmp/test-clm/checkpoint-10                        
[INFO|configuration_utils.py:458] 2023-07-24 10:39:53,991 >> Configuration saved in /tmp/test-clm/checkpoint-10/config.json
[INFO|configuration_utils.py:379] 2023-07-24 10:39:53,992 >> Configuration saved in /tmp/test-clm/checkpoint-10/generation_config.json
[INFO|modeling_utils.py:1855] 2023-07-24 10:39:54,649 >> Model weights saved in /tmp/test-clm/checkpoint-10/pytorch_model.bin
[INFO|tokenization_utils_base.py:2210] 2023-07-24 10:39:54,650 >> tokenizer config file saved in /tmp/test-clm/checkpoint-10/tokenizer_config.json
[INFO|tokenization_utils_base.py:2217] 2023-07-24 10:39:54,650 >> Special tokens file saved in /tmp/test-clm/checkpoint-10/special_tokens_map.json
[2023-07-24 10:39:54,735] [INFO] [logging.py:96:log_dist] [Rank 0] [Torch] Checkpoint global_step10 is about to be saved!
/home/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/torch/nn/modules/module.py:1802: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
  warnings.warn(
/home/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/torch/nn/modules/module.py:1802: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
  warnings.warn(
[2023-07-24 10:39:54,738] [INFO] [logging.py:96:log_dist] [Rank 0] Saving model checkpoint: /tmp/test-clm/checkpoint-10/global_step10/zero_pp_rank_0_mp_rank_00_model_states.pt
[2023-07-24 10:39:54,738] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /tmp/test-clm/checkpoint-10/global_step10/zero_pp_rank_0_mp_rank_00_model_states.pt...
[2023-07-24 10:39:54,744] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /tmp/test-clm/checkpoint-10/global_step10/zero_pp_rank_0_mp_rank_00_model_states.pt.
[2023-07-24 10:39:54,744] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /tmp/test-clm/checkpoint-10/global_step10/zero_pp_rank_0_mp_rank_00_optim_states.pt...
[2023-07-24 10:39:57,379] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /tmp/test-clm/checkpoint-10/global_step10/zero_pp_rank_0_mp_rank_00_optim_states.pt.
[2023-07-24 10:39:57,379] [INFO] [engine.py:3285:_save_zero_checkpoint] zero checkpoint saved /tmp/test-clm/checkpoint-10/global_step10/zero_pp_rank_0_mp_rank_00_optim_states.pt
[2023-07-24 10:39:57,386] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step10 is ready now!
 44%|████████████████████████████████████████                                                  | 20/45 [00:13<00:12,  2.07it/s][INFO|trainer.py:3081] 2023-07-24 10:40:01,597 >> ***** Running Evaluation *****
[INFO|trainer.py:3083] 2023-07-24 10:40:01,598 >>   Num examples = 10
[INFO|trainer.py:3086] 2023-07-24 10:40:01,598 >>   Batch size = 1
{'eval_loss': 3.3019282817840576, 'eval_accuracy': 0.40371819960861055, 'eval_runtime': 0.3621, 'eval_samples_per_second': 27.618, 'eval_steps_per_second': 13.809, 'epoch': 1.33}                                                                            
 44%|████████████████████████████████████████                                                  | 20/45 [00:14<00:12,  2.07it/s[INFO|trainer.py:2807] 2023-07-24 10:40:02,302 >> Saving model checkpoint to /tmp/test-clm/checkpoint-20                        
[INFO|configuration_utils.py:458] 2023-07-24 10:40:02,303 >> Configuration saved in /tmp/test-clm/checkpoint-20/config.json
[INFO|configuration_utils.py:379] 2023-07-24 10:40:02,303 >> Configuration saved in /tmp/test-clm/checkpoint-20/generation_config.json
[INFO|modeling_utils.py:1855] 2023-07-24 10:40:02,971 >> Model weights saved in /tmp/test-clm/checkpoint-20/pytorch_model.bin
[INFO|tokenization_utils_base.py:2210] 2023-07-24 10:40:02,971 >> tokenizer config file saved in /tmp/test-clm/checkpoint-20/tokenizer_config.json
[INFO|tokenization_utils_base.py:2217] 2023-07-24 10:40:02,972 >> Special tokens file saved in /tmp/test-clm/checkpoint-20/special_tokens_map.json
[2023-07-24 10:40:03,063] [INFO] [logging.py:96:log_dist] [Rank 0] [Torch] Checkpoint global_step20 is about to be saved!
/home/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/torch/nn/modules/module.py:1802: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
  warnings.warn(
/home/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/torch/nn/modules/module.py:1802: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
  warnings.warn(
[2023-07-24 10:40:03,066] [INFO] [logging.py:96:log_dist] [Rank 0] Saving model checkpoint: /tmp/test-clm/checkpoint-20/global_step20/zero_pp_rank_0_mp_rank_00_model_states.pt
[2023-07-24 10:40:03,066] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /tmp/test-clm/checkpoint-20/global_step20/zero_pp_rank_0_mp_rank_00_model_states.pt...
[2023-07-24 10:40:03,080] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /tmp/test-clm/checkpoint-20/global_step20/zero_pp_rank_0_mp_rank_00_model_states.pt.
[2023-07-24 10:40:03,081] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /tmp/test-clm/checkpoint-20/global_step20/zero_pp_rank_0_mp_rank_00_optim_states.pt...
[2023-07-24 10:40:06,196] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /tmp/test-clm/checkpoint-20/global_step20/zero_pp_rank_0_mp_rank_00_optim_states.pt.
[2023-07-24 10:40:06,197] [INFO] [engine.py:3285:_save_zero_checkpoint] zero checkpoint saved /tmp/test-clm/checkpoint-20/global_step20/zero_pp_rank_0_mp_rank_00_optim_states.pt
[2023-07-24 10:40:06,204] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step20 is ready now!
 67%|████████████████████████████████████████████████████████████                              | 30/45 [00:22<00:07,  2.01it/s][INFO|trainer.py:3081] 2023-07-24 10:40:10,531 >> ***** Running Evaluation *****
[INFO|trainer.py:3083] 2023-07-24 10:40:10,531 >>   Num examples = 10
[INFO|trainer.py:3086] 2023-07-24 10:40:10,531 >>   Batch size = 1
{'eval_loss': 3.2902770042419434, 'eval_accuracy': 0.40332681017612526, 'eval_runtime': 0.4135, 'eval_samples_per_second': 24.186, 'eval_steps_per_second': 12.093, 'epoch': 2.0}                                                                             
 67%|████████████████████████████████████████████████████████████                              | 30/45 [00:22<00:07,  2.01it/s[INFO|trainer.py:2807] 2023-07-24 10:40:11,199 >> Saving model checkpoint to /tmp/test-clm/checkpoint-30                        
[INFO|configuration_utils.py:458] 2023-07-24 10:40:11,200 >> Configuration saved in /tmp/test-clm/checkpoint-30/config.json
[INFO|configuration_utils.py:379] 2023-07-24 10:40:11,200 >> Configuration saved in /tmp/test-clm/checkpoint-30/generation_config.json
[INFO|modeling_utils.py:1855] 2023-07-24 10:40:12,098 >> Model weights saved in /tmp/test-clm/checkpoint-30/pytorch_model.bin
[INFO|tokenization_utils_base.py:2210] 2023-07-24 10:40:12,098 >> tokenizer config file saved in /tmp/test-clm/checkpoint-30/tokenizer_config.json
[INFO|tokenization_utils_base.py:2217] 2023-07-24 10:40:12,098 >> Special tokens file saved in /tmp/test-clm/checkpoint-30/special_tokens_map.json
[2023-07-24 10:40:12,188] [INFO] [logging.py:96:log_dist] [Rank 0] [Torch] Checkpoint global_step30 is about to be saved!
/home/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/torch/nn/modules/module.py:1802: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
  warnings.warn(
[2023-07-24 10:40:12,191] [INFO] [logging.py:96:log_dist] [Rank 0] Saving model checkpoint: /tmp/test-clm/checkpoint-30/global_step30/zero_pp_rank_0_mp_rank_00_model_states.pt
[2023-07-24 10:40:12,191] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /tmp/test-clm/checkpoint-30/global_step30/zero_pp_rank_0_mp_rank_00_model_states.pt...
[2023-07-24 10:40:12,197] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /tmp/test-clm/checkpoint-30/global_step30/zero_pp_rank_0_mp_rank_00_model_states.pt.
[2023-07-24 10:40:12,198] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /tmp/test-clm/checkpoint-30/global_step30/zero_pp_rank_0_mp_rank_00_optim_states.pt...
[2023-07-24 10:40:15,492] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /tmp/test-clm/checkpoint-30/global_step30/zero_pp_rank_0_mp_rank_00_optim_states.pt.
[2023-07-24 10:40:15,492] [INFO] [engine.py:3285:_save_zero_checkpoint] zero checkpoint saved /tmp/test-clm/checkpoint-30/global_step30/zero_pp_rank_0_mp_rank_00_optim_states.pt
[2023-07-24 10:40:15,499] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step30 is ready now!
 89%|████████████████████████████████████████████████████████████████████████████████          | 40/45 [00:31<00:02,  2.02it/s]
[INFO|trainer.py:3081] 2023-07-24 10:40:19,832 >> ***** Running Evaluation *****
[INFO|trainer.py:3083] 2023-07-24 10:40:19,832 >>   Num examples = 10
[INFO|trainer.py:3086] 2023-07-24 10:40:19,832 >>   Batch size = 1
{'eval_loss': 3.3038055896759033, 'eval_accuracy': 0.40136986301369865, 'eval_runtime': 0.4144, 'eval_samples_per_second': 24.13, 'eval_steps_per_second': 12.065, 'epoch': 2.67}                                                                             
 89%|████████████████████████████████████████████████████████████████████████████████          | 40/45 [00:32<00:02,  2.02it/s]
[INFO|trainer.py:2807] 2023-07-24 10:40:20,497 >> Saving model checkpoint to /tmp/test-clm/checkpoint-40
[INFO|configuration_utils.py:458] 2023-07-24 10:40:20,497 >> Configuration saved in /tmp/test-clm/checkpoint-40/config.json
[INFO|configuration_utils.py:379] 2023-07-24 10:40:20,498 >> Configuration saved in /tmp/test-clm/checkpoint-40/generation_config.json
[INFO|modeling_utils.py:1855] 2023-07-24 10:40:21,169 >> Model weights saved in /tmp/test-clm/checkpoint-40/pytorch_model.bin
[INFO|tokenization_utils_base.py:2210] 2023-07-24 10:40:21,169 >> tokenizer config file saved in /tmp/test-clm/checkpoint-40/tokenizer_config.json
[INFO|tokenization_utils_base.py:2217] 2023-07-24 10:40:21,169 >> Special tokens file saved in /tmp/test-clm/checkpoint-40/special_tokens_map.json
[2023-07-24 10:40:21,259] [INFO] [logging.py:96:log_dist] [Rank 0] [Torch] Checkpoint global_step40 is about to be saved!
/home/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/torch/nn/modules/module.py:1802: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
  warnings.warn(
[2023-07-24 10:40:21,262] [INFO] [logging.py:96:log_dist] [Rank 0] Saving model checkpoint: /tmp/test-clm/checkpoint-40/global_step40/zero_pp_rank_0_mp_rank_00_model_states.pt
[2023-07-24 10:40:21,262] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /tmp/test-clm/checkpoint-40/global_step40/zero_pp_rank_0_mp_rank_00_model_states.pt...
[2023-07-24 10:40:21,268] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /tmp/test-clm/checkpoint-40/global_step40/zero_pp_rank_0_mp_rank_00_model_states.pt.
[2023-07-24 10:40:21,268] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /tmp/test-clm/checkpoint-40/global_step40/zero_pp_rank_0_mp_rank_00_optim_states.pt...
[2023-07-24 10:40:23,964] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /tmp/test-clm/checkpoint-40/global_step40/zero_pp_rank_0_mp_rank_00_optim_states.pt.
[2023-07-24 10:40:23,964] [INFO] [engine.py:3285:_save_zero_checkpoint] zero checkpoint saved /tmp/test-clm/checkpoint-40/global_step40/zero_pp_rank_0_mp_rank_00_optim_states.pt
[2023-07-24 10:40:23,971] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step40 is ready now!
100%|██████████████████████████████████████████████████████████████████████████████████████████| 45/45 [00:38<00:00,  1.37it/s]
[INFO|trainer.py:1930] 2023-07-24 10:40:26,063 >> 

Training completed. Do not forget to share your model on huggingface.co/models =)

[INFO|trainer.py:2089] 2023-07-24 10:40:26,063 >> Loading best model from /tmp/test-clm/checkpoint-30 (score: 3.2902770042419434).
[INFO|deepspeed.py:381] 2023-07-24 10:40:26,063 >> Attempting to resume from /tmp/test-clm/checkpoint-30
[2023-07-24 10:40:26,073] [INFO] [torch_checkpoint_engine.py:27:load] [Torch] Loading checkpoint from /tmp/test-clm/checkpoint-30/global_step30/zero_pp_rank_0_mp_rank_00_model_states.pt...
[2023-07-24 10:40:26,077] [INFO] [torch_checkpoint_engine.py:29:load] [Torch] Loaded checkpoint from /tmp/test-clm/checkpoint-30/global_step30/zero_pp_rank_0_mp_rank_00_model_states.pt.
[2023-07-24 10:40:26,078] [INFO] [torch_checkpoint_engine.py:27:load] [Torch] Loading checkpoint from /tmp/test-clm/checkpoint-30/global_step30/zero_pp_rank_0_mp_rank_00_model_states.pt...
[2023-07-24 10:40:26,082] [INFO] [torch_checkpoint_engine.py:29:load] [Torch] Loaded checkpoint from /tmp/test-clm/checkpoint-30/global_step30/zero_pp_rank_0_mp_rank_00_model_states.pt.
[2023-07-24 10:40:26,086] [INFO] [torch_checkpoint_engine.py:27:load] [Torch] Loading checkpoint from /tmp/test-clm/checkpoint-30/global_step30/zero_pp_rank_0_mp_rank_00_optim_states.pt...
[2023-07-24 10:40:26,479] [INFO] [torch_checkpoint_engine.py:29:load] [Torch] Loaded checkpoint from /tmp/test-clm/checkpoint-30/global_step30/zero_pp_rank_0_mp_rank_00_optim_states.pt.
[2023-07-24 10:40:26,479] [INFO] [engine.py:2865:_get_all_zero_checkpoint_state_dicts] successfully read 2 ZeRO state_dicts for rank 0
[2023-07-24 10:40:26,605] [INFO] [engine.py:2815:_load_zero_checkpoint] loading 2 zero partition checkpoints for rank 0
{'train_runtime': 38.7307, 'train_samples_per_second': 2.324, 'train_steps_per_second': 1.162, 'train_loss': 3.3458041720920138, 'epoch': 3.0}
100%|██████████████████████████████████████████████████████████████████████████████████████████| 45/45 [00:38<00:00,  1.16it/s]
[INFO|trainer.py:2807] 2023-07-24 10:40:26,966 >> Saving model checkpoint to /tmp/test-clm
[INFO|configuration_utils.py:458] 2023-07-24 10:40:26,967 >> Configuration saved in /tmp/test-clm/config.json
[INFO|configuration_utils.py:379] 2023-07-24 10:40:26,967 >> Configuration saved in /tmp/test-clm/generation_config.json
[INFO|modeling_utils.py:1855] 2023-07-24 10:40:28,333 >> Model weights saved in /tmp/test-clm/pytorch_model.bin
[INFO|tokenization_utils_base.py:2210] 2023-07-24 10:40:28,333 >> tokenizer config file saved in /tmp/test-clm/tokenizer_config.json
[INFO|tokenization_utils_base.py:2217] 2023-07-24 10:40:28,333 >> Special tokens file saved in /tmp/test-clm/special_tokens_map.json
***** train metrics *****
  epoch                    =        3.0
  train_loss               =     3.3458
  train_runtime            = 0:00:38.73
  train_samples            =         30
  train_samples_per_second =      2.324
  train_steps_per_second   =      1.162
07/24/2023 10:40:28 - INFO - __main__ - *** Evaluate ***
[INFO|trainer.py:3081] 2023-07-24 10:40:28,418 >> ***** Running Evaluation *****
[INFO|trainer.py:3083] 2023-07-24 10:40:28,418 >>   Num examples = 10
[INFO|trainer.py:3086] 2023-07-24 10:40:28,418 >>   Batch size = 1
100%|████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 15.77it/s]
***** eval metrics *****
  epoch                   =        3.0
  eval_accuracy           =     0.4033
  eval_loss               =     3.2903
  eval_runtime            = 0:00:00.38
  eval_samples            =         10
  eval_samples_per_second =     26.017
  eval_steps_per_second   =     13.009
  perplexity              =    26.8503
[2023-07-24 10:40:30,989] [INFO] [launch.py:347:main] Process 1140775 exits successfully.
[2023-07-24 10:40:31,991] [INFO] [launch.py:347:main] Process 1140774 exits successfully.
Neo9061 commented 1 year ago

Thanks @pacman100! Which model are you using in the example above? Previously I was also able to run GPT-Neo models (relatively small) successfully, but I hit this issue with larger models like Falcon 7B and Llama 2 7B on g5.12xlarge.

pacman100 commented 1 year ago

Hello @Neo9061, the PR above, https://github.com/huggingface/transformers/pull/25057, should fix this; please confirm.

Neo9061 commented 1 year ago

Thanks @pacman100 for the quick fix! Just for my understanding: any insight into why I still hit the OOM error in my earlier investigation, when I used the code below from transformers 4.31.0? (For context, please see my post above. Thanks!)

In the meantime, I am testing your fix and will update this thread.

import glob

train_result = trainer.train()

# save_total_limit is set to 1, so only a single checkpoint directory remains
checkpoint_dirs = sorted(glob.glob("/opt/ml/model/checkpoint-*"))
checkpoint_path = checkpoint_dirs[0]

# reload the best checkpoint into the DeepSpeed-wrapped model,
# skipping the optimizer and LR scheduler states
load_path, _ = trainer.model_wrapped.load_checkpoint(
    checkpoint_path, load_optimizer_states=False, load_lr_scheduler_states=False
)

trainer.save_model()
pacman100 commented 1 year ago

Hello, see this issue: https://github.com/huggingface/accelerate/issues/1707

lzy37ld commented 1 year ago

Hi, sorry for a probably unrelated question here. If I want to save the model in fp16, what should I do? I know fp16 (AMP) accelerates training and saves memory in some cases, but the saved parameters are still fp32.

I just want to do something similar to the Llama model, whose parameters are in fp16, so that inference is faster.
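Something like the following is what I have in mind (a minimal sketch, assuming a plain from_pretrained/save_pretrained workflow outside of DeepSpeed; the paths are hypothetical):

import torch
from transformers import AutoModelForCausalLM

# Load the trained fp32 checkpoint in half precision and save it back,
# so that the weights on disk are fp16.
model = AutoModelForCausalLM.from_pretrained("/tmp/test-clm", torch_dtype=torch.float16)
model.save_pretrained("/tmp/test-clm-fp16")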

Neo9061 commented 1 year ago

Hi @pacman100, I still see the error using your branch of transformers. See the log below. Please let me know if there is anything else you'd like me to provide. Thanks!

A second thought: for evaluation/inference purposes I don't need the optimizer and LR scheduler states. Is there a way to skip saving them to save some memory? A sketch of what I mean is below.
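For illustration, something like this (a sketch, assuming intermediate checkpoints can be skipped entirely; with save_strategy="no", no optimizer or LR-scheduler states are ever written during training, and a final trainer.save_model() stores only the model weights, config, and tokenizer, though this of course forfeits load_best_model_at_end):

from transformers import TrainingArguments

# Hypothetical: disable mid-training checkpointing so no optimizer/scheduler
# states hit the disk; only the final save_model() output remains.
training_args = TrainingArguments(
    output_dir="/opt/ml/model",
    save_strategy="no",
    evaluation_strategy="epoch",
)
# ...build the Trainer with these args as usual, then:
# trainer.train()
# trainer.save_model()

Here is the log from the run with your branch: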

[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   bfloat16_enabled ............. True
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   checkpoint_parallel_write_pipeline  False
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   checkpoint_tag_validation_enabled  True
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   checkpoint_tag_validation_fail  False
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f9090172bf0>
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   communication_data_type ...... None
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   curriculum_enabled_legacy .... False
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   curriculum_params_legacy ..... False
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   data_efficiency_enabled ...... False
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   dataloader_drop_last ......... False
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   disable_allgather ............ False
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   dump_state ................... False
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   dynamic_loss_scale_args ...... None
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   eigenvalue_enabled ........... False
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   eigenvalue_gas_boundary_resolution  1
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   eigenvalue_layer_name ........ bert.encoder.layer
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   eigenvalue_layer_num ......... 0
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   eigenvalue_max_iter .......... 100
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   eigenvalue_stability ......... 1e-06
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   eigenvalue_tol ............... 0.01
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   eigenvalue_verbose ........... False
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   elasticity_enabled ........... False
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   flops_profiler_config ........ {
    "enabled": false, 
    "recompute_fwd_factor": 0.0, 
    "profile_step": 1, 
    "module_depth": -1, 
    "top_modules": 1, 
    "detailed": true, 
    "output_file": null
}
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   fp16_auto_cast ............... None
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   fp16_enabled ................. False
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   fp16_master_weights_and_gradients  False
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   global_rank .................. 0
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   grad_accum_dtype ............. None
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   gradient_accumulation_steps .. 2
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   gradient_clipping ............ 1.0
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   gradient_predivide_factor .... 1.0
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   initial_dynamic_scale ........ 1
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   load_universal_checkpoint .... False
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   loss_scale ................... 1.0
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   memory_breakdown ............. False
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   mics_hierarchial_params_gather  False
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   mics_shard_size .............. -1
[2023-07-25 00:30:36,502] [INFO] [config.py:964:print]   monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print]   nebula_config ................ {
    "enabled": false, 
    "persistent_storage_path": null, 
    "persistent_time_interval": 100, 
    "num_of_version_in_retention": 2, 
    "enable_nebula_load": true, 
    "load_path": null
}
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print]   optimizer_legacy_fusion ...... False
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print]   optimizer_name ............... adamw
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print]   optimizer_params ............. {'lr': 6e-06, 'betas': [0.9, 0.999], 'eps': 1e-08, 'weight_decay': 0.2}
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print]   pld_enabled .................. False
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print]   pld_params ................... False
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print]   prescale_gradients ........... False
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print]   scheduler_name ............... WarmupLR
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print]   scheduler_params ............. {'warmup_min_lr': 0, 'warmup_max_lr': 6e-06, 'warmup_num_steps': 2}
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print]   sparse_attention ............. None
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print]   sparse_gradients_enabled ..... False
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print]   steps_per_print .............. inf
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print]   train_batch_size ............. 16
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print]   train_micro_batch_size_per_gpu  2
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print]   use_node_local_storage ....... False
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print]   wall_clock_breakdown ......... False
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print]   world_size ................... 4
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print]   zero_allow_untested_optimizer  False
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print]   zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=16777216 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='cpu', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=False) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='cpu', nvme_path=None, buffer_count=4, pin_memory=False, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False) sub_group_size=1000000000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=15099494 param_persistence_threshold=40960 model_persistence_threshold=sys.maxsize max_live_parameters=1000000000 max_reuse_distance=1000000000 gather_16bit_weights_on_model_save=True ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print]   zero_enabled ................. True
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print]   zero_force_ds_cpu_optimizer .. True
[2023-07-25 00:30:36,503] [INFO] [config.py:964:print]   zero_optimization_stage ...... 3
[2023-07-25 00:30:36,503] [INFO] [config.py:950:print_user_config]   json = {
    "fp16": {
        "enabled": false, 
        "loss_scale": 0, 
        "loss_scale_window": 1000, 
        "initial_scale_power": 12, 
        "hysteresis": 2, 
        "min_loss_scale": 1
    }, 
    "bf16": {
        "enabled": true
    }, 
    "optimizer": {
        "type": "AdamW", 
        "params": {
            "lr": 6e-06, 
            "betas": [0.9, 0.999], 
            "eps": 1e-08, 
            "weight_decay": 0.2
        }
    }, 
    "scheduler": {
        "type": "WarmupLR", 
        "params": {
            "warmup_min_lr": 0, 
            "warmup_max_lr": 6e-06, 
            "warmup_num_steps": 2
        }
    }, 
    "zero_optimization": {
        "stage": 3, 
        "offload_optimizer": {
            "device": "cpu", 
            "pin_memory": false
        }, 
        "offload_param": {
            "device": "cpu", 
            "pin_memory": false
        }, 
        "overlap_comm": true, 
        "contiguous_gradients": true, 
        "sub_group_size": 1.000000e+09, 
        "reduce_bucket_size": 1.677722e+07, 
        "stage3_prefetch_bucket_size": 1.509949e+07, 
        "stage3_param_persistence_threshold": 4.096000e+04, 
        "stage3_max_live_parameters": 1.000000e+09, 
        "stage3_max_reuse_distance": 1.000000e+09, 
        "stage3_gather_fp16_weights_on_model_save": true
    }, 
    "gradient_accumulation_steps": 2, 
    "gradient_clipping": 1.0, 
    "steps_per_print": inf, 
    "train_batch_size": 16, 
    "train_micro_batch_size_per_gpu": 2, 
    "wall_clock_breakdown": false
}
[INFO|trainer.py:1682] 2023-07-25 00:30:36,503 >> ***** Running training *****
[INFO|trainer.py:1683] 2023-07-25 00:30:36,503 >>   Num examples = 180
[INFO|trainer.py:1684] 2023-07-25 00:30:36,503 >>   Num Epochs = 1
[INFO|trainer.py:1685] 2023-07-25 00:30:36,504 >>   Instantaneous batch size per device = 2
[INFO|trainer.py:1688] 2023-07-25 00:30:36,504 >>   Total train batch size (w. parallel, distributed & accumulation) = 16
[INFO|trainer.py:1689] 2023-07-25 00:30:36,504 >>   Gradient Accumulation steps = 2
[INFO|trainer.py:1690] 2023-07-25 00:30:36,504 >>   Total optimization steps = 11
[INFO|trainer.py:1691] 2023-07-25 00:30:36,505 >>   Number of trainable parameters = 6,738,448,384
0%|          | 0/11 [00:00<?, ?it/s]
[WARNING|logging.py:280] 2023-07-25 00:30:36,510 >> You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
07/25/2023 00:31:11 - INFO - __main__ -   !!!!!!At this step throughput is 0.45318892143877243
9%|▉         | 1/11 [00:35<05:53, 35.31s/it]
07/25/2023 00:31:42 - INFO - __main__ -   !!!!!!At this step throughput is 0.47042510136622717
18%|█▊        | 2/11 [01:05<04:51, 32.37s/it]
07/25/2023 00:32:13 - INFO - __main__ -   !!!!!!At this step throughput is 0.47886025282245415
27%|██▋       | 3/11 [01:36<04:14, 31.84s/it]
07/25/2023 00:32:44 - INFO - __main__ -   !!!!!!At this step throughput is 0.4844130442539049
36%|███▋      | 4/11 [02:07<03:40, 31.47s/it]
07/25/2023 00:33:15 - INFO - __main__ -   !!!!!!At this step throughput is 0.4884299545826904
45%|████▌     | 5/11 [02:38<03:07, 31.24s/it]
07/25/2023 00:33:45 - INFO - __main__ -   !!!!!!At this step throughput is 0.4916091094101314
55%|█████▍    | 6/11 [03:09<02:35, 31.02s/it]
07/25/2023 00:34:17 - INFO - __main__ -   !!!!!!At this step throughput is 0.49364129923765976
64%|██████▎   | 7/11 [03:41<02:05, 31.42s/it]
07/25/2023 00:34:48 - INFO - __main__ -   !!!!!!At this step throughput is 0.4954246781847558
73%|███████▎  | 8/11 [04:12<01:33, 31.16s/it]
07/25/2023 00:35:18 - INFO - __main__ -   !!!!!!At this step throughput is 0.4971914292369494
82%|████████▏ | 9/11 [04:41<01:01, 30.68s/it]
07/25/2023 00:35:48 - INFO - __main__ -   !!!!!!At this step throughput is 0.49877618579058647
91%|█████████ | 10/11 [05:11<00:30, 30.55s/it]
{'loss': 1.7188, 'learning_rate': 6e-06, 'epoch': 0.87}
91%|█████████ | 10/11 [05:11<00:30, 30.55s/it]
[INFO|trainer.py:3080] 2023-07-25 00:35:48,400 >> ***** Running Evaluation *****
[INFO|trainer.py:3082] 2023-07-25 00:35:48,400 >>   Num examples = 20
[INFO|trainer.py:3085] 2023-07-25 00:35:48,400 >>   Batch size = 8
  0%|          | 0/1 [00:00<?, ?it/s]
{'eval_loss': 1.104188323020935, 'eval_runtime': 3.1127, 'eval_samples_per_second': 6.425, 'eval_steps_per_second': 0.321, 'epoch': 0.87}
91%|█████████ | 10/11 [05:15<00:30, 30.55s/it]
100%|██████████| 1/1 [00:00<00:00, 1080.45it/s]
[INFO|trainer.py:2806] 2023-07-25 00:36:03,394 >> Saving model checkpoint to /opt/ml/model/checkpoint-10
[INFO|configuration_utils.py:458] 2023-07-25 00:36:03,394 >> Configuration saved in /opt/ml/model/checkpoint-10/config.json
[INFO|configuration_utils.py:379] 2023-07-25 00:36:03,395 >> Configuration saved in /opt/ml/model/checkpoint-10/generation_config.json
[INFO|modeling_utils.py:1863] 2023-07-25 00:36:15,055 >> The model is bigger than the maximum size per checkpoint (10GB) and is going to be split in 2 checkpoint shards. You can find where each parameters has been saved in the index located at /opt/ml/model/checkpoint-10/pytorch_model.bin.index.json.
[INFO|tokenization_utils_base.py:2210] 2023-07-25 00:36:15,055 >> tokenizer config file saved in /opt/ml/model/checkpoint-10/tokenizer_config.json
[INFO|tokenization_utils_base.py:2217] 2023-07-25 00:36:15,055 >> Special tokens file saved in /opt/ml/model/checkpoint-10/special_tokens_map.json
[2023-07-25 00:36:15,659] [INFO] [logging.py:96:log_dist] [Rank 0] [Torch] Checkpoint global_step10 is about to be saved!
/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1802: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
  warnings.warn(
[2023-07-25 00:36:15,675] [INFO] [logging.py:96:log_dist] [Rank 0] Saving model checkpoint: /opt/ml/model/checkpoint-10/global_step10/zero_pp_rank_0_mp_rank_00_model_states.pt
[2023-07-25 00:36:15,675] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /opt/ml/model/checkpoint-10/global_step10/zero_pp_rank_0_mp_rank_00_model_states.pt...
[2023-07-25 00:36:15,689] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /opt/ml/model/checkpoint-10/global_step10/zero_pp_rank_0_mp_rank_00_model_states.pt.
[2023-07-25 00:36:15,689] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /opt/ml/model/checkpoint-10/global_step10/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt...
[2023-07-25 00:37:16,991] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /opt/ml/model/checkpoint-10/global_step10/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt.
[2023-07-25 00:37:16,992] [INFO] [engine.py:3285:_save_zero_checkpoint] zero checkpoint saved /opt/ml/model/checkpoint-10/global_step10/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
[2023-07-25 00:37:17,699] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step10 is ready now!
07/25/2023 00:37:49 - INFO - __main__ -   !!!!!!At this step throughput is 0.49004957528181253
100%|██████████| 11/11 [07:12<00:00, 58.13s/it]
[INFO|trainer.py:1930] 2023-07-25 00:37:49,056 >> 
Training completed. Do not forget to share your model on huggingface.co/models =)
[INFO|trainer.py:2089] 2023-07-25 00:37:49,058 >> Loading best model from /opt/ml/model/checkpoint-10 (score: 1.104188323020935).
[INFO|deepspeed.py:381] 2023-07-25 00:37:49,060 >> Attempting to resume from /opt/ml/model/checkpoint-10
[2023-07-25 00:37:49,109] [INFO] [torch_checkpoint_engine.py:27:load] [Torch] Loading checkpoint from /opt/ml/model/checkpoint-10/global_step10/zero_pp_rank_0_mp_rank_00_model_states.pt...
[2023-07-25 00:37:49,143] [INFO] [torch_checkpoint_engine.py:29:load] [Torch] Loaded checkpoint from /opt/ml/model/checkpoint-10/global_step10/zero_pp_rank_0_mp_rank_00_model_states.pt.
[2023-07-25 00:37:49,151] [INFO] [torch_checkpoint_engine.py:27:load] [Torch] Loading checkpoint from /opt/ml/model/checkpoint-10/global_step10/zero_pp_rank_0_mp_rank_00_model_states.pt...
[2023-07-25 00:37:49,161] [INFO] [torch_checkpoint_engine.py:29:load] [Torch] Loaded checkpoint from /opt/ml/model/checkpoint-10/global_step10/zero_pp_rank_0_mp_rank_00_model_states.pt.
[2023-07-25 00:37:49,180] [INFO] [torch_checkpoint_engine.py:27:load] [Torch] Loading checkpoint from /opt/ml/model/checkpoint-10/global_step10/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt...
[2023-07-25 00:38:05,103] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 230
[2023-07-25 00:38:08,243] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 231
[2023-07-25 00:38:08,243] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 232
[2023-07-25 00:38:11,500] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 233
Neo9061 commented 1 year ago

Second thought: how can I avoid loading the best model inside the Trainer and implement it outside the Trainer instead, like this line in run_clm.py: https://github.com/philschmid/huggingface-llama-2-samples/blob/18838c203285e7eefa2169e5413db4b8e8013a02/training/scripts/run_clm.py#L238
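Something along these lines is what I mean (a rough sketch, assuming `trainer` is the Trainer built earlier in run_clm.py with load_best_model_at_end=False, and that stage3_gather_16bit_weights_on_model_save stays enabled so trainer.save_model() writes consolidated weights; paths are hypothetical):

from transformers import AutoModelForCausalLM

# Train without load_best_model_at_end so the Trainer never tries to
# reload the checkpoint through DeepSpeed, then consolidate and save manually.
trainer.train()
trainer.save_model("/opt/ml/model")  # gathers the ZeRO-3 shards into a regular checkpoint

# Outside the Trainer: load the consolidated weights for evaluation/inference.
model = AutoModelForCausalLM.from_pretrained("/opt/ml/model")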

Neo9061 commented 1 year ago

Hi @pacman100, a gentle bump on the issue above; please let me know if there is anything more I can provide to help you root-cause it. Thanks a lot!

pacman100 commented 1 year ago

Hello, see this issue: https://github.com/huggingface/accelerate/issues/1707

As mentioned there, this is the underlying issue, and it isn't related to the transformers DeepSpeed integration. Please follow up with the DeepSpeed team.