microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

RuntimeError: Error(s) in loading state_dict #5570

Open lxd551326 opened 3 months ago

lxd551326 commented 3 months ago

Describe the bug
I can train Qwen1.5-7B with plain PyTorch, but when I use DeepSpeed the run fails with a CUDA out-of-memory error (OOM screenshot omitted). My ZeRO-2 config is:

{
"fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
},
"bf16": {
    "enabled": "auto"
},
"optimizer": {
    "type": "AdamW",
    "params": {
        "lr": "auto",
        "betas": "auto",
        "eps": "auto",
        "weight_decay": "auto"
    }
},

"scheduler": {
    "type": "WarmupLR",
    "params": {
        "warmup_min_lr": "auto",
        "warmup_max_lr": "auto",
        "warmup_num_steps": "auto"
    }
},

"zero_optimization": {
    "stage": 2,
    "offload_optimizer": {
        "device": "none",
        "pin_memory": true
    },
    "allgather_partitions": true,
    "allgather_bucket_size": 2e8,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 2e8,
    "contiguous_gradients": true
},

"gradient_accumulation_steps": "auto",
"gradient_clipping": "auto",
"steps_per_print": 100,
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"wall_clock_breakdown": false

}

Then I changed the offload setting to:

"offload_optimizer": {
    "device": "cpu",
    "pin_memory": true
},

With that change, resuming training fails with a different problem; the full traceback follows the diagnostic sketch below:
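A minimal diagnostic sketch for checking what the resumed checkpoint actually contains, assuming a standard ZeRO stage 1/2 checkpoint layout where global_step*/mp_rank_00_model_states.pt holds the full module state dict; the output/checkpoint-1000 path is a placeholder for the real checkpoint directory:

import glob
import torch

# Placeholder: point this at the checkpoint directory the trainer resumes from.
ckpt_dir = "output/checkpoint-1000"

# ZeRO stage 1/2 checkpoints normally keep the unsharded module weights in
# global_step*/mp_rank_00_model_states.pt under the "module" key.
state_file = sorted(glob.glob(f"{ckpt_dir}/global_step*/mp_rank_00_model_states.pt"))[-1]
state = torch.load(state_file, map_location="cpu")
module_sd = state["module"]

print(f"{len(module_sd)} tensors in checkpoint")
for name in ("model.embed_tokens.weight",
             "model.layers.0.self_attn.q_proj.weight",
             "model.layers.31.self_attn.q_proj.weight"):
    shape = tuple(module_sd[name].shape) if name in module_sd else "missing"
    print(name, shape)
# Qwen1.5-7B expects embed_tokens of shape (151936, 4096) and 32 decoder layers;
# (151936, 1024) tensors and absent layer 24-31 keys indicate the checkpoint
# was written by a smaller model than the one being fine-tuned.

If the printed shapes line up with the 1024-wide tensors in the error rather than Qwen1.5-7B's 4096-wide ones, the checkpoint being resumed was produced by a different, smaller model, which would be consistent with both the missing layer keys and the size mismatches in the traceback.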

[rank1]: Traceback (most recent call last):
[rank1]:   File "/home/ai/mydata/models/Qwen1.5-main/examples/sft/finetune.py", line 378, in <module>
[rank1]:     train()
[rank1]:   File "/home/ai/mydata/models/Qwen1.5-main/examples/sft/finetune.py", line 367, in train
[rank1]:     trainer.train(resume_from_checkpoint=True)
[rank1]:   File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1885, in train
[rank1]:     return inner_training_loop(
[rank1]:   File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2063, in _inner_training_loop
[rank1]:     deepspeed_load_checkpoint(
[rank1]:   File "/opt/conda/lib/python3.10/site-packages/transformers/integrations/deepspeed.py", line 432, in deepspeed_load_checkpoint
[rank1]:     load_path, _ = deepspeed_engine.load_checkpoint(
[rank1]:   File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2764, in load_checkpoint
[rank1]:     load_path, client_states = self._load_checkpoint(load_dir,
[rank1]:   File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2847, in _load_checkpoint
[rank1]:     self.load_module_state_dict(checkpoint=checkpoint,
[rank1]:   File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2627, in load_module_state_dict
[rank1]:     self.module.load_state_dict(
[rank1]:   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2189, in load_state_dict
[rank1]:     raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
[rank1]: RuntimeError: Error(s) in loading state_dict for Qwen2ForCausalLM:
[rank1]:        Missing key(s) in state_dict: "model.layers.24.self_attn.q_proj.weight", "model.layers.24.self_attn.q_proj.bias", "model.layers.24.self_attn.k_proj.weight", "model.layers.24.self_attn.k_proj.bias", "model.layers.24.self_attn.v_proj.weight", "model.layers.24.self_attn.v_proj.bias", "model.layers.24.self_attn.o_proj.weight", "model.layers.24.mlp.gate_proj.weight", "model.layers.24.mlp.up_proj.weight", "model.layers.24.mlp.down_proj.weight", "model.layers.24.input_layernorm.weight", "model.layers.24.post_attention_layernorm.weight", "model.layers.25.self_attn.q_proj.weight", "model.layers.25.self_attn.q_proj.bias", "model.layers.25.self_attn.k_proj.weight", "model.layers.25.self_attn.k_proj.bias", "model.layers.25.self_attn.v_proj.weight", "model.layers.25.self_attn.v_proj.bias", "model.layers.25.self_attn.o_proj.weight", "model.layers.25.mlp.gate_proj.weight", "model.layers.25.mlp.up_proj.weight", "model.layers.25.mlp.down_proj.weight", "model.layers.25.input_layernorm.weight", "model.layers.25.post_attention_layernorm.weight", "model.layers.26.self_attn.q_proj.weight", "model.layers.26.self_attn.q_proj.bias", "model.layers.26.self_attn.k_proj.weight", "model.layers.26.self_attn.k_proj.bias", "model.layers.26.self_attn.v_proj.weight", "model.layers.26.self_attn.v_proj.bias", "model.layers.26.self_attn.o_proj.weight", "model.layers.26.mlp.gate_proj.weight", "model.layers.26.mlp.up_proj.weight", "model.layers.26.mlp.down_proj.weight", "model.layers.26.input_layernorm.weight", "model.layers.26.post_attention_layernorm.weight", "model.layers.27.self_attn.q_proj.weight", "model.layers.27.self_attn.q_proj.bias", "model.layers.27.self_attn.k_proj.weight", "model.layers.27.self_attn.k_proj.bias", "model.layers.27.self_attn.v_proj.weight", "model.layers.27.self_attn.v_proj.bias", "model.layers.27.self_attn.o_proj.weight", "model.layers.27.mlp.gate_proj.weight", "model.layers.27.mlp.up_proj.weight", "model.layers.27.mlp.down_proj.weight", "model.layers.27.input_layernorm.weight", "model.layers.27.post_attention_layernorm.weight", "model.layers.28.self_attn.q_proj.weight", "model.layers.28.self_attn.q_proj.bias", "model.layers.28.self_attn.k_proj.weight", "model.layers.28.self_attn.k_proj.bias", "model.layers.28.self_attn.v_proj.weight", "model.layers.28.self_attn.v_proj.bias", "model.layers.28.self_attn.o_proj.weight", "model.layers.28.mlp.gate_proj.weight", "model.layers.28.mlp.up_proj.weight", "model.layers.28.mlp.down_proj.weight", "model.layers.28.input_layernorm.weight", "model.layers.28.post_attention_layernorm.weight", "model.layers.29.self_attn.q_proj.weight", "model.layers.29.self_attn.q_proj.bias", "model.layers.29.self_attn.k_proj.weight", "model.layers.29.self_attn.k_proj.bias", "model.layers.29.self_attn.v_proj.weight", "model.layers.29.self_attn.v_proj.bias", "model.layers.29.self_attn.o_proj.weight", "model.layers.29.mlp.gate_proj.weight", "model.layers.29.mlp.up_proj.weight", "model.layers.29.mlp.down_proj.weight", "model.layers.29.input_layernorm.weight", "model.layers.29.post_attention_layernorm.weight", "model.layers.30.self_attn.q_proj.weight", "model.layers.30.self_attn.q_proj.bias", "model.layers.30.self_attn.k_proj.weight", "model.layers.30.self_attn.k_proj.bias", "model.layers.30.self_attn.v_proj.weight", "model.layers.30.self_attn.v_proj.bias", "model.layers.30.self_attn.o_proj.weight", "model.layers.30.mlp.gate_proj.weight", "model.layers.30.mlp.up_proj.weight", "model.layers.30.mlp.down_proj.weight", "model.layers.30.input_layernorm.weight", 
"model.layers.30.post_attention_layernorm.weight", "model.layers.31.self_attn.q_proj.weight", "model.layers.31.self_attn.q_proj.bias", "model.layers.31.self_attn.k_proj.weight", "model.layers.31.self_attn.k_proj.bias", "model.layers.31.self_attn.v_proj.weight", "model.layers.31.self_attn.v_proj.bias", "model.layers.31.self_attn.o_proj.weight", "model.layers.31.mlp.gate_proj.weight", "model.layers.31.mlp.up_proj.weight", "model.layers.31.mlp.down_proj.weight", "model.layers.31.input_layernorm.weight", "model.layers.31.post_attention_layernorm.weight". 
[rank1]:        size mismatch for model.embed_tokens.weight: copying a param with shape torch.Size([151936, 1024]) from checkpoint, the shape in current model is torch.Size([151936, 4096]).
[rank1]:        size mismatch for model.layers.0.self_attn.q_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.0.self_attn.q_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.0.self_attn.k_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.0.self_attn.k_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.0.self_attn.v_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.0.self_attn.v_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.0.self_attn.o_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.0.mlp.gate_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.0.mlp.up_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.0.mlp.down_proj.weight: copying a param with shape torch.Size([1024, 2816]) from checkpoint, the shape in current model is torch.Size([4096, 11008]).
[rank1]:        size mismatch for model.layers.0.input_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.0.post_attention_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.1.self_attn.q_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.1.self_attn.q_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.1.self_attn.k_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.1.self_attn.k_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.1.self_attn.v_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.1.self_attn.v_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.1.self_attn.o_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.1.mlp.gate_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.1.mlp.up_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.1.mlp.down_proj.weight: copying a param with shape torch.Size([1024, 2816]) from checkpoint, the shape in current model is torch.Size([4096, 11008]).
[rank1]:        size mismatch for model.layers.1.input_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.1.post_attention_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.2.self_attn.q_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.2.self_attn.q_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.2.self_attn.k_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.2.self_attn.k_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.2.self_attn.v_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.2.self_attn.v_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.2.self_attn.o_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.2.mlp.gate_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.2.mlp.up_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.2.mlp.down_proj.weight: copying a param with shape torch.Size([1024, 2816]) from checkpoint, the shape in current model is torch.Size([4096, 11008]).
[rank1]:        size mismatch for model.layers.2.input_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.2.post_attention_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.3.self_attn.q_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.3.self_attn.q_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.3.self_attn.k_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.3.self_attn.k_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.3.self_attn.v_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.3.self_attn.v_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.3.self_attn.o_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.3.mlp.gate_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.3.mlp.up_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.3.mlp.down_proj.weight: copying a param with shape torch.Size([1024, 2816]) from checkpoint, the shape in current model is torch.Size([4096, 11008]).
[rank1]:        size mismatch for model.layers.3.input_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.3.post_attention_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.4.self_attn.q_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.4.self_attn.q_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.4.self_attn.k_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.4.self_attn.k_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.4.self_attn.v_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.4.self_attn.v_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.4.self_attn.o_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.4.mlp.gate_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.4.mlp.up_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.4.mlp.down_proj.weight: copying a param with shape torch.Size([1024, 2816]) from checkpoint, the shape in current model is torch.Size([4096, 11008]).
[rank1]:        size mismatch for model.layers.4.input_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.4.post_attention_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.5.self_attn.q_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.5.self_attn.q_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.5.self_attn.k_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.5.self_attn.k_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.5.self_attn.v_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.5.self_attn.v_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.5.self_attn.o_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.5.mlp.gate_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.5.mlp.up_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.5.mlp.down_proj.weight: copying a param with shape torch.Size([1024, 2816]) from checkpoint, the shape in current model is torch.Size([4096, 11008]).
[rank1]:        size mismatch for model.layers.5.input_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.5.post_attention_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.6.self_attn.q_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.6.self_attn.q_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.6.self_attn.k_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.6.self_attn.k_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.6.self_attn.v_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.6.self_attn.v_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.6.self_attn.o_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.6.mlp.gate_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.6.mlp.up_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.6.mlp.down_proj.weight: copying a param with shape torch.Size([1024, 2816]) from checkpoint, the shape in current model is torch.Size([4096, 11008]).
[rank1]:        size mismatch for model.layers.6.input_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.6.post_attention_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.7.self_attn.q_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.7.self_attn.q_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.7.self_attn.k_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.7.self_attn.k_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.7.self_attn.v_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.7.self_attn.v_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.7.self_attn.o_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.7.mlp.gate_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.7.mlp.up_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.7.mlp.down_proj.weight: copying a param with shape torch.Size([1024, 2816]) from checkpoint, the shape in current model is torch.Size([4096, 11008]).
[rank1]:        size mismatch for model.layers.7.input_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.7.post_attention_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.8.self_attn.q_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.8.self_attn.q_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.8.self_attn.k_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.8.self_attn.k_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.8.self_attn.v_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.8.self_attn.v_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.8.self_attn.o_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.8.mlp.gate_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.8.mlp.up_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.8.mlp.down_proj.weight: copying a param with shape torch.Size([1024, 2816]) from checkpoint, the shape in current model is torch.Size([4096, 11008]).
[rank1]:        size mismatch for model.layers.8.input_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.8.post_attention_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.9.self_attn.q_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.9.self_attn.q_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.9.self_attn.k_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.9.self_attn.k_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.9.self_attn.v_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.9.self_attn.v_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.9.self_attn.o_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.9.mlp.gate_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.9.mlp.up_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.9.mlp.down_proj.weight: copying a param with shape torch.Size([1024, 2816]) from checkpoint, the shape in current model is torch.Size([4096, 11008]).
[rank1]:        size mismatch for model.layers.9.input_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.9.post_attention_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.10.self_attn.q_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.10.self_attn.q_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.10.self_attn.k_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.10.self_attn.k_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.10.self_attn.v_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.10.self_attn.v_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.10.self_attn.o_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.10.mlp.gate_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.10.mlp.up_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.10.mlp.down_proj.weight: copying a param with shape torch.Size([1024, 2816]) from checkpoint, the shape in current model is torch.Size([4096, 11008]).
[rank1]:        size mismatch for model.layers.10.input_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.10.post_attention_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.11.self_attn.q_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.11.self_attn.q_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.11.self_attn.k_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.11.self_attn.k_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.11.self_attn.v_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.11.self_attn.v_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.11.self_attn.o_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.11.mlp.gate_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.11.mlp.up_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.11.mlp.down_proj.weight: copying a param with shape torch.Size([1024, 2816]) from checkpoint, the shape in current model is torch.Size([4096, 11008]).
[rank1]:        size mismatch for model.layers.11.input_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.11.post_attention_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.12.self_attn.q_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.12.self_attn.q_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.12.self_attn.k_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.12.self_attn.k_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.12.self_attn.v_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.12.self_attn.v_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.12.self_attn.o_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.12.mlp.gate_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.12.mlp.up_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.12.mlp.down_proj.weight: copying a param with shape torch.Size([1024, 2816]) from checkpoint, the shape in current model is torch.Size([4096, 11008]).
[rank1]:        size mismatch for model.layers.12.input_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.12.post_attention_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.13.self_attn.q_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.13.self_attn.q_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.13.self_attn.k_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.13.self_attn.k_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.13.self_attn.v_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.13.self_attn.v_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.13.self_attn.o_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.13.mlp.gate_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.13.mlp.up_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.13.mlp.down_proj.weight: copying a param with shape torch.Size([1024, 2816]) from checkpoint, the shape in current model is torch.Size([4096, 11008]).
[rank1]:        size mismatch for model.layers.13.input_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.13.post_attention_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.14.self_attn.q_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.14.self_attn.q_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.14.self_attn.k_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.14.self_attn.k_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.14.self_attn.v_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.14.self_attn.v_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.14.self_attn.o_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.14.mlp.gate_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.14.mlp.up_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.14.mlp.down_proj.weight: copying a param with shape torch.Size([1024, 2816]) from checkpoint, the shape in current model is torch.Size([4096, 11008]).
[rank1]:        size mismatch for model.layers.14.input_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.14.post_attention_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.15.self_attn.q_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.15.self_attn.q_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.15.self_attn.k_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.15.self_attn.k_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.15.self_attn.v_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.15.self_attn.v_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.15.self_attn.o_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.15.mlp.gate_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.15.mlp.up_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.15.mlp.down_proj.weight: copying a param with shape torch.Size([1024, 2816]) from checkpoint, the shape in current model is torch.Size([4096, 11008]).
[rank1]:        size mismatch for model.layers.15.input_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.15.post_attention_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.16.self_attn.q_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.16.self_attn.q_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.16.self_attn.k_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.16.self_attn.k_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.16.self_attn.v_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.16.self_attn.v_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.16.self_attn.o_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.16.mlp.gate_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.16.mlp.up_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.16.mlp.down_proj.weight: copying a param with shape torch.Size([1024, 2816]) from checkpoint, the shape in current model is torch.Size([4096, 11008]).
[rank1]:        size mismatch for model.layers.16.input_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.16.post_attention_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.17.self_attn.q_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.17.self_attn.q_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.17.self_attn.k_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.17.self_attn.k_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.17.self_attn.v_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.17.self_attn.v_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.17.self_attn.o_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.17.mlp.gate_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.17.mlp.up_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.17.mlp.down_proj.weight: copying a param with shape torch.Size([1024, 2816]) from checkpoint, the shape in current model is torch.Size([4096, 11008]).
[rank1]:        size mismatch for model.layers.17.input_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.17.post_attention_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.18.self_attn.q_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.18.self_attn.q_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.18.self_attn.k_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.18.self_attn.k_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.18.self_attn.v_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.18.self_attn.v_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.18.self_attn.o_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.18.mlp.gate_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.18.mlp.up_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.18.mlp.down_proj.weight: copying a param with shape torch.Size([1024, 2816]) from checkpoint, the shape in current model is torch.Size([4096, 11008]).
[rank1]:        size mismatch for model.layers.18.input_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.18.post_attention_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.19.self_attn.q_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.19.self_attn.q_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.19.self_attn.k_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.19.self_attn.k_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.19.self_attn.v_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.19.self_attn.v_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.19.self_attn.o_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.19.mlp.gate_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.19.mlp.up_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.19.mlp.down_proj.weight: copying a param with shape torch.Size([1024, 2816]) from checkpoint, the shape in current model is torch.Size([4096, 11008]).
[rank1]:        size mismatch for model.layers.19.input_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.19.post_attention_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.20.self_attn.q_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.20.self_attn.q_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.20.self_attn.k_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.20.self_attn.k_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.20.self_attn.v_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.20.self_attn.v_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.20.self_attn.o_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.20.mlp.gate_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.20.mlp.up_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.20.mlp.down_proj.weight: copying a param with shape torch.Size([1024, 2816]) from checkpoint, the shape in current model is torch.Size([4096, 11008]).
[rank1]:        size mismatch for model.layers.20.input_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.20.post_attention_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.21.self_attn.q_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.21.self_attn.q_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.21.self_attn.k_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.21.self_attn.k_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.21.self_attn.v_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.21.self_attn.v_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.21.self_attn.o_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.21.mlp.gate_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.21.mlp.up_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.21.mlp.down_proj.weight: copying a param with shape torch.Size([1024, 2816]) from checkpoint, the shape in current model is torch.Size([4096, 11008]).
[rank1]:        size mismatch for model.layers.21.input_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.21.post_attention_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.22.self_attn.q_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.22.self_attn.q_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.22.self_attn.k_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.22.self_attn.k_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.22.self_attn.v_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.22.self_attn.v_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.22.self_attn.o_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.22.mlp.gate_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.22.mlp.up_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.22.mlp.down_proj.weight: copying a param with shape torch.Size([1024, 2816]) from checkpoint, the shape in current model is torch.Size([4096, 11008]).
[rank1]:        size mismatch for model.layers.22.input_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.22.post_attention_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.23.self_attn.q_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.23.self_attn.q_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.23.self_attn.k_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.23.self_attn.k_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.23.self_attn.v_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.23.self_attn.v_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.23.self_attn.o_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 4096]).
[rank1]:        size mismatch for model.layers.23.mlp.gate_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.23.mlp.up_proj.weight: copying a param with shape torch.Size([2816, 1024]) from checkpoint, the shape in current model is torch.Size([11008, 4096]).
[rank1]:        size mismatch for model.layers.23.mlp.down_proj.weight: copying a param with shape torch.Size([1024, 2816]) from checkpoint, the shape in current model is torch.Size([4096, 11008]).
[rank1]:        size mismatch for model.layers.23.input_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.layers.23.post_attention_layernorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for model.norm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([4096]).
[rank1]:        size mismatch for lm_head.weight: copying a param with shape torch.Size([151936, 1024]) from checkpoint, the shape in current model is torch.Size([151936, 4096]).
W0527 01:21:26.844000 140472356030272 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1056309 closing signal SIGTERM
W0527 01:21:26.844000 140472356030272 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1056310 closing signal SIGTERM
DavidYanAnDe commented 2 months ago

I have the same issue with DeepSpeed stage 3, except that in my case the shape in the current model is torch.Size([0]). Could someone please help us? T_T
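
A likely explanation for the torch.Size([0]) shapes under stage 3 is that ZeRO-3 partitions every parameter across ranks, so the live model only holds zero-sized placeholders until the shards are gathered. A minimal sketch of working around this by first consolidating the ZeRO checkpoint into an ordinary fp32 state_dict (the checkpoint path and model id below are assumptions, not taken from this thread):

```python
# Consolidate ZeRO shards into a single fp32 state_dict, then load it into a
# normally constructed (non-partitioned) model. Assumes the directory layout
# written by deepspeed_engine.save_checkpoint().
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint
from transformers import AutoModelForCausalLM

checkpoint_dir = "output/checkpoint-1000"                        # assumed path
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen1.5-7B")  # assumed model id

state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir)
model.load_state_dict(state_dict)  # full-shape tensors, no size-0 placeholders
```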

tjruwase commented 2 months ago

@lxd551326, it seems you are seeing two different issues.

  1. CUDA OOM when using DeepSpeed for a model that works with pure PyTorch is very strange and should be investigated. Can you provide more repro details for that?
  2. The checkpoint loading problem seems to be due to a mismatch between the checkpoint and the model definition. Can you check whether it works with PyTorch only? (A shape-comparison sketch follows below.)

For both cases above, it would be very helpful if you could provide repro steps.
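
The shape mismatches in the traceback above point in that direction: the checkpoint tensors have hidden size 1024 and MLP intermediate size 2816, while the current model expects 4096 and 11008, which looks like a checkpoint from a much smaller Qwen variant being resumed into Qwen1.5-7B. A hedged, pure-PyTorch sketch for confirming such a mismatch outside DeepSpeed (ckpt_path and the "module" nesting are assumptions about a ZeRO-2 checkpoint layout):

```python
# Compare the shapes stored in a checkpoint against what the current model
# definition expects, without involving DeepSpeed at all.
import torch
from transformers import AutoConfig, AutoModelForCausalLM

ckpt_path = "output/checkpoint-1000/global_step1000/mp_rank_00_model_states.pt"  # assumed
ckpt = torch.load(ckpt_path, map_location="cpu")
saved = ckpt.get("module", ckpt)  # ZeRO-2 checkpoints nest the module state_dict under "module"

model = AutoModelForCausalLM.from_config(AutoConfig.from_pretrained("Qwen/Qwen1.5-7B"))
expected = model.state_dict()

for name, tensor in saved.items():
    if name in expected and tensor.shape != expected[name].shape:
        print(f"{name}: checkpoint {tuple(tensor.shape)} vs model {tuple(expected[name].shape)}")
```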

lhyscau commented 1 month ago
> 1. CUDA OOM when using DeepSpeed for a model that works with pure PyTorch is very strange and should be investigated.

Have you solved the problem? I am running into it too. The shapes are correct in my program when I do not use DeepSpeed.

tjruwase commented 1 month ago

@lhyscau, @DavidYanAnDe, and @lxd551326 are you able to provide repro steps?

lhyscau commented 1 month ago

> @lhyscau, @DavidYanAnDe, and @lxd551326 are you able to provide repro steps?

I commented out the zero_optimization block in the DeepSpeed config file, and the error no longer happens.
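
For reference, a sketch of what that workaround amounts to when the config is passed as a dict: the zero_optimization block is simply absent, so DeepSpeed falls back to ZeRO stage 0 (plain data parallelism). The keys below are illustrative, not a recommended configuration, and note that dropping ZeRO also gives up the memory savings that usually motivate DeepSpeed, so this is a workaround rather than a fix for the underlying checkpoint mismatch.

```python
# Hypothetical config dict without a zero_optimization section (stage 0).
# "auto" placeholders are filled in by the transformers DeepSpeed integration.
ds_config_no_zero = {
    "fp16": {"enabled": "auto"},
    "bf16": {"enabled": "auto"},
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    # "zero_optimization": { ... }   # omitted, as in the workaround above
}
```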