hpcaitech / EnergonAI

Large-scale model inference.
Apache License 2.0

Failed to load pre-trained model weights for OPT_125M #217

Closed zhengmk321 closed 1 year ago

zhengmk321 commented 1 year ago

Hi, I am having trouble loading the pre-trained model weights for OPT_125M provided by Meta. Here is the error message:

Process SpawnProcess-1:
Traceback (most recent call last):
  File "/work/09308/zhengmk/miniconda3/envs/colossal/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/work/09308/zhengmk/miniconda3/envs/colossal/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/work/09308/zhengmk/miniconda3/envs/colossal/lib/python3.8/site-packages/energonai/worker.py", line 30, in __init__
    self.model: nn.Module = model_fn(**model_kwargs).cuda()
  File "/work/09308/zhengmk/miniconda3/envs/colossal/lib/python3.8/site-packages/energonai/model/model_factory.py", line 283, in opt_125M
    return create_pipeline_model(**model_kwargs)
  File "/work/09308/zhengmk/miniconda3/envs/colossal/lib/python3.8/site-packages/energonai/model/model_factory.py", line 213, in create_pipeline_model
    load_checkpoint(model_kwargs["checkpoint"], model, preprocess_fn=preprocess_fn, **model_kwargs)
  File "/work/09308/zhengmk/miniconda3/envs/colossal/lib/python3.8/site-packages/energonai/utils/checkpointing.py", line 95, in load_checkpoint
    model.load_state_dict(model_state, strict=strict)
  File "/work/09308/zhengmk/miniconda3/envs/colossal/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1604, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for PipelineModel:
        Missing key(s) in state_dict: "blocks.0.norm1.module.weight", "blocks.0.norm1.module.bias", "blocks.0.norm2.module.weight", "blocks.0.norm2.module.bias", "blocks.1.norm1.module.weight", "blocks.1.norm1.module.bias", "blocks.1.norm2.module.weight", "blocks.1.norm2.module.bias", "blocks.2.norm1.module.weight", "blocks.2.norm1.module.bias", "blocks.2.norm2.module.weight", "blocks.2.norm2.module.bias", "blocks.3.norm1.module.weight", "blocks.3.norm1.module.bias", "blocks.3.norm2.module.weight", "blocks.3.norm2.module.bias", "blocks.4.norm1.module.weight", "blocks.4.norm1.module.bias", "blocks.4.norm2.module.weight", "blocks.4.norm2.module.bias", "blocks.5.norm1.module.weight", "blocks.5.norm1.module.bias", "blocks.5.norm2.module.weight", "blocks.5.norm2.module.bias", "blocks.6.norm1.module.weight", "blocks.6.norm1.module.bias", "blocks.6.norm2.module.weight", "blocks.6.norm2.module.bias", "blocks.7.norm1.module.weight", "blocks.7.norm1.module.bias", "blocks.7.norm2.module.weight", "blocks.7.norm2.module.bias", "blocks.8.norm1.module.weight", "blocks.8.norm1.module.bias", "blocks.8.norm2.module.weight", "blocks.8.norm2.module.bias", "blocks.9.norm1.module.weight", "blocks.9.norm1.module.bias", "blocks.9.norm2.module.weight", "blocks.9.norm2.module.bias", "blocks.10.norm1.module.weight", "blocks.10.norm1.module.bias", "blocks.10.norm2.module.weight", "blocks.10.norm2.module.bias", "blocks.11.norm1.module.weight", "blocks.11.norm1.module.bias", "blocks.11.norm2.module.weight", "blocks.11.norm2.module.bias", "norm.module.weight", "norm.module.bias". 
        Unexpected key(s) in state_dict: "blocks.0.norm1.weight", "blocks.0.norm1.bias", "blocks.0.norm2.weight", "blocks.0.norm2.bias", "blocks.1.norm1.weight", "blocks.1.norm1.bias", "blocks.1.norm2.weight", "blocks.1.norm2.bias", "blocks.2.norm1.weight", "blocks.2.norm1.bias", "blocks.2.norm2.weight", "blocks.2.norm2.bias", "blocks.3.norm1.weight", "blocks.3.norm1.bias", "blocks.3.norm2.weight", "blocks.3.norm2.bias", "blocks.4.norm1.weight", "blocks.4.norm1.bias", "blocks.4.norm2.weight", "blocks.4.norm2.bias", "blocks.5.norm1.weight", "blocks.5.norm1.bias", "blocks.5.norm2.weight", "blocks.5.norm2.bias", "blocks.6.norm1.weight", "blocks.6.norm1.bias", "blocks.6.norm2.weight", "blocks.6.norm2.bias", "blocks.7.norm1.weight", "blocks.7.norm1.bias", "blocks.7.norm2.weight", "blocks.7.norm2.bias", "blocks.8.norm1.weight", "blocks.8.norm1.bias", "blocks.8.norm2.weight", "blocks.8.norm2.bias", "blocks.9.norm1.weight", "blocks.9.norm1.bias", "blocks.9.norm2.weight", "blocks.9.norm2.bias", "blocks.10.norm1.weight", "blocks.10.norm1.bias", "blocks.10.norm2.weight", "blocks.10.norm2.bias", "blocks.11.norm1.weight", "blocks.11.norm1.bias", "blocks.11.norm2.weight", "blocks.11.norm2.bias", "norm.weight", "norm.bias". 
load 1 files using 1 procs
Load file time: 0.136 s

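It seems that load_checkpoint() expects parameter names that differ from those stored in checkpoint.pt: the model wants the norm keys with an extra .module segment (e.g. blocks.0.norm1.module.weight), while the checkpoint provides blocks.0.norm1.weight. Is this caused by a version mismatch? I am using energonai==0.0.2.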
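To confirm that the two sides differ only in naming, the key sets can be compared directly. The following is a minimal sketch in plain Python, not part of EnergonAI: the helper names diff_state_dict_keys and differs_only_by_segment are hypothetical, and the sample keys are a small subset copied from the traceback above.

```python
from typing import Dict, Iterable, List, Set


def diff_state_dict_keys(model_keys: Iterable[str],
                         ckpt_keys: Iterable[str]) -> Dict[str, List[str]]:
    """Return the keys a strict load would report as missing or unexpected."""
    model_set: Set[str] = set(model_keys)
    ckpt_set: Set[str] = set(ckpt_keys)
    return {
        # Expected by the model but absent from the checkpoint.
        "missing": sorted(model_set - ckpt_set),
        # Present in the checkpoint but unknown to the model.
        "unexpected": sorted(ckpt_set - model_set),
    }


def differs_only_by_segment(missing: Iterable[str],
                            unexpected: Iterable[str],
                            segment: str = "module") -> bool:
    """True if dropping `segment` from every missing key gives exactly the unexpected keys."""
    stripped = {
        ".".join(part for part in key.split(".") if part != segment)
        for key in missing
    }
    return stripped == set(unexpected)


# Subset of the key names from the traceback (hypothetical sample, not the full model).
model_keys = [
    "blocks.0.norm1.module.weight",
    "blocks.0.norm1.module.bias",
    "norm.module.weight",
    "norm.module.bias",
    "blocks.0.attn.dense.weight",  # assumed to match on both sides
]
ckpt_keys = [
    "blocks.0.norm1.weight",
    "blocks.0.norm1.bias",
    "norm.weight",
    "norm.bias",
    "blocks.0.attn.dense.weight",
]

report = diff_state_dict_keys(model_keys, ckpt_keys)
print(report["missing"])
print(report["unexpected"])
print(differs_only_by_segment(report["missing"], report["unexpected"]))
```

With real objects, the inputs would presumably be model.state_dict().keys() and the keys of the dictionary returned by torch.load() on the checkpoint file, assuming the checkpoint stores a flat state dict. If the last check returns True, the mismatch is purely a naming difference, which points at a version mismatch between the installed package and the checkpoint format rather than at a corrupted file.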

Thanks for your help in advance.

zhengmk321 commented 1 year ago

Hi, I figured out the issue above: I had indeed installed the wrong version of energonai. After installing the correct one by following the instructions in README.md, I tried hosting the OPT_30B model, but got the following error messages:

Process SpawnProcess-1:
Process SpawnProcess-2:
Traceback (most recent call last):
  File "/work/09308/zhengmk/miniconda3/envs/colossal/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/work/09308/zhengmk/miniconda3/envs/colossal/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/work/09308/zhengmk/miniconda3/envs/colossal/lib/python3.8/site-packages/energonai/worker.py", line 32, in __init__
    self.model: nn.Module = model_fn(**model_kwargs).cuda()
  File "/work/09308/zhengmk/miniconda3/envs/colossal/lib/python3.8/site-packages/energonai/model/model_factory.py", line 323, in opt_30B
    return create_pipeline_model(**model_kwargs)
  File "/work/09308/zhengmk/miniconda3/envs/colossal/lib/python3.8/site-packages/energonai/model/model_factory.py", line 223, in create_pipeline_model
    load_checkpoint(model_kwargs["checkpoint"], model, preprocess_fn=preprocess_fn, **model_kwargs)
  File "/work/09308/zhengmk/miniconda3/envs/colossal/lib/python3.8/site-packages/energonai/utils/checkpointing.py", line 95, in load_checkpoint
    model.load_state_dict(model_state, strict=strict)
  File "/work/09308/zhengmk/miniconda3/envs/colossal/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1604, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for PipelineModel:
        size mismatch for blocks.0.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.0.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.0.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.0.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.1.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.1.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.1.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.1.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.2.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.2.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.2.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.2.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.3.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.3.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.3.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.3.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.4.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.4.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.4.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.4.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.5.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.5.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.5.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.5.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.6.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.6.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.6.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.6.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.7.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.7.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.7.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.7.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.8.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.8.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.8.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.8.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.9.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.9.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.9.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.9.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.10.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.10.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.10.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.10.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.11.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.11.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.11.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.11.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.12.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.12.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.12.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.12.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.13.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.13.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.13.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.13.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.14.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.14.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.14.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.14.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.15.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.15.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.15.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.15.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.16.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.16.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.16.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.16.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.17.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.17.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.17.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.17.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.18.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.18.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.18.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.18.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.19.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.19.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.19.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.19.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.20.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.20.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.20.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.20.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.21.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.21.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.21.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.21.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.22.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.22.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.22.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.22.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.23.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.23.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.23.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.23.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.24.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.24.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.24.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.24.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.25.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.25.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.25.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.25.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.26.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.26.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.26.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.26.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.27.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.27.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.27.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.27.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.28.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.28.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.28.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.28.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.29.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.29.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.29.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.29.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.30.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.30.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.30.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.30.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.31.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.31.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.31.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.31.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.32.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.32.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.32.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.32.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.33.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.33.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.33.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.33.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.34.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.34.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.34.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.34.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.35.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.35.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.35.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.35.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.36.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.36.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.36.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.36.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.37.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.37.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.37.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.37.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.38.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.38.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.38.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.38.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.39.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.39.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.39.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.39.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.40.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.40.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.40.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.40.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.41.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.41.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.41.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.41.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.42.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.42.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.42.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.42.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.43.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.43.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.43.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.43.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.44.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.44.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.44.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.44.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.45.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.45.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.45.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.45.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.46.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.46.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.46.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.46.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.47.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.47.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.47.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.47.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for head.dense.weight: copying a param with shape torch.Size([12568, 3584]) from checkpoint, the shape in current model is torch.Size([50272, 3584]).
Traceback (most recent call last):
  File "/work/09308/zhengmk/miniconda3/envs/colossal/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/work/09308/zhengmk/miniconda3/envs/colossal/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/work/09308/zhengmk/miniconda3/envs/colossal/lib/python3.8/site-packages/energonai/worker.py", line 32, in __init__
    self.model: nn.Module = model_fn(**model_kwargs).cuda()
  File "/work/09308/zhengmk/miniconda3/envs/colossal/lib/python3.8/site-packages/energonai/model/model_factory.py", line 323, in opt_30B
    return create_pipeline_model(**model_kwargs)
  File "/work/09308/zhengmk/miniconda3/envs/colossal/lib/python3.8/site-packages/energonai/model/model_factory.py", line 223, in create_pipeline_model
    load_checkpoint(model_kwargs["checkpoint"], model, preprocess_fn=preprocess_fn, **model_kwargs)
  File "/work/09308/zhengmk/miniconda3/envs/colossal/lib/python3.8/site-packages/energonai/utils/checkpointing.py", line 95, in load_checkpoint
    model.load_state_dict(model_state, strict=strict)
  File "/work/09308/zhengmk/miniconda3/envs/colossal/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1604, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for PipelineModel:
        Missing key(s) in state_dict: "blocks.0.attn.query_.weight", "blocks.0.attn.query_.bias", "blocks.0.attn.key_.weight", "blocks.0.attn.key_.bias", "blocks.0.attn.value_.weight", "blocks.0.attn.value_.bias", "blocks.1.attn.query_.weight", "blocks.1.attn.query_.bias", "blocks.1.attn.key_.weight", "blocks.1.attn.key_.bias", "blocks.1.attn.value_.weight", "blocks.1.attn.value_.bias", "blocks.2.attn.query_.weight", "blocks.2.attn.query_.bias", "blocks.2.attn.key_.weight", "blocks.2.attn.key_.bias", "blocks.2.attn.value_.weight", "blocks.2.attn.value_.bias", "blocks.3.attn.query_.weight", "blocks.3.attn.query_.bias", "blocks.3.attn.key_.weight", "blocks.3.attn.key_.bias", "blocks.3.attn.value_.weight", "blocks.3.attn.value_.bias", "blocks.4.attn.query_.weight", "blocks.4.attn.query_.bias", "blocks.4.attn.key_.weight", "blocks.4.attn.key_.bias", "blocks.4.attn.value_.weight", "blocks.4.attn.value_.bias", "blocks.5.attn.query_.weight", "blocks.5.attn.query_.bias", "blocks.5.attn.key_.weight", "blocks.5.attn.key_.bias", "blocks.5.attn.value_.weight", "blocks.5.attn.value_.bias", "blocks.6.attn.query_.weight", "blocks.6.attn.query_.bias", "blocks.6.attn.key_.weight", "blocks.6.attn.key_.bias", "blocks.6.attn.value_.weight", "blocks.6.attn.value_.bias", "blocks.7.attn.query_.weight", "blocks.7.attn.query_.bias", "blocks.7.attn.key_.weight", "blocks.7.attn.key_.bias", "blocks.7.attn.value_.weight", "blocks.7.attn.value_.bias", "blocks.8.attn.query_.weight", "blocks.8.attn.query_.bias", "blocks.8.attn.key_.weight", "blocks.8.attn.key_.bias", "blocks.8.attn.value_.weight", "blocks.8.attn.value_.bias", "blocks.9.attn.query_.weight", "blocks.9.attn.query_.bias", "blocks.9.attn.key_.weight", "blocks.9.attn.key_.bias", "blocks.9.attn.value_.weight", "blocks.9.attn.value_.bias", "blocks.10.attn.query_.weight", "blocks.10.attn.query_.bias", "blocks.10.attn.key_.weight", "blocks.10.attn.key_.bias", "blocks.10.attn.value_.weight", "blocks.10.attn.value_.bias", 
"blocks.11.attn.query_.weight", "blocks.11.attn.query_.bias", "blocks.11.attn.key_.weight", "blocks.11.attn.key_.bias", "blocks.11.attn.value_.weight", "blocks.11.attn.value_.bias", "blocks.12.attn.query_.weight", "blocks.12.attn.query_.bias", "blocks.12.attn.key_.weight", "blocks.12.attn.key_.bias", "blocks.12.attn.value_.weight", "blocks.12.attn.value_.bias", "blocks.13.attn.query_.weight", "blocks.13.attn.query_.bias", "blocks.13.attn.key_.weight", "blocks.13.attn.key_.bias", "blocks.13.attn.value_.weight", "blocks.13.attn.value_.bias", "blocks.14.attn.query_.weight", "blocks.14.attn.query_.bias", "blocks.14.attn.key_.weight", "blocks.14.attn.key_.bias", "blocks.14.attn.value_.weight", "blocks.14.attn.value_.bias", "blocks.15.attn.query_.weight", "blocks.15.attn.query_.bias", "blocks.15.attn.key_.weight", "blocks.15.attn.key_.bias", "blocks.15.attn.value_.weight", "blocks.15.attn.value_.bias", "blocks.16.attn.query_.weight", "blocks.16.attn.query_.bias", "blocks.16.attn.key_.weight", "blocks.16.attn.key_.bias", "blocks.16.attn.value_.weight", "blocks.16.attn.value_.bias", "blocks.17.attn.query_.weight", "blocks.17.attn.query_.bias", "blocks.17.attn.key_.weight", "blocks.17.attn.key_.bias", "blocks.17.attn.value_.weight", "blocks.17.attn.value_.bias", "blocks.18.attn.query_.weight", "blocks.18.attn.query_.bias", "blocks.18.attn.key_.weight", "blocks.18.attn.key_.bias", "blocks.18.attn.value_.weight", "blocks.18.attn.value_.bias", "blocks.19.attn.query_.weight", "blocks.19.attn.query_.bias", "blocks.19.attn.key_.weight", "blocks.19.attn.key_.bias", "blocks.19.attn.value_.weight", "blocks.19.attn.value_.bias", "blocks.20.attn.query_.weight", "blocks.20.attn.query_.bias", "blocks.20.attn.key_.weight", "blocks.20.attn.key_.bias", "blocks.20.attn.value_.weight", "blocks.20.attn.value_.bias", "blocks.21.attn.query_.weight", "blocks.21.attn.query_.bias", "blocks.21.attn.key_.weight", "blocks.21.attn.key_.bias", "blocks.21.attn.value_.weight", 
"blocks.21.attn.value_.bias", "blocks.22.attn.query_.weight", "blocks.22.attn.query_.bias", "blocks.22.attn.key_.weight", "blocks.22.attn.key_.bias", "blocks.22.attn.value_.weight", "blocks.22.attn.value_.bias", "blocks.23.attn.query_.weight", "blocks.23.attn.query_.bias", "blocks.23.attn.key_.weight", "blocks.23.attn.key_.bias", "blocks.23.attn.value_.weight", "blocks.23.attn.value_.bias", "blocks.24.attn.query_.weight", "blocks.24.attn.query_.bias", "blocks.24.attn.key_.weight", "blocks.24.attn.key_.bias", "blocks.24.attn.value_.weight", "blocks.24.attn.value_.bias", "blocks.25.attn.query_.weight", "blocks.25.attn.query_.bias", "blocks.25.attn.key_.weight", "blocks.25.attn.key_.bias", "blocks.25.attn.value_.weight", "blocks.25.attn.value_.bias", "blocks.26.attn.query_.weight", "blocks.26.attn.query_.bias", "blocks.26.attn.key_.weight", "blocks.26.attn.key_.bias", "blocks.26.attn.value_.weight", "blocks.26.attn.value_.bias", "blocks.27.attn.query_.weight", "blocks.27.attn.query_.bias", "blocks.27.attn.key_.weight", "blocks.27.attn.key_.bias", "blocks.27.attn.value_.weight", "blocks.27.attn.value_.bias", "blocks.28.attn.query_.weight", "blocks.28.attn.query_.bias", "blocks.28.attn.key_.weight", "blocks.28.attn.key_.bias", "blocks.28.attn.value_.weight", "blocks.28.attn.value_.bias", "blocks.29.attn.query_.weight", "blocks.29.attn.query_.bias", "blocks.29.attn.key_.weight", "blocks.29.attn.key_.bias", "blocks.29.attn.value_.weight", "blocks.29.attn.value_.bias", "blocks.30.attn.query_.weight", "blocks.30.attn.query_.bias", "blocks.30.attn.key_.weight", "blocks.30.attn.key_.bias", "blocks.30.attn.value_.weight", "blocks.30.attn.value_.bias", "blocks.31.attn.query_.weight", "blocks.31.attn.query_.bias", "blocks.31.attn.key_.weight", "blocks.31.attn.key_.bias", "blocks.31.attn.value_.weight", "blocks.31.attn.value_.bias", "blocks.32.attn.query_.weight", "blocks.32.attn.query_.bias", "blocks.32.attn.key_.weight", "blocks.32.attn.key_.bias", 
"blocks.32.attn.value_.weight", "blocks.32.attn.value_.bias", "blocks.33.attn.query_.weight", "blocks.33.attn.query_.bias", "blocks.33.attn.key_.weight", "blocks.33.attn.key_.bias", "blocks.33.attn.value_.weight", "blocks.33.attn.value_.bias", "blocks.34.attn.query_.weight", "blocks.34.attn.query_.bias", "blocks.34.attn.key_.weight", "blocks.34.attn.key_.bias", "blocks.34.attn.value_.weight", "blocks.34.attn.value_.bias", "blocks.35.attn.query_.weight", "blocks.35.attn.query_.bias", "blocks.35.attn.key_.weight", "blocks.35.attn.key_.bias", "blocks.35.attn.value_.weight", "blocks.35.attn.value_.bias", "blocks.36.attn.query_.weight", "blocks.36.attn.query_.bias", "blocks.36.attn.key_.weight", "blocks.36.attn.key_.bias", "blocks.36.attn.value_.weight", "blocks.36.attn.value_.bias", "blocks.37.attn.query_.weight", "blocks.37.attn.query_.bias", "blocks.37.attn.key_.weight", "blocks.37.attn.key_.bias", "blocks.37.attn.value_.weight", "blocks.37.attn.value_.bias", "blocks.38.attn.query_.weight", "blocks.38.attn.query_.bias", "blocks.38.attn.key_.weight", "blocks.38.attn.key_.bias", "blocks.38.attn.value_.weight", "blocks.38.attn.value_.bias", "blocks.39.attn.query_.weight", "blocks.39.attn.query_.bias", "blocks.39.attn.key_.weight", "blocks.39.attn.key_.bias", "blocks.39.attn.value_.weight", "blocks.39.attn.value_.bias", "blocks.40.attn.query_.weight", "blocks.40.attn.query_.bias", "blocks.40.attn.key_.weight", "blocks.40.attn.key_.bias", "blocks.40.attn.value_.weight", "blocks.40.attn.value_.bias", "blocks.41.attn.query_.weight", "blocks.41.attn.query_.bias", "blocks.41.attn.key_.weight", "blocks.41.attn.key_.bias", "blocks.41.attn.value_.weight", "blocks.41.attn.value_.bias", "blocks.42.attn.query_.weight", "blocks.42.attn.query_.bias", "blocks.42.attn.key_.weight", "blocks.42.attn.key_.bias", "blocks.42.attn.value_.weight", "blocks.42.attn.value_.bias", "blocks.43.attn.query_.weight", "blocks.43.attn.query_.bias", "blocks.43.attn.key_.weight", 
"blocks.43.attn.key_.bias", "blocks.43.attn.value_.weight", "blocks.43.attn.value_.bias", "blocks.44.attn.query_.weight", "blocks.44.attn.query_.bias", "blocks.44.attn.key_.weight", "blocks.44.attn.key_.bias", "blocks.44.attn.value_.weight", "blocks.44.attn.value_.bias", "blocks.45.attn.query_.weight", "blocks.45.attn.query_.bias", "blocks.45.attn.key_.weight", "blocks.45.attn.key_.bias", "blocks.45.attn.value_.weight", "blocks.45.attn.value_.bias", "blocks.46.attn.query_.weight", "blocks.46.attn.query_.bias", "blocks.46.attn.key_.weight", "blocks.46.attn.key_.bias", "blocks.46.attn.value_.weight", "blocks.46.attn.value_.bias", "blocks.47.attn.query_.weight", "blocks.47.attn.query_.bias", "blocks.47.attn.key_.weight", "blocks.47.attn.key_.bias", "blocks.47.attn.value_.weight", "blocks.47.attn.value_.bias". 
        Unexpected key(s) in state_dict: "blocks.0.self_attn.qkv_proj.weight", "blocks.0.self_attn.qkv_proj.bias", "blocks.1.self_attn.qkv_proj.weight", "blocks.1.self_attn.qkv_proj.bias", "blocks.2.self_attn.qkv_proj.weight", "blocks.2.self_attn.qkv_proj.bias", "blocks.3.self_attn.qkv_proj.weight", "blocks.3.self_attn.qkv_proj.bias", "blocks.4.self_attn.qkv_proj.weight", "blocks.4.self_attn.qkv_proj.bias", "blocks.5.self_attn.qkv_proj.weight", "blocks.5.self_attn.qkv_proj.bias", "blocks.6.self_attn.qkv_proj.weight", "blocks.6.self_attn.qkv_proj.bias", "blocks.7.self_attn.qkv_proj.weight", "blocks.7.self_attn.qkv_proj.bias", "blocks.8.self_attn.qkv_proj.weight", "blocks.8.self_attn.qkv_proj.bias", "blocks.9.self_attn.qkv_proj.weight", "blocks.9.self_attn.qkv_proj.bias", "blocks.10.self_attn.qkv_proj.weight", "blocks.10.self_attn.qkv_proj.bias", "blocks.11.self_attn.qkv_proj.weight", "blocks.11.self_attn.qkv_proj.bias", "blocks.12.self_attn.qkv_proj.weight", "blocks.12.self_attn.qkv_proj.bias", "blocks.13.self_attn.qkv_proj.weight", "blocks.13.self_attn.qkv_proj.bias", "blocks.14.self_attn.qkv_proj.weight", "blocks.14.self_attn.qkv_proj.bias", "blocks.15.self_attn.qkv_proj.weight", "blocks.15.self_attn.qkv_proj.bias", "blocks.16.self_attn.qkv_proj.weight", "blocks.16.self_attn.qkv_proj.bias", "blocks.17.self_attn.qkv_proj.weight", "blocks.17.self_attn.qkv_proj.bias", "blocks.18.self_attn.qkv_proj.weight", "blocks.18.self_attn.qkv_proj.bias", "blocks.19.self_attn.qkv_proj.weight", "blocks.19.self_attn.qkv_proj.bias", "blocks.20.self_attn.qkv_proj.weight", "blocks.20.self_attn.qkv_proj.bias", "blocks.21.self_attn.qkv_proj.weight", "blocks.21.self_attn.qkv_proj.bias", "blocks.22.self_attn.qkv_proj.weight", "blocks.22.self_attn.qkv_proj.bias", "blocks.23.self_attn.qkv_proj.weight", "blocks.23.self_attn.qkv_proj.bias", "blocks.24.self_attn.qkv_proj.weight", "blocks.24.self_attn.qkv_proj.bias", "blocks.25.self_attn.qkv_proj.weight", "blocks.25.self_attn.qkv_proj.bias", 
"blocks.26.self_attn.qkv_proj.weight", "blocks.26.self_attn.qkv_proj.bias", "blocks.27.self_attn.qkv_proj.weight", "blocks.27.self_attn.qkv_proj.bias", "blocks.28.self_attn.qkv_proj.weight", "blocks.28.self_attn.qkv_proj.bias", "blocks.29.self_attn.qkv_proj.weight", "blocks.29.self_attn.qkv_proj.bias", "blocks.30.self_attn.qkv_proj.weight", "blocks.30.self_attn.qkv_proj.bias", "blocks.31.self_attn.qkv_proj.weight", "blocks.31.self_attn.qkv_proj.bias", "blocks.32.self_attn.qkv_proj.weight", "blocks.32.self_attn.qkv_proj.bias", "blocks.33.self_attn.qkv_proj.weight", "blocks.33.self_attn.qkv_proj.bias", "blocks.34.self_attn.qkv_proj.weight", "blocks.34.self_attn.qkv_proj.bias", "blocks.35.self_attn.qkv_proj.weight", "blocks.35.self_attn.qkv_proj.bias", "blocks.36.self_attn.qkv_proj.weight", "blocks.36.self_attn.qkv_proj.bias", "blocks.37.self_attn.qkv_proj.weight", "blocks.37.self_attn.qkv_proj.bias", "blocks.38.self_attn.qkv_proj.weight", "blocks.38.self_attn.qkv_proj.bias", "blocks.39.self_attn.qkv_proj.weight", "blocks.39.self_attn.qkv_proj.bias", "blocks.40.self_attn.qkv_proj.weight", "blocks.40.self_attn.qkv_proj.bias", "blocks.41.self_attn.qkv_proj.weight", "blocks.41.self_attn.qkv_proj.bias", "blocks.42.self_attn.qkv_proj.weight", "blocks.42.self_attn.qkv_proj.bias", "blocks.43.self_attn.qkv_proj.weight", "blocks.43.self_attn.qkv_proj.bias", "blocks.44.self_attn.qkv_proj.weight", "blocks.44.self_attn.qkv_proj.bias", "blocks.45.self_attn.qkv_proj.weight", "blocks.45.self_attn.qkv_proj.bias", "blocks.46.self_attn.qkv_proj.weight", "blocks.46.self_attn.qkv_proj.bias", "blocks.47.self_attn.qkv_proj.weight", "blocks.47.self_attn.qkv_proj.bias". 
        size mismatch for embed.word_embeddings.weight: copying a param with shape torch.Size([12568, 7168]) from checkpoint, the shape in current model is torch.Size([50272, 7168]).
        size mismatch for blocks.0.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.0.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.0.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.0.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.1.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.1.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.1.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.1.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.2.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.2.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.2.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.2.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.3.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.3.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.3.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.3.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.4.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.4.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.4.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.4.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.5.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.5.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.5.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.5.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.6.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.6.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.6.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.6.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.7.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.7.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.7.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.7.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.8.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.8.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.8.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.8.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.9.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.9.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.9.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.9.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.10.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.10.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.10.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.10.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.11.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.11.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.11.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.11.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.12.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.12.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.12.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.12.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.13.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.13.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.13.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.13.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.14.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.14.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.14.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.14.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.15.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.15.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.15.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.15.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.16.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.16.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.16.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.16.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.17.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.17.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.17.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.17.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.18.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.18.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.18.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.18.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.19.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.19.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.19.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.19.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.20.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.20.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.20.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.20.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.21.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.21.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.21.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.21.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.22.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.22.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.22.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.22.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.23.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.23.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.23.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.23.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.24.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.24.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.24.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.24.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.25.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.25.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.25.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.25.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.26.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.26.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.26.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.26.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.27.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.27.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.27.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.27.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.28.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.28.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.28.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.28.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.29.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.29.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.29.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.29.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.30.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.30.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.30.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.30.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.31.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.31.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.31.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.31.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.32.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.32.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.32.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.32.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.33.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.33.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.33.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.33.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.34.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.34.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.34.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.34.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.35.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.35.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.35.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.35.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.36.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.36.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.36.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.36.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.37.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.37.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.37.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.37.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.38.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.38.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.38.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.38.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.39.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.39.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.39.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.39.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.40.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.40.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.40.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.40.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.41.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.41.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.41.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.41.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.42.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.42.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.42.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.42.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.43.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.43.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.43.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.43.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.44.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.44.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.44.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.44.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.45.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.45.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.45.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.45.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.46.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.46.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.46.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.46.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for blocks.47.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.47.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.47.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.47.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 3584]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
        size mismatch for head.dense.weight: copying a param with shape torch.Size([12568, 3584]) from checkpoint, the shape in current model is torch.Size([50272, 3584]).
Load file time: 13.155 s
load 4 files using 1 procs
Load file time: 13.153 s
zhengmk321 commented 1 year ago

Just realized that EnergonAI does not support the OPT checkpoint files provided by Meta. If anyone is facing the same issue, use the ckpt files here. At least that resolves the tensor size mismatch for the opt_6.7B model.
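For anyone triaging a similar log, here is a minimal sketch that parses the `size mismatch` lines and reports, per parameter, how much larger each dimension is in the current model than in the checkpoint. The helper names (`parse_shape`, `mismatch_ratios`) and the `sample` string are made up for illustration and are not part of EnergonAI or PyTorch; the sample lines are copied from the log above. In this log every mismatched dimension differs by a factor of 4, which would be consistent with the checkpoint being split into four shards, though that is only a guess from the numbers and not something the log itself confirms.

```python
import re

# Matches one "size mismatch" line as printed by torch's load_state_dict.
PATTERN = re.compile(
    r"size mismatch for (?P<name>\S+): copying a param with shape "
    r"torch\.Size\(\[(?P<ckpt>[\d, ]+)\]\) from checkpoint, "
    r"the shape in current model is torch\.Size\(\[(?P<model>[\d, ]+)\]\)"
)


def parse_shape(text):
    """Turn '3584, 7168' into (3584, 7168)."""
    return tuple(int(part) for part in text.split(","))


def mismatch_ratios(log_text):
    """Map each parameter name to per-dimension ratios (model / checkpoint)."""
    ratios = {}
    for match in PATTERN.finditer(log_text):
        ckpt = parse_shape(match["ckpt"])
        model = parse_shape(match["model"])
        ratios[match["name"]] = tuple(m / c for c, m in zip(ckpt, model))
    return ratios


# A few lines copied from the log above (hypothetical variable name).
sample = """
size mismatch for blocks.10.mlp.dense_1.weight: copying a param with shape torch.Size([3584, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
size mismatch for blocks.10.mlp.dense_1.bias: copying a param with shape torch.Size([3584]) from checkpoint, the shape in current model is torch.Size([14336]).
size mismatch for blocks.11.attn.dense.weight: copying a param with shape torch.Size([7168, 896]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
size mismatch for head.dense.weight: copying a param with shape torch.Size([12568, 3584]) from checkpoint, the shape in current model is torch.Size([50272, 3584]).
"""

for name, ratio in mismatch_ratios(sample).items():
    print(name, ratio)
```

If the ratios are the same whole number across all parameters, the checkpoint and the model were probably built with different partitioning, so a differently prepared checkpoint is likely needed rather than a code change.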