hpcaitech / EnergonAI

Large-scale model inference.
Apache License 2.0

Failed to load OPT-30B checkpoint #183

Closed: ericxsun closed this issue 1 year ago

ericxsun commented 1 year ago

System

OPT-30B

checkpoints can be found here: https://github.com/facebookresearch/metaseq/tree/main/projects/OPT

OPT-30B 30B part0, part1
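
Before serving, it can help to inspect what the downloaded shards actually contain and compare the parameter names and shapes against what the model expects. A minimal sketch (not part of EnergonAI; the helper name and the shard filename are assumptions, and metaseq shards may nest the weights under a `model` key):

```python
import torch

def summarize_state_dict(state):
    """Return (name, shape) pairs for every tensor in a state dict."""
    return [(name, tuple(t.shape)) for name, t in state.items()]

# Hypothetical usage -- point the path at wherever part0/part1 were saved:
# ckpt = torch.load("reshard-model_part-0.pt", map_location="cpu")
# state = ckpt.get("model", ckpt)
# for name, shape in summarize_state_dict(state):
#     print(name, shape)
```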

Start the FastAPI server

cd EnergonAI/examples/opt

CUDA_VISIBLE_DEVICES=0,1 CUDA_HOME=/usr/local/cuda-11.3 LD_LIBRARY_PATH=/usr/local/cuda-11.3/lib64 python opt_fastapi.py opt-30b --tp 2

We got the following logs:

[W socket.cpp:558] [c10d] The client socket has failed to connect to [localhost]:19990 (errno: 99 - Cannot assign requested address).
[12/29/22 16:23:12] INFO     colossalai - colossalai - INFO:                                                                             
                             python3.8/site-packages/colossalai/context/parallel_context.py:521       
                             set_device                                                                                                  
                    INFO     colossalai - colossalai - INFO: process rank 0 is bound to device 0                                         
[12/29/22 16:23:12] INFO     colossalai - colossalai - INFO:                                                                             
                             python3.8/site-packages/colossalai/context/parallel_context.py:521       
                             set_device                                                                                                  
                    INFO     colossalai - colossalai - INFO: process rank 1 is bound to device 1                                         
[12/29/22 16:23:14] INFO     colossalai - colossalai - INFO:                                                                             
                             python3.8/site-packages/colossalai/context/parallel_context.py:557       
                             set_seed                                                                                                    
[12/29/22 16:23:14] INFO     colossalai - colossalai - INFO:                                                                             
                             python3.8/site-packages/colossalai/context/parallel_context.py:557       
                             set_seed                                                                                                    
                    INFO     colossalai - colossalai - INFO: initialized seed on rank 0, numpy: 1024, python random: 1024,               
                             ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1024,the default parallel seed is ParallelMode.DATA.          
                    INFO     colossalai - colossalai - INFO: initialized seed on rank 1, numpy: 1024, python random: 1024,               
                             ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1025,the default parallel seed is ParallelMode.DATA.          
                    INFO     colossalai - colossalai - INFO:                                                                             
                             python3.8/site-packages/colossalai/initialize.py:117 launch              
                    INFO     colossalai - colossalai - INFO: Distributed environment is initialized, data parallel size: 1, pipeline     
                             parallel size: 1, tensor parallel size: 2                                                                   
[12/29/22 16:23:17] INFO     colossalai - energonai - INFO:                                                                              
                             python3.8/site-packages/energonai/model/model_factory.py:195             
                             create_pipeline_model                                                                                       
                    INFO     colossalai - energonai - INFO: ==> Rank 0 built layer 0-48 / total 48                                       
                    INFO     colossalai - energonai - INFO:                                                                              
                             python3.8/site-packages/energonai/model/model_factory.py:200             
                             create_pipeline_model                                                                                       
                    INFO     colossalai - energonai - INFO: Rank0/0 model size = 30.7120128 GB                                           
load 2 files using 1 procs
[12/29/22 16:23:17] INFO     colossalai - energonai - INFO:                                                                              
                             python3.8/site-packages/energonai/model/model_factory.py:195             
                             create_pipeline_model                                                                                       
                    INFO     colossalai - energonai - INFO: ==> Rank 1 built layer 0-48 / total 48                                       
                    INFO     colossalai - energonai - INFO:                                                                              
                             python3.8/site-packages/energonai/model/model_factory.py:200             
                             create_pipeline_model                                                                                       
                    INFO     colossalai - energonai - INFO: Rank1/0 model size = 30.7120128 GB                                           
Load file time: 42.683 s
Load file time: 42.661 s

Then, about 10 minutes later, the error occurred:

size mismatch

raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for PipelineModel:
        size mismatch for blocks.0.attn.dense.weight: copying a param with shape torch.Size([7168, 1792]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.0.mlp.dense_1.weight: copying a param with shape torch.Size([7168, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.0.mlp.dense_1.bias: copying a param with shape torch.Size([7168]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.0.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 7168]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
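
The reported shapes look like a tensor-parallel degree mismatch: with hidden size 7168, a tp=2 rank expects [7168, 3584] for `attn.dense.weight`, whereas [7168, 1792] is what a 4-way split would produce. A small illustration of that arithmetic (generic tensor-parallel sharding, not EnergonAI's actual loader):

```python
import torch

# The full attn.dense.weight for OPT-30B is [7168, 7168]. Splitting it
# 2 ways along the input dim gives the [7168, 3584] the tp=2 model
# expects; a 4-way split gives the [7168, 1792] seen in the checkpoint.
full = torch.empty(7168, 7168)
tp2_shard = torch.chunk(full, 2, dim=1)[0]
tp4_shard = torch.chunk(full, 4, dim=1)[0]
print(tuple(tp2_shard.shape))  # (7168, 3584)
print(tuple(tp4_shard.shape))  # (7168, 1792)
```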

Missing keys in state_dict:

RuntimeError: Error(s) in loading state_dict for PipelineModel:
        Missing key(s) in state_dict: "blocks.0.attn.query_.weight", "blocks.0.attn.query_.bias", "blocks.0.attn.key_.weight", "blocks.0.attn.key_.bias", "blocks.0.attn.value_.weight", "blocks.0.attn.value_.bias", "

Unexpected key(s) in state_dict:

Unexpected key(s) in state_dict: "blocks.0.self_attn.qkv_proj.weight", "blocks.0.self_attn.qkv_proj.bias", "blocks.1.self_attn.qkv_proj.weight", "blocks.1.self_attn.qkv_proj.bias", "blocks.2.self_attn.qkv_proj.weight",

So what's wrong? Or should some pre-processing be done first, as with the 66B/175B checkpoints?
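
In case it helps: the missing/unexpected keys suggest the metaseq checkpoint stores the attention projections fused as `self_attn.qkv_proj`, while EnergonAI's PipelineModel expects separate `attn.query_`/`attn.key_`/`attn.value_` tensors. A hypothetical remapping sketch (key names are taken from the error logs above; the fused q/k/v ordering and the chunk dimension are assumptions that would need checking against metaseq):

```python
import torch

def split_qkv(state_dict):
    """Split fused self_attn.qkv_proj.{weight,bias} tensors into separate
    attn.query_/key_/value_ entries, leaving all other keys untouched.
    Assumes the fused tensor is [q; k; v] stacked along dim 0."""
    out = {}
    for key, tensor in state_dict.items():
        if "self_attn.qkv_proj." in key:
            prefix, suffix = key.split("self_attn.qkv_proj.")
            q, k, v = torch.chunk(tensor, 3, dim=0)
            out[prefix + "attn.query_." + suffix] = q
            out[prefix + "attn.key_." + suffix] = k
            out[prefix + "attn.value_." + suffix] = v
        else:
            out[key] = tensor
    return out
```

Renaming alone would not fix the size mismatches, though; the shard shapes also have to match the tensor-parallel partitioning the serving model expects.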

Thank you so much.

ver217 commented 1 year ago

Hi, you may try these weights https://huggingface.co/facebook/opt-30b/tree/main

ericxsun commented 1 year ago

Hi, you may try these weights https://huggingface.co/facebook/opt-30b/tree/main

Yes, I've tried. But I'm confused by the weird results (even messy generated text); see https://github.com/facebookresearch/metaseq/issues/595