hpcaitech / EnergonAI

Large-scale model inference.
Apache License 2.0

torch.load() hangs indefinitely when reading OPT pre-trained model weights #159

Open larry-fuy opened 2 years ago

larry-fuy commented 2 years ago

I'm trying to use the OPT 66B pre-trained model for inference on EnergonAI. After preprocessing the weights with the preprocessing_ckpt_66b.py script and starting the OPT server, the service hangs while loading the weights. I traced it back and found that it hangs in torch.load() after most of the weight files have been read (about 95% of the weights are loaded): https://github.com/hpcaitech/EnergonAI/blob/98a12bc2107b206017c4793380538f9cdec5a5e1/energonai/utils/checkpointing.py#L42 The output before the hang:

start loading /root/EnergonAI/ckpt/opt_66b/14-restored.pt...
                    INFO     colossalai - energonai - INFO: Rank1/0 model size = 17.395826688 GB                                                                                 
[10/04/22 02:21:27] INFO     colossalai - energonai - INFO: /root/miniconda3/lib/python3.9/site-packages/energonai/model/model_factory.py:195 create_pipeline_model              
                    INFO     colossalai - energonai - INFO: ==> Rank 2 built layer 0-64 / total 64                                                                               
                    INFO     colossalai - energonai - INFO: /root/miniconda3/lib/python3.9/site-packages/energonai/model/model_factory.py:200 create_pipeline_model              
                    INFO     colossalai - energonai - INFO: Rank2/0 model size = 17.395826688 GB                                                                                 
[10/04/22 02:21:27] INFO     colossalai - energonai - INFO: /root/miniconda3/lib/python3.9/site-packages/energonai/model/model_factory.py:195 create_pipeline_model              
[10/04/22 02:21:27] INFO     colossalai - energonai - INFO: /root/miniconda3/lib/python3.9/site-packages/energonai/model/model_factory.py:195 create_pipeline_model              
                    INFO     colossalai - energonai - INFO: ==> Rank 5 built layer 0-64 / total 64                                                                               
[10/04/22 02:21:27] INFO     colossalai - energonai - INFO: /root/miniconda3/lib/python3.9/site-packages/energonai/model/model_factory.py:195 create_pipeline_model              
                    INFO     colossalai - energonai - INFO: ==> Rank 7 built layer 0-64 / total 64                                                                               
                    INFO     colossalai - energonai - INFO: ==> Rank 6 built layer 0-64 / total 64                                                                               
[10/04/22 02:21:27] INFO     colossalai - energonai - INFO: /root/miniconda3/lib/python3.9/site-packages/energonai/model/model_factory.py:195 create_pipeline_model              
                    INFO     colossalai - energonai - INFO: ==> Rank 4 built layer 0-64 / total 64                                                                               
                    INFO     colossalai - energonai - INFO: /root/miniconda3/lib/python3.9/site-packages/energonai/model/model_factory.py:200 create_pipeline_model              
                    INFO     colossalai - energonai - INFO: /root/miniconda3/lib/python3.9/site-packages/energonai/model/model_factory.py:200 create_pipeline_model              
                    INFO     colossalai - energonai - INFO: Rank5/0 model size = 17.395826688 GB                                                                                 
                    INFO     colossalai - energonai - INFO: Rank7/0 model size = 17.395826688 GB                                                                                 
                    INFO     colossalai - energonai - INFO: /root/miniconda3/lib/python3.9/site-packages/energonai/model/model_factory.py:200 create_pipeline_model              
                    INFO     colossalai - energonai - INFO: /root/miniconda3/lib/python3.9/site-packages/energonai/model/model_factory.py:200 create_pipeline_model              
[10/04/22 02:21:27] INFO     colossalai - energonai - INFO: /root/miniconda3/lib/python3.9/site-packages/energonai/model/model_factory.py:195 create_pipeline_model              
                    INFO     colossalai - energonai - INFO: Rank4/0 model size = 17.395826688 GB                                                                                 
                    INFO     colossalai - energonai - INFO: Rank6/0 model size = 17.395826688 GB                                                                                 
                    INFO     colossalai - energonai - INFO: ==> Rank 3 built layer 0-64 / total 64                                                                               
                    INFO     colossalai - energonai - INFO: /root/miniconda3/lib/python3.9/site-packages/energonai/model/model_factory.py:200 create_pipeline_model              
                    INFO     colossalai - energonai - INFO: Rank3/0 model size = 17.395826688 GB                        

By the way, if I run only the code block around torch.load() locally, all the weights load successfully.
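
To narrow down where each rank is actually stuck inside the server, one option is to wrap the load call in a watchdog that dumps every thread's stack if the call takes too long. The following is only a minimal sketch using Python's standard faulthandler module; the names hang_watchdog and load_with_watchdog and the 300-second timeout are hypothetical and are not part of EnergonAI.

```python
import contextlib
import faulthandler
import sys


@contextlib.contextmanager
def hang_watchdog(timeout_s, file=None):
    """Dump all thread tracebacks to `file` every `timeout_s` seconds
    for as long as the wrapped block is still running."""
    target = sys.stderr if file is None else file
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=target)
    try:
        yield
    finally:
        # Disarm the watchdog once the block finishes (or raises).
        faulthandler.cancel_dump_traceback_later()


def load_with_watchdog(load_fn, path, timeout_s=300.0, file=None):
    """Call load_fn(path) under the watchdog (hypothetical helper).

    load_fn is whatever loader is under suspicion, for example
    lambda p: torch.load(p, map_location="cpu").
    """
    print(f"start loading {path}...", flush=True)
    with hang_watchdog(timeout_s, file=file):
        return load_fn(path)
```

If the call hangs, the dumped stacks show which frame each thread is blocked in, which should help tell a stall inside torch.load() itself apart from a process that is waiting on something else, such as another rank.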

ver217 commented 2 years ago

Hi, could you try the latest code?