larry-fuy opened 2 years ago
I'm trying to use the OPT 66B pre-trained model for inference on EnergonAI. After preprocessing the weights with the preprocessing_ckpt_66b.py script and starting the OPT server, the service hangs while loading the weights. I traced it back and found that it hangs in torch.load() after reading most of the weight files (about 95% of the weights are loaded): https://github.com/hpcaitech/EnergonAI/blob/98a12bc2107b206017c4793380538f9cdec5a5e1/energonai/utils/checkpointing.py#L42 The output before hanging:
start loading /root/EnergonAI/ckpt/opt_66b/14-restored.pt...
INFO colossalai - energonai - INFO: Rank1/0 model size = 17.395826688 GB
[10/04/22 02:21:27] INFO colossalai - energonai - INFO: /root/miniconda3/lib/python3.9/site-packages/energonai/model/model_factory.py:195 create_pipeline_model
INFO colossalai - energonai - INFO: ==> Rank 2 built layer 0-64 / total 64
INFO colossalai - energonai - INFO: /root/miniconda3/lib/python3.9/site-packages/energonai/model/model_factory.py:200 create_pipeline_model
INFO colossalai - energonai - INFO: Rank2/0 model size = 17.395826688 GB
[10/04/22 02:21:27] INFO colossalai - energonai - INFO: /root/miniconda3/lib/python3.9/site-packages/energonai/model/model_factory.py:195 create_pipeline_model
[10/04/22 02:21:27] INFO colossalai - energonai - INFO: /root/miniconda3/lib/python3.9/site-packages/energonai/model/model_factory.py:195 create_pipeline_model
INFO colossalai - energonai - INFO: ==> Rank 5 built layer 0-64 / total 64
[10/04/22 02:21:27] INFO colossalai - energonai - INFO: /root/miniconda3/lib/python3.9/site-packages/energonai/model/model_factory.py:195 create_pipeline_model
INFO colossalai - energonai - INFO: ==> Rank 7 built layer 0-64 / total 64
INFO colossalai - energonai - INFO: ==> Rank 6 built layer 0-64 / total 64
[10/04/22 02:21:27] INFO colossalai - energonai - INFO: /root/miniconda3/lib/python3.9/site-packages/energonai/model/model_factory.py:195 create_pipeline_model
INFO colossalai - energonai - INFO: ==> Rank 4 built layer 0-64 / total 64
INFO colossalai - energonai - INFO: /root/miniconda3/lib/python3.9/site-packages/energonai/model/model_factory.py:200 create_pipeline_model
INFO colossalai - energonai - INFO: /root/miniconda3/lib/python3.9/site-packages/energonai/model/model_factory.py:200 create_pipeline_model
INFO colossalai - energonai - INFO: Rank5/0 model size = 17.395826688 GB
INFO colossalai - energonai - INFO: Rank7/0 model size = 17.395826688 GB
INFO colossalai - energonai - INFO: /root/miniconda3/lib/python3.9/site-packages/energonai/model/model_factory.py:200 create_pipeline_model
INFO colossalai - energonai - INFO: /root/miniconda3/lib/python3.9/site-packages/energonai/model/model_factory.py:200 create_pipeline_model
[10/04/22 02:21:27] INFO colossalai - energonai - INFO: /root/miniconda3/lib/python3.9/site-packages/energonai/model/model_factory.py:195 create_pipeline_model
INFO colossalai - energonai - INFO: Rank4/0 model size = 17.395826688 GB
INFO colossalai - energonai - INFO: Rank6/0 model size = 17.395826688 GB
INFO colossalai - energonai - INFO: ==> Rank 3 built layer 0-64 / total 64
INFO colossalai - energonai - INFO: /root/miniconda3/lib/python3.9/site-packages/energonai/model/model_factory.py:200 create_pipeline_model
INFO colossalai - energonai - INFO: Rank3/0 model size = 17.395826688 GB
By the way, if I run just the code block around torch.load() locally, all of the weight files load successfully through torch.load().
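For reference, here is roughly the standalone check I ran (a minimal sketch; the checkpoint directory and the `*-restored.pt` glob pattern are assumptions from my setup). It calls the same torch.load() that hangs at checkpointing.py#L42, but on each shard file directly:

```python
import glob
import torch

# Load each preprocessed shard outside the server to rule out corrupt files.
# The directory and file pattern below are from my setup; adjust as needed.
ckpt_dir = '/root/EnergonAI/ckpt/opt_66b'
for path in sorted(glob.glob(f'{ckpt_dir}/*-restored.pt')):
    print(f'start loading {path}...')
    state_dict = torch.load(path, map_location='cpu')  # the call that hangs inside the server
    print(f'  loaded {len(state_dict)} entries OK')
```

Run this way, every shard loads without hanging.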
Hi, could you try the latest code?