Santosh-Gupta opened this issue 3 years ago
I ran into the same problem. Have you found any solutions?
I trained and called save_checkpoint using 4 GPUs, but when I tried load_checkpoint using 1 GPU, I hit the same issue. I suspect that ZeRO-3 partitions the model and saves each rank's shard of the weights, since 882700288 / 220675072 = 4, which matches my GPU count.
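If that's what's happening, one workaround (just a sketch based on this assumption, not something I've verified against the original setup) is to consolidate the partitioned ZeRO-3 checkpoint into a single fp32 state dict with DeepSpeed's zero_to_fp32 utilities and then load it as a plain PyTorch model. MyModel and the checkpoint path here are placeholders:

```python
# Sketch: merge the per-rank ZeRO-3 shards into one fp32 state dict,
# then load it without a DeepSpeed engine (e.g. on a single GPU).
# "checkpoints/" and MyModel are placeholders for your own path and model class.
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

state_dict = get_fp32_state_dict_from_zero_checkpoint("checkpoints/")  # uses the "latest" tag by default
model = MyModel()                    # same architecture that was trained
model.load_state_dict(state_dict)    # plain torch load, no partitioning involved
model.eval()
```

Recent DeepSpeed versions also write a zero_to_fp32.py script into the checkpoint directory that performs the same conversion from the command line.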
During training, I would periodically save a checkpoint using model_engine.save_checkpoint. However, model_engine.load_checkpoint results in this output. This is the main code I use for training.
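For reference, the usual shape of that save loop is roughly the following; the data loader, directory, tag, and save interval are placeholders rather than the exact values from my notebook:

```python
# Rough sketch of the periodic-save pattern (placeholder loader, path, and interval);
# model_engine comes from deepspeed.initialize().
for step, batch in enumerate(data_loader):
    loss = model_engine(batch)        # assumes the wrapped model returns the loss
    model_engine.backward(loss)
    model_engine.step()

    if step % 1000 == 0:
        # Under ZeRO-3 every rank must make this call, because each rank only
        # holds its own partition of the parameters and optimizer state.
        model_engine.save_checkpoint("checkpoints/", tag=f"step_{step}")
```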
And this was the code where I load a checkpoint. I am using a Jupyter notebook, so the output printed above only contains the output for this line.
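Schematically, the load side looks roughly like this (again with placeholder paths, config, and tag; the engine has to be built with the same ZeRO settings that wrote the checkpoint):

```python
# Sketch of the load side; ds_config.json, model, and the tag are placeholders.
import deepspeed

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config="ds_config.json",
)

# load_checkpoint returns (load_path, client_state); load_path is None on failure.
load_path, client_state = model_engine.load_checkpoint("checkpoints/", tag="step_1000")
if load_path is None:
    print("checkpoint could not be loaded")
```

My understanding is that with ZeRO-3 this only succeeds when the world size matches the one used at save time.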
Here is the full code, which is basically the same as the training code, plus the checkpoint loading, and with deepspeed.init_distributed(dist_backend='nccl') commented out. That call results in an error, probably because I am not using the DeepSpeed launcher, but I don't believe it's necessary since I am only doing inference.
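One possible workaround for running without the launcher (an assumption on my part, not something taken from the original error output): set the environment variables the DeepSpeed launcher would normally export, so that init_distributed can bring up a one-process group inside the notebook:

```python
# Sketch: fake the launcher's environment for a single-process notebook session.
import os
import deepspeed

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
os.environ.setdefault("RANK", "0")
os.environ.setdefault("WORLD_SIZE", "1")
os.environ.setdefault("LOCAL_RANK", "0")

deepspeed.init_distributed(dist_backend="nccl")  # now initializes a 1-rank process group
```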