afiaka87 opened this issue 3 years ago
It seems my checkpoint wasn't being saved properly. Maybe this is related to DeepSpeed requiring you to use its own methods to save and load PyTorch models?
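To be concrete about what I mean by "its own methods", roughly this (just a sketch, not the actual trainer code; the stand-in model, config path, directory, and tag are placeholders):

```python
import deepspeed
import torch

# Stand-in model, only for illustration; not the actual DALLE model.
model = torch.nn.Linear(8, 8)

# Plain PyTorch checkpointing (roughly what the inference checkpoints do today):
torch.save(model.state_dict(), "dalle.pt")
model.load_state_dict(torch.load("dalle.pt"))

# With DeepSpeed (especially ZeRO offloading/partitioning), the engine has to do
# the saving/loading itself so it can gather the partitioned parameters and
# optimizer state. "ds_config.json", "checkpoints", and "latest" are placeholders.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config="ds_config.json",
)
model_engine.save_checkpoint("checkpoints", tag="latest")
model_engine.load_checkpoint("checkpoints", tag="latest")
```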
Yeah, we can't avoid that now if we want to support offloading and partitioning. I'll fix it.
@janEbert Thanks!
@lucidrains I'd rework checkpointing for this so we always (no matter whether distributed or not) save checkpoints that are resumable for training (i.e. including the optimizer state) instead of only inference checkpoints, which is the current behavior.
Is that fine with you, or would you rather keep the old behavior? I could work around it, but it would be less clean.
EDIT: Actually, never mind, it's not as cleanly solvable as I thought. I'd still suggest you think about whether you'd like to save/restore the optimizer state, though!
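For reference, the difference I mean is roughly this (a plain-PyTorch sketch, not the current save format; the key names, paths, and stand-in model are made up):

```python
import torch

# Stand-in model/optimizer just to show the shape of a resumable checkpoint.
model = torch.nn.Linear(8, 8)
optimizer = torch.optim.Adam(model.parameters())

# A training-resumable checkpoint carries everything needed to continue
# training, not just the weights needed for inference.
torch.save({
    "epoch": 3,
    "weights": model.state_dict(),
    "optimizer": optimizer.state_dict(),
}, "dalle_resume.pt")

# Resuming:
ckpt = torch.load("dalle_resume.pt")
model.load_state_dict(ckpt["weights"])
optimizer.load_state_dict(ckpt["optimizer"])
start_epoch = ckpt["epoch"] + 1
```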
For now, you can use the janEbert/deepspeed branch, which has temporary fixes for partitioned models. (The calls may still fail if the VAE is partitioned.)
I've also realized some underlying issues, which is why I'm not PRing the fix yet. We'll need to split the VAE out from the DALLE model. You also can't load a DeepSpeed checkpoint of the VAE for use in the DALLE model, because there is no way to "merge" the VAE into the DALLE checkpoint.
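Roughly the direction I mean, as a sketch only; this does not work as-is today, and the constructor arguments, configs, and directories are placeholders:

```python
import deepspeed
from dalle_pytorch import DiscreteVAE, DALLE

# Each model gets its own engine, its own config, and its own checkpoint
# directory, instead of the VAE being saved inside the DALLE checkpoint.
vae = DiscreteVAE(image_size=256, num_layers=3, num_tokens=8192, codebook_dim=512)
dalle = DALLE(dim=512, vae=vae, num_text_tokens=10000, text_seq_len=256, depth=2, heads=8)

vae_engine, _, _, _ = deepspeed.initialize(
    model=vae, model_parameters=vae.parameters(), config="ds_vae_config.json"
)
# (In practice the VAE parameters inside DALLE would need to stay frozen/excluded.)
dalle_engine, _, _, _ = deepspeed.initialize(
    model=dalle, model_parameters=dalle.parameters(), config="ds_dalle_config.json"
)

vae_engine.save_checkpoint("checkpoints/vae")
dalle_engine.save_checkpoint("checkpoints/dalle")

# Resuming means loading each checkpoint into its own engine; there is no way
# to "merge" the VAE checkpoint into the DALLE one.
vae_engine.load_checkpoint("checkpoints/vae")
dalle_engine.load_checkpoint("checkpoints/dalle")
```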
@lucidrains can we get your eyeballs on this? Could use guidance on how to store optimizer state etc.
The possible issue I'm seeing is that DeepSpeed does not handle multiple ZeRO-enabled models at once for whatever reason (e.g. both wanting to take all GPU memory, unhandled shared global state, ...). I'm not sure, though; I'd need to look into it. If, and only if, that's the case, splitting the models wouldn't help either. I haven't figured out how to handle that case quite yet. :)
Partly solved by #231.
I'm getting size mismatches on the entire checkpoint, this sort of thing, whenever I resume from a checkpoint. Were the checkpoint keys changed recently?
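For anyone else hitting this, a quick way to see which keys/shapes changed between two checkpoints (a diagnostic sketch; the file names and the "weights" key are assumptions about the save format, adjust to whatever your checkpoints actually contain):

```python
import torch

old = torch.load("dalle_old.pt", map_location="cpu")
new = torch.load("dalle_new.pt", map_location="cpu")

# If the checkpoint is a dict wrapping the state dict under "weights", unwrap it;
# otherwise assume the file is the state dict itself.
old_w = old.get("weights", old)
new_w = new.get("weights", new)

for k in sorted(old_w.keys() | new_w.keys()):
    if k not in new_w:
        print("only in old checkpoint:", k)
    elif k not in old_w:
        print("only in new checkpoint:", k)
    elif old_w[k].shape != new_w[k].shape:
        print("size mismatch:", k, tuple(old_w[k].shape), "->", tuple(new_w[k].shape))
```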