microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

Models with unused parameters hang with ZeRO stage 2 (iGPT) #777

Closed · SeanNaren closed this issue 3 years ago

SeanNaren commented 3 years ago

I've made an issue for this on the PL side with a reproduction here: https://github.com/PyTorchLightning/pytorch-lightning/issues/6064. This model also fails to train in DDP mode if the find_unused_parameters flag is set to False.
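Here's a minimal sketch of the kind of module I mean (not the actual iGPT model from the linked issue): a parameter requires gradients but never participates in the forward pass, so under DDP with find_unused_parameters=False the reducer waits for a gradient that never arrives and the first backward never completes.

```python
import torch
import torch.nn as nn

class ModelWithUnusedParam(nn.Module):
    def __init__(self):
        super().__init__()
        self.used = nn.Linear(16, 16)
        # Requires gradients, but is never called in forward(), so autograd
        # never produces a gradient for it.
        self.unused = nn.Linear(16, 16)

    def forward(self, x):
        return self.used(x)
```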

This is similar to a fix that was added to FairScale's ShardedDDP to handle the same situation: https://github.com/facebookresearch/fairscale/pull/223, where a test for this model was eventually added.

It seems there are certain cases where, even though a parameter has been marked as requiring gradients, autograd never fires its hook. This means the code hangs on the first training step, as we wait for gradients to be reduced within a bucket (at least that's my understanding).
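A quick diagnostic I've been using (my own snippet, not DeepSpeed code): after a backward pass, list the parameters that were marked as requiring gradients but never received one. Those are exactly the parameters whose reduction hook never fires, which is what the bucket ends up waiting on.

```python
import torch

model = ModelWithUnusedParam()  # the sketch module from above
model(torch.randn(4, 16)).sum().backward()

for name, param in model.named_parameters():
    if param.requires_grad and param.grad is None:
        # e.g. unused.weight / unused.bias never get a gradient
        print(f"no gradient produced for: {name}")
```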

More than happy to assist in fixing this, if someone from the DeepSpeed team can help me figure out where to look!

cc @stas00 who may have run into this issue

stas00 commented 3 years ago

I haven't run into this (yet?), perhaps since we have find_unused_parameters turned on by default under DDP, though it can be overridden by users.
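For reference, that's the standard flag on the DDP wrapper; a minimal single-process sketch of where it's passed (not our actual integration code):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process "world" just to make the sketch runnable; real runs use
# torchrun / multiple ranks.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = torch.nn.Linear(16, 16)
# With find_unused_parameters=True the reducer walks the autograd graph of the
# outputs and marks non-participating parameters as ready, avoiding the hang
# at the cost of extra overhead per iteration.
ddp_model = DDP(model, find_unused_parameters=True)
```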

SeanNaren commented 3 years ago

> I haven't run into this (yet?), perhaps since we have find_unused_parameters turned on by default under DDP, though it can be overridden by users.

Definitely a good thing! The DeepSpeed engine, IIRC, handles all the communication itself, so there's no DDP wrapper internally since the engine makes all the collective calls. Need to double-check though.

stas00 commented 3 years ago

Yes, I was commenting only on the part of your comment where you were using DDP.

Originally, I was wrapping DeepSpeed in DDP, until I realized that this wasn't needed at all, so it was removed.

As DeepSpeed is quite self-contained, it's actually more difficult to integrate correctly into a trainer that supports many other engines, since one needs to remember to skip many operations when DeepSpeed is enabled. For example, my most recent discovery was not to call lr_scheduler.step(), since DeepSpeed already steps the scheduler as part of optimizer.step().
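The training step therefore ends up looking roughly like this (a sketch with made-up names such as training_step and model_engine, not the actual trainer code):

```python
def training_step(batch, model_engine=None, model=None, optimizer=None, lr_scheduler=None):
    """One step of a trainer that supports both DeepSpeed and plain PyTorch.
    `model_engine` is assumed to be the engine returned by deepspeed.initialize()."""
    if model_engine is not None:
        loss = model_engine(batch).mean()
        model_engine.backward(loss)
        # optimizer.step() AND the lr scheduler step happen inside engine.step()
        model_engine.step()
    else:
        loss = model(batch).mean()
        loss.backward()
        optimizer.step()
        lr_scheduler.step()   # only outside DeepSpeed do we step the scheduler ourselves
        optimizer.zero_grad()
    return loss
```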

SeanNaren commented 3 years ago

This was fixed in later DeepSpeed versions :)