m3rlin45 opened 2 years ago
I'm doing some debugging on my own; it seems the proximate cause is that HuggingFace's T5 implementation does not include `embed_tokens` in the return values of `module.named_parameters()`, which means that DeepSpeed is oblivious to the embeddings. I'm not sure why this works at all in that case, with part of the model not being sharded. Maybe the problem only appears at save time?
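For context, this behavior is easy to observe directly (a minimal sketch, assuming `transformers` is installed and using the public `t5-small` checkpoint): PyTorch's `named_parameters()` deduplicates tied parameters by default, so the shared embedding is reported once under `shared.weight`, and the `encoder.embed_tokens` / `decoder.embed_tokens` aliases never appear as separate entries.

```python
# Minimal sketch: T5's tied embeddings are deduplicated by
# named_parameters(), so the embed_tokens aliases never show up.
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("t5-small")
names = {n for n, _ in model.named_parameters()}

print("shared.weight" in names)                # True: reported once here
print("encoder.embed_tokens.weight" in names)  # False: alias of shared
print("decoder.embed_tokens.weight" in names)  # False: alias of shared
```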
@m3rlin45, thanks for sharing this issue and your analysis. A couple of thoughts:

1. If `embed_tokens` are shared parameters, then it is possible that the parameters are still optimized and sharded; see #1896.
2. If `embed_tokens` are not trainable, then they should not be sharded anyway.

Do any of the above apply in this case? Thanks!
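Both questions can be answered with a quick check (a minimal sketch; the `shared` and `embed_tokens` attribute names come from the HF T5 implementation):

```python
# Minimal sketch: check the two cases above for a stock HF T5 model.
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Case 1: encoder/decoder embed_tokens point at the same weight as
# model.shared, i.e. they are shared parameters.
print(model.encoder.embed_tokens.weight is model.shared.weight)  # True
print(model.decoder.embed_tokens.weight is model.shared.weight)  # True

# Case 2: the shared embedding is trainable by default.
print(model.shared.weight.requires_grad)  # True
```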
**Describe the bug**
When HuggingFace T5 models are checkpointed, the `embed_tokens` modules in both the encoder and decoder are not included.
This is also captured in https://github.com/PyTorchLightning/pytorch-lightning/issues/10964
**Expected behavior**
I expect all model parameters to be included in checkpoints.
**ds_report output**
**System info (please complete the following information):**
- Single DGX-1 with V100 GPUs
- Ubuntu 20.04.2 LTS
- Python 3.7
**Launcher context**
Launching with accelerate
**To Reproduce**
I have a simple script that I've used to reproduce this issue on a single DGX-1 node. I'm using HF accelerate to run it on all 8 GPUs, but any similar launcher should work.
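The symptom itself can be verified after a run (a sketch; `checkpoint/pytorch_model.bin` is a placeholder for whatever path your save step produces) by loading the saved state_dict and looking for the embedding keys:

```python
# Sketch: inspect a saved checkpoint for the embedding weights.
# "checkpoint/pytorch_model.bin" is a placeholder path.
import torch

state_dict = torch.load("checkpoint/pytorch_model.bin", map_location="cpu")
emb_keys = [k for k in state_dict
            if "embed_tokens" in k or k.startswith("shared")]
print(emb_keys)  # on affected runs this comes back empty
```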