jiangjiadi opened this issue 1 month ago
Thanks for the reproducer, looking into it now
Hi @jiangjiadi, I have spent some time looking into this and was able to reproduce the issue. Interestingly, the script works if you never initialize the model on the meta device.
Also note from the official PyTorch docs:

> As of PyTorch 1.12, FSDP only offers limited support for shared parameters (for example, setting one Linear layer's weight to another's). In particular, modules that share parameters must be wrapped as part of the same FSDP unit. If enhanced shared parameter support is needed for your use case, please ping https://github.com/pytorch/pytorch/issues/77724
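For reference, the shared-parameter pattern the docs describe is exactly what weight tying produces. A toy sketch (hypothetical model, not the reproducer from this issue):

```python
import torch.nn as nn

class TiedLM(nn.Module):
    """Toy model with tied input/output embeddings, mirroring tie_word_embeddings=True."""
    def __init__(self, vocab_size: int, hidden: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.lm_head = nn.Linear(hidden, vocab_size, bias=False)
        self.lm_head.weight = self.embed.weight  # shared parameter

# Per the docs quoted above, `embed` and `lm_head` must land in the SAME FSDP
# unit; an auto-wrap policy that splits them into separate units is
# unsupported and can produce exactly this kind of hang.
```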
I will keep investigating and let you know.
Hi @younesbelkada, thank you for looking into this issue. I appreciate the prompt response and look forward to any updates.
Additionally, I've noticed that when the `from_config` method is called with DeepSpeed's ZeRO-3 enabled, the model gets pre-partitioned at creation time. Could a similar approach be adopted for FSDP initialization? Pre-partitioning the model at definition could help mitigate OOM issues when training large models.
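For comparison, this is roughly how the ZeRO-3 path pre-partitions at creation today (a sketch assuming DeepSpeed is installed; the model name and the minimal config dict are placeholders):

```python
from transformers import AutoConfig, AutoModelForCausalLM
from transformers.integrations import HfDeepSpeedConfig

ds_config = {
    "zero_optimization": {"stage": 3},
    "train_micro_batch_size_per_gpu": 1,
}
# Must be created before the model and kept alive: it switches transformers
# into deepspeed.zero.Init mode, so weights are sharded as each submodule is
# built instead of being materialized in full first.
dschf = HfDeepSpeedConfig(ds_config)

config = AutoConfig.from_pretrained("Qwen/Qwen1.5-0.5B")  # placeholder model
model = AutoModelForCausalLM.from_config(config)  # parameters already partitioned
```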
cc @muellerzr @SunMarc
System Info
`transformers` version: 4.41.2

Who can help?
text model: @ArthurZucker and @younesbelkada
Information

Tasks

An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)

Reproduction
Run command:

```
torchrun --nproc_per_node 2 test_fsdp.py
```
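The script itself did not survive into this thread. A minimal sketch of what `test_fsdp.py` plausibly contains, based on the description above (meta-device init via `from_config`, FSDP wrapping, the `tie_word_embeddings` toggle; the model name is a placeholder):

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from transformers import AutoConfig, AutoModelForCausalLM

def main():
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    config = AutoConfig.from_pretrained("Qwen/Qwen1.5-0.5B")  # placeholder model
    config.tie_word_embeddings = True  # False: runs fine; True: rank 1 hangs

    # Building on the meta device is the step that triggers the hang
    # (per the discussion above, skipping this context makes the script work).
    with torch.device("meta"):
        model = AutoModelForCausalLM.from_config(config)

    model = FSDP(
        model,
        device_id=torch.cuda.current_device(),
        # Materialize meta tensors on the local GPU before sharding.
        param_init_fn=lambda m: m.to_empty(
            device=torch.cuda.current_device(), recurse=False
        ),
    )

    if rank == 0:
        print(model)
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```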
Expected behavior
When `tie_word_embeddings=False` is set, the code behaves normally. However, when I set `tie_word_embeddings=True`, rank 0 exits normally but rank 1 gets stuck. The point where it gets stuck is shown in the following image. (When using accelerate, the behavior is the same.)

![image](https://github.com/huggingface/transformers/assets/34134495/7d51e69f-c846-4a9b-bea0-a0f455ddcd30)