cc @pacman100
Hello, FSDP works on multi-node setups; we have trained a 70B Llama model on 2 nodes with 8 A100 GPUs each. For more details, you can refer to the blog post https://huggingface.co/blog/ram-efficient-pytorch-fsdp
Thank you! This is a great blog post; let me first try to reproduce it.
I am encountering an issue where I cannot train on multiple nodes using the provided FSDP example: the processes hang.
My environment consists of:
System Info

Information

Tasks

- no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)

Reproduction
I have been using the examples/nlp_example.py script from the accelerate project, with batch_size modified to 1.
accelerate config:
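(The contents of the config file were not captured above. As a rough sketch, a two-node FSDP configuration produced by `accelerate config` typically looks like the following; the IP address, port, and process counts are illustrative assumptions, not the reporter's actual values.)

```yaml
# Illustrative multi-node FSDP config (assumed values) -- node 0 of 2
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch_policy: BACKWARD_PRE
  fsdp_offload_params: false
  fsdp_sharding_strategy: 1
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
machine_rank: 0               # set to 1 on the second node
main_process_ip: 10.0.0.1     # assumed IP of the rank-0 node
main_process_port: 29500
main_training_function: main
mixed_precision: bf16
num_machines: 2
num_processes: 16             # 2 nodes x 8 GPUs
rdzv_backend: static
same_network: true
use_cpu: false
```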
The configuration on the other node is identical except that machine_rank is set to 1.
Below are the commands I used to launch the training:
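(The exact commands were not reproduced above. With a config like the one sketched earlier, the launch on each node would typically be along these lines; the file names here are assumptions.)

```bash
# Run on every node, each pointing at its own config file (machine_rank 0 vs. 1).
accelerate launch --config_file fsdp_config.yaml nlp_example.py
```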
Expected behavior
I am able to successfully run FSDP training on a single node, as well as multi-node DDP training using torchrun (with the same script).
However, I am unable to run FSDP training across multiple nodes. I have pinpointed that the script gets stuck at the `outputs = model(batch)` line (debugging revealed that both nodes block there and cannot proceed past it).
Does accelerate currently support multi-node FSDP training, or is there an issue with my configuration? By the way, is my issue the same as #2011? @muellerzr @SunMarc, could I have your assistance, please? Thank you!