huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

【FSDP】 Unable to Train on Multiple Nodes Using the Provided FSDP Example, Experiencing Blockages #2428

Closed fwyc0573 closed 5 months ago

fwyc0573 commented 5 months ago

I am encountering an issue where I cannot conduct training on multiple nodes using the provided FSDP example, as the process gets blocked.

My environment consists of:

System Info


2 nodes, each equipped with 2 NVIDIA GeForce RTX 3090 GPUs
Ubuntu 20.04.6 LTS
PyTorch 2.1.0, Python 3.9.18, transformers 4.38.0.dev0, torchvision 0.16.0, accelerate 0.26.1

Reproduction

I have been using the examples/nlp_example.py script from the accelerate project, with the batch_size modified to 1.

accelerate config:

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch_policy: BACKWARD_PRE
  fsdp_forward_prefetch: false
  fsdp_cpu_ram_efficient_loading: false
  fsdp_offload_params: false
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: false
  fsdp_transformer_layer_cls_to_wrap: BertLayer
  fsdp_use_orig_params: true
machine_rank: 0
main_process_ip: .........
main_process_port: 29500
main_training_function: main
mixed_precision: bf16
num_machines: 2
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

The configuration for the other node has been modified to machine_rank: 1, with all other settings remaining the same.

Below are the commands I used to launch the training:

export NCCL_P2P_DISABLE=1
export NCCL_IB_DISABLE=1
accelerate launch ./nlp_example.py
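
For reference, the same setup can presumably be launched without maintaining two separate config files by overriding the rank on the command line; a rough sketch, assuming accelerate's --machine_rank / --main_process_ip / --main_process_port launcher flags and using <main-node-ip> as a placeholder:

accelerate launch --machine_rank 0 --main_process_ip <main-node-ip> --main_process_port 29500 ./nlp_example.py   # on node 0
accelerate launch --machine_rank 1 --main_process_ip <main-node-ip> --main_process_port 29500 ./nlp_example.py   # on node 1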

Expected behavior

I am able to run FSDP successfully on a single node, as well as multi-node DDP training using torchrun (with the same script).
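
For comparison, the working multi-node DDP launch was roughly the following (a sketch; <main-node-ip> is a placeholder and the exact arguments may differ slightly from what I actually ran):

# on node 0 (node 1 uses --node_rank=1)
torchrun --nnodes=2 --nproc_per_node=2 --node_rank=0 \
    --master_addr=<main-node-ip> --master_port=29500 ./nlp_example.py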

However, I am unable to run FSDP training across multiple nodes. I have pinpointed that the script gets stuck at the outputs = model(batch) line (debugging shows that both nodes block there and never proceed past it).
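
If it helps, the hang can also be inspected with extra NCCL / torch.distributed logging; a sketch, assuming the standard environment variables (<nic-name> is a placeholder for the interface connecting the two nodes):

export NCCL_DEBUG=INFO                  # print NCCL transport/setup details
export NCCL_SOCKET_IFNAME=<nic-name>    # placeholder: pin NCCL to the inter-node interface
export TORCH_DISTRIBUTED_DEBUG=DETAIL   # extra collective-mismatch diagnostics
accelerate launch ./nlp_example.py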

Does accelerate currently support multi-node FSDP training, or is there an issue with my configuration? By the way, is my issue the same as #2011? Dear @muellerzr @SunMarc, could I have your assistance, please? Thank you!

muellerzr commented 5 months ago

cc @pacman100

pacman100 commented 5 months ago

Hello, FSDP works in a multi-node setup; we have trained a 70B Llama model on 2 nodes with 8 A100 GPUs each. For more details, you can refer to the blog post https://huggingface.co/blog/ram-efficient-pytorch-fsdp

fwyc0573 commented 5 months ago

Hello, FSDP works in a multi-node setup; we have trained a 70B Llama model on 2 nodes with 8 A100 GPUs each. For more details, you can refer to the blog post https://huggingface.co/blog/ram-efficient-pytorch-fsdp

Thank you! This is a great blog post; let me first try to reproduce it.