huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

Need to do distributed training using 2 separate machines #924

Status: Open · Sreyashi-Bhattacharjee opened this issue 1 year ago

Sreyashi-Bhattacharjee commented 1 year ago

Hello, I am trying to do distributed training using 2 separate machines. Can anyone please point me to a tutorial or demo on this? The configs created with `accelerate config` are:

Machine 1:

compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_CPU
fsdp_config: {}
machine_rank: 0
main_process_ip: 20.160.27.77
main_process_port: 8080
main_training_function: main
mixed_precision: fp16
num_machines: 2
num_processes: 2
use_cpu: true

Machine 2:

compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_CPU
fsdp_config: {}
machine_rank: 1
main_process_ip: 20.160.27.77
main_process_port: 8080
main_training_function: main
mixed_precision: fp16
num_machines: 2
num_processes: 2
use_cpu: true
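
For context, the script launched on each machine is essentially of the following shape. This is only a minimal sketch: the file name, model, data, and hyperparameters are placeholders, not my actual training code.

```python
# train.py -- minimal sketch of a script run with `accelerate launch train.py` on each node
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator


def main():
    # Accelerator picks up the settings written by `accelerate config` on this machine.
    accelerator = Accelerator()

    # Dummy model and data, standing in for the real ones.
    model = torch.nn.Linear(10, 2)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    dataset = TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,)))
    dataloader = DataLoader(dataset, batch_size=8)

    # prepare() wraps the objects for the configured distributed setup.
    model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

    model.train()
    for x, y in dataloader:
        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(x), y)
        accelerator.backward(loss)
        optimizer.step()


if __name__ == "__main__":
    main()
```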

Then I ran the launch command on both machines, but they each started training separately. Then I ran only the main machine, which again started training on its own. I am not able to get any concrete direction on this.
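
One way to check whether the two machines actually join a single process group (rather than each training alone) is to print the world size and rank the Accelerator reports on each node; with a working rendezvous, both nodes should report a world size of 2 and different ranks. A small check along these lines:

```python
from accelerate import Accelerator

accelerator = Accelerator()
# If the two nodes connected, num_processes is 2 and each node prints a different
# process_index; if each machine is training on its own, both print "world size: 1, rank: 0".
print(f"world size: {accelerator.num_processes}, rank: {accelerator.process_index}")
```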

muellerzr commented 1 year ago

@Sreyashi-Bhattacharjee this is not currently supported for multi-CPU; changing this to a feature request and adding it to our timetable

jav-ed commented 6 months ago

Any update on this?