huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

Need to do distributed training using 2 separate machines #924

Status: Open · Sreyashi-Bhattacharjee opened this issue 1 year ago

Sreyashi-Bhattacharjee commented 1 year ago

Hello, I am trying to do distributed training using 2 separate machines. Can anyone please point me to a tutorial or demo on this? The configs created with `accelerate config` are:

Machine 1:

compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_CPU
fsdp_config: {}
machine_rank: 0
main_process_ip: 20.160.27.77
main_process_port: 8080
main_training_function: main
mixed_precision: fp16
num_machines: 2
num_processes: 2
use_cpu: true

Machine 2:

compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_CPU
fsdp_config: {}
machine_rank: 1
main_process_ip: 20.160.27.77
main_process_port: 8080
main_training_function: main
mixed_precision: fp16
num_machines: 2
num_processes: 2
use_cpu: true
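
For context, the script launched on each machine is essentially of the following shape. This is only a minimal sketch: the file name, model, data, and hyperparameters are placeholders, not my actual training code.

```python
# train.py -- minimal sketch of a script run with `accelerate launch train.py` on each node
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator


def main():
    # Accelerator picks up the settings written by `accelerate config` on this machine.
    accelerator = Accelerator()

    # Dummy model and data, standing in for the real ones.
    model = torch.nn.Linear(10, 2)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    dataset = TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,)))
    dataloader = DataLoader(dataset, batch_size=8)

    # prepare() wraps the objects for the configured distributed setup.
    model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

    model.train()
    for x, y in dataloader:
        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(x), y)
        accelerator.backward(loss)
        optimizer.step()


if __name__ == "__main__":
    main()
```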

Then I ran the launch command on both machines, but they each started training separately. Then I ran only the main machine, which again started training on its own. I am not able to get any concrete direction on this.
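
One way to check whether the two machines actually join a single process group (rather than each training alone) is to print the world size and rank the Accelerator reports on each node; with a working rendezvous, both nodes should report a world size of 2 and different ranks. A small check along these lines:

```python
from accelerate import Accelerator

accelerator = Accelerator()
# If the two nodes connected, num_processes is 2 and each node prints a different
# process_index; if each machine is training on its own, both print "world size: 1, rank: 0".
print(f"world size: {accelerator.num_processes}, rank: {accelerator.process_index}")
```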

muellerzr commented 1 year ago

@Sreyashi-Bhattacharjee this is not currently supported for multi-CPU; changing this to a feature request and adding it to our timetable

jav-ed commented 6 months ago

Any update on this?