Hello, I am trying to do distributed training across 2 separate machines. Can anyone please point me to a tutorial or demo on this? The configs created using `accelerate config` are:
Machine 1:
compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_CPU
fsdp_config: {}
machine_rank: 0
main_process_ip: 20.160.27.77
main_process_port: 8080
main_training_function: main
mixed_precision: fp16
num_machines: 2
num_processes: 2
use_cpu: true
Machine 2:
compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_CPU
fsdp_config: {}
machine_rank: 1
main_process_ip: 20.160.27.77
main_process_port: 8080
main_training_function: main
mixed_precision: fp16
num_machines: 2
num_processes: 2
use_cpu: true
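One thing I have been checking on my own (this is my assumption, not something from any official guide) is whether machine 2 can actually open a TCP connection to the main process at `main_process_ip:main_process_port` (here 20.160.27.77:8080). If the machines cannot reach each other on that port, the rendezvous cannot happen and each launch would only ever see its own processes. A minimal reachability sketch:

```python
import socket

def can_reach(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        # create_connection resolves the host and attempts a full TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Run this on machine 2 before launching training, while something is
# listening on the main process port on machine 1:
# print(can_reach("20.160.27.77", 8080))
```

If this prints `False` from machine 2, the problem is network reachability (firewall, cloud security group, or the port not being open) rather than the accelerate config itself.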
Then I ran the launch command on both machines, but each started training separately. When I ran only the main machine, it again started training on its own. I am not able to find any concrete direction on this.