Our models support multi-GPU training. Specifically, when executing the run.sh script to train a model, you can set --gpu "0,1,2,3" to utilize GPUs 0-3 for multi-GPU training. For example, if you want to train VALLE, please refer to the recipe under egs/tts/VALLE. The same applies to other models; please refer to the recipe under egs/task/model.
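As a minimal sketch (other recipe-specific arguments, such as the experiment config, are omitted here and depend on the recipe's README), a four-GPU VALLE run could look like:

sh egs/tts/VALLE/run.sh --gpu "0,1,2,3"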
It does support multi-GPU training, but I need multiple machines for training, because sometimes a single node does not have that many GPUs. How can I do this with accelerate launch .......
To use accelerate for multi-machine, multi-GPU scenarios, you need to configure it through the accelerate config command and then launch your training script with the accelerate launch command.
Step 1: Use the accelerate config command to create a configuration file for a multi-machine, multi-GPU setup. Follow the interactive prompts to specify the relevant settings.
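The exact prompts vary with the accelerate version, but they roughly correspond to the fields of the generated YAML shown later in this thread:
- compute environment (local machine) and distributed type (multi-GPU)
- number of machines and this machine's rank
- main process IP address and port
- total number of processes (i.e., the number of GPUs across all machines)
- mixed precision setting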
Step 2: Launch your training script across the specified hardware. Remove CUDA_VISIBLE_DEVICES=$gpu from the launch line in the script, and then execute run.sh following the provided recipe.
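Assuming the launch line in run.sh looks roughly like the sketch below (the training-script name and its arguments are illustrative), the change is simply dropping the CUDA_VISIBLE_DEVICES prefix so that accelerate manages device placement itself:

# before (single machine): GPUs selected via CUDA_VISIBLE_DEVICES
CUDA_VISIBLE_DEVICES=$gpu accelerate launch train.py ...
# after (multi-machine): device placement handled by accelerate
accelerate launch train.py ...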
When I run the following:
accelerate launch --machine_rank 0 \
  --num_processes 16 \
  --num_machines 4 \
  --main_process_ip xxxxx \
  --main_process_port xxxxx \
  train.py
something goes wrong: a socket timeout.
Could you please add step-by-step print statements to help us understand where your program gets stuck?
Dr. Xue, @lmxue, could you assist with following up on this issue?
Thanks for your reply. I just don't know how to implement multi-machine training based on accelerate. Single-machine (single-node) multi-GPU training works normally, but the multi-machine training command above results in a socket timeout. So is there a problem with the accelerate launch command? I would like the author's opinion.
From my viewpoint, this appears to be a frequent issue encountered during training with Accelerate. Perhaps you could consult https://github.com/huggingface/accelerate/issues/314 for more insights.
@HarryHe11 Thanks so much! Problem solved; I use it as follows:
accelerate launch --config_file default_config.yaml \
--main_process_ip ${xx} \
--main_process_port ${xx} \
--machine_rank ${xx} \
--num_processes ${xx} \
--num_machines ${xx} \
train.py
default_config.yaml:
compute_environment: LOCAL_MACHINE
debug: true
distributed_type: MULTI_GPU
downcast_bf16: 'no'
machine_rank: 0
main_process_ip: 10.132.224.0
main_process_port: 24
main_training_function: main
mixed_precision: 'no'
num_machines: 2
num_processes: 4
rdzv_backend: c10d
same_network: false
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
The problem was that I was passing the wrong machine_rank before; the other parameters can follow the usage above. Training now runs normally.
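For anyone else hitting the same issue: with num_machines: 2, the same command is run on each node and only --machine_rank differs. A sketch using the IP and port from the config above:

# on the main node (machine_rank 0)
accelerate launch --config_file default_config.yaml \
  --main_process_ip 10.132.224.0 \
  --main_process_port 24 \
  --machine_rank 0 \
  --num_processes 4 \
  --num_machines 2 \
  train.py

# on the second node (machine_rank 1); everything else is identical
accelerate launch --config_file default_config.yaml \
  --main_process_ip 10.132.224.0 \
  --main_process_port 24 \
  --machine_rank 1 \
  --num_processes 4 \
  --num_machines 2 \
  train.py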