Our models support multi-GPU training. Specifically, when executing the run.sh script to train a model, you can set --gpu "0,1,2,3" to utilize GPUs 0-3 for multi-GPU training. For example, if you want to train VALLE, please refer to the recipe under egs/tts/VALLE. The same applies to other models; please refer to the recipe under egs/task/model.
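As a minimal sketch (other recipe-specific arguments, such as the experiment config, are omitted here and depend on the recipe's README), a four-GPU VALLE run could look like:

sh egs/tts/VALLE/run.sh --gpu "0,1,2,3"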
It does support multi-GPU training, but I need multiple machines for training, because sometimes a single node does not have that many GPUs. How can I do this with accelerate launch .......
To use accelerate for multi-machine, multi-GPU scenarios, you need to configure it through the accelerate config command and then launch your training script with the accelerate launch command.
Step 1: Use the accelerate config command to create a configuration file for a multi-machine, multi-GPU setup. Follow the interactive prompts to specify the relevant settings.
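The exact prompts vary with the accelerate version, but they roughly correspond to the fields of the generated YAML shown later in this thread:
- compute environment (local machine) and distributed type (multi-GPU)
- number of machines and this machine's rank
- main process IP address and port
- total number of processes (i.e., the number of GPUs across all machines)
- mixed precision setting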
Step 2: Launch your training script across the specified hardware. Remove CUDA_VISIBLE_DEVICES=$gpu from the launch line in the script, and then execute run.sh following the provided recipe.
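Assuming the launch line in run.sh looks roughly like the sketch below (the training-script name and its arguments are illustrative), the change is simply dropping the CUDA_VISIBLE_DEVICES prefix so that accelerate manages device placement itself:

# before (single machine): GPUs selected via CUDA_VISIBLE_DEVICES
CUDA_VISIBLE_DEVICES=$gpu accelerate launch train.py ...
# after (multi-machine): device placement handled by accelerate
accelerate launch train.py ...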
When I run the following:
accelerate launch --machine_rank 0 \
  --num_processes 16 \
  --num_machines 4 \
  --main_process_ip xxxxx \
  --main_process_port xxxxx \
  train.py
something goes wrong: a socket timeout.
Could you please add step-by-step print statements to help us understand where your program gets stuck?
Dr. Xue, @lmxue, could you assist with following up on this issue?
Thanks for your reply. I just don't know how to implement multi-machine training based on accelerate. Single-machine (single-node) multi-GPU training works normally, but the multi-machine training command above results in a socket timeout. So is there a problem with the accelerate launch command? I would like the author's opinion.
From my viewpoint, this appears to be a frequent issue encountered during training with Accelerate. Perhaps you could consult https://github.com/huggingface/accelerate/issues/314 for more insights.
@HarryHe11 Thanks so much! Problem solved; I use it as follows:
accelerate launch --config_file default_config.yaml \
--main_process_ip ${xx} \
--main_process_port ${xx} \
--machine_rank ${xx} \
--num_processes ${xx} \
--num_machines ${xx} \
train.py
default_config.yaml:
compute_environment: LOCAL_MACHINE
debug: true
distributed_type: MULTI_GPU
downcast_bf16: 'no'
machine_rank: 0
main_process_ip: 10.132.224.0
main_process_port: 24
main_training_function: main
mixed_precision: 'no'
num_machines: 2
num_processes: 4
rdzv_backend: c10d
same_network: false
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
The problem was that I was passing the wrong machine_rank before; the other parameters can follow the usage above. Training now runs normally.
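For anyone else hitting the same issue: with num_machines: 2, the same command is run on each node and only --machine_rank differs. A sketch using the IP and port from the config above:

# on the main node (machine_rank 0)
accelerate launch --config_file default_config.yaml \
  --main_process_ip 10.132.224.0 \
  --main_process_port 24 \
  --machine_rank 0 \
  --num_processes 4 \
  --num_machines 2 \
  train.py

# on the second node (machine_rank 1); everything else is identical
accelerate launch --config_file default_config.yaml \
  --main_process_ip 10.132.224.0 \
  --main_process_port 24 \
  --machine_rank 1 \
  --num_processes 4 \
  --num_machines 2 \
  train.py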