open-mmlab / Amphion

Amphion (/æmˈfaɪən/) is a toolkit for Audio, Music, and Speech Generation. Its purpose is to support reproducible research and help junior researchers and engineers get started in the field of audio, music, and speech generation research and development.
https://openhlt.github.io/amphion/

[Help]: MultiGPU TTA training #159

Open fpicetti opened 5 months ago

fpicetti commented 5 months ago

Problem Overview

I'd like to train a TTA model (following your examples) in a multi-GPU environment (4× A100), but I have been unsuccessful so far.

Steps Taken

  1. prepared AudioCaps dataset
  2. fixed typos in the base config files for both the autoencoderkl and audioldm folders
  3. updated the json and sh files according to my dataset
  4. launched the train script with `sh egs/tta/autoencoderkl/run_train.sh`, with no further modification -> it works on the first GPU, as expected
  5. modified run_train.sh#L19 to `export CUDA_VISIBLE_DEVICES="0,1,2,3"` -> it still runs on the first GPU only
  6. keeping point 5, also changed exp_config.json#L38 to `"ddp": true` -> fails, asking for all the distributed environment variables (RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT)
  7. reverted points 5 and 6, and tried to leverage accelerate: ran `accelerate config` to set up single-node multi-GPU training. `accelerate test` works fine on the 4 GPUs.
  8. removed run_train.sh#L19 and modified run_train.sh#L22 to `accelerate launch "${work_dir}"/bins/tta/train_tta.py` -> I see 4 processes on the first GPU, then it goes OOM (a sketch of the missing distributed setup follows this list)
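For context on why steps 6 and 8 fail the way they do: a trainer that initializes torch.distributed with the default `env://` method expects RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT to be exported by a launcher, and each spawned process also has to pin itself to its own GPU, otherwise every rank lands on cuda:0. Below is a minimal sketch of that setup, assuming `"ddp": true` calls `init_process_group` under the hood; it is not taken from Amphion's actual train_tta.py.

```python
# Hypothetical sketch of the per-process distributed setup; not Amphion code.
import os
import torch
import torch.distributed as dist

def setup_distributed():
    # A launcher such as `torchrun --nproc_per_node=4 ...` or `accelerate launch`
    # exports these variables for every process it spawns. Running the script
    # with plain `sh run_train.sh` leaves them unset, hence the error in step 6.
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    local_rank = int(os.environ["LOCAL_RANK"])

    dist.init_process_group(backend="nccl", init_method="env://",
                            rank=rank, world_size=world_size)

    # Without this, every rank defaults to cuda:0, which matches the
    # "4 processes on the first GPU, then OOM" symptom from step 8.
    torch.cuda.set_device(local_rank)
    return torch.device("cuda", local_rank)
```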

Expected Outcome

A single train job on 4 GPUs.

Environment Information

fpicetti commented 5 months ago

@HeCheng0625 any update on this?

HeCheng0625 commented 4 months ago

Hi, TTA currently only supports single-GPU training. You can refer to the other tasks to see how to implement multi-GPU training based on accelerate. PRs are welcome.
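For anyone picking this up: the accelerate pattern used by the other tasks essentially means wrapping the model, optimizer, and dataloader with `Accelerator.prepare` and replacing `loss.backward()` with `accelerator.backward(loss)`. A rough sketch follows; the function signature and loss computation are placeholders, not the actual TTA trainer API.

```python
# Rough sketch of an accelerate-based training loop; names are placeholders.
from accelerate import Accelerator

def train(model, optimizer, train_loader, num_epochs):
    accelerator = Accelerator()  # picks up the settings from `accelerate config`

    # prepare() wraps the model in DDP, shards the dataloader across ranks,
    # and moves everything onto the GPU assigned to this process.
    model, optimizer, train_loader = accelerator.prepare(model, optimizer, train_loader)

    for epoch in range(num_epochs):
        for batch in train_loader:
            optimizer.zero_grad()
            loss = model(batch)          # placeholder: the real loss computation differs
            accelerator.backward(loss)   # replaces loss.backward()
            optimizer.step()

        if accelerator.is_main_process:
            # save/log only once, not once per GPU
            accelerator.save(accelerator.unwrap_model(model).state_dict(),
                             f"checkpoint_epoch{epoch}.pt")

# Launched with e.g.:  accelerate launch "${work_dir}"/bins/tta/train_tta.py
```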

hieuhthh commented 2 months ago

Any plans to support multi-GPU training for the TTA task yet?