open-mmlab / Amphion

Amphion (/æmˈfaɪən/) is a toolkit for Audio, Music, and Speech Generation. Its purpose is to support reproducible research and help junior researchers and engineers get started in the field of audio, music, and speech generation research and development.
https://openhlt.github.io/amphion/
MIT License

[BUG]: FastSpeech2 training fails under cuda_devices="1,2,3" #167

Closed huangxu1991 closed 4 months ago

huangxu1991 commented 5 months ago

The errors are as follows:

torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

Failures:
  [1]:
    time       : 2024-03-27_14:52:55
    host       : iv-ycucodpkaowuxjsftsvp
    rank       : 1 (local_rank: 1)
    exitcode   : -11 (pid: 45620)
    error_file : <N/A>
    traceback  : Signal 11 (SIGSEGV) received by PID 45620
  [2]:
    time       : 2024-03-27_14:52:55
    host       : iv-ycucodpkaowuxjsftsvp
    rank       : 2 (local_rank: 2)
    exitcode   : -11 (pid: 45621)
    error_file : <N/A>
    traceback  : Signal 11 (SIGSEGV) received by PID 45621

Root Cause (first observed failure):
  [0]:
    time       : 2024-03-27_14:52:55
    host       : iv-ycucodpkaowuxjsftsvp
    rank       : 0 (local_rank: 0)
    exitcode   : -11 (pid: 45619)
    error_file : <N/A>
    traceback  : Signal 11 (SIGSEGV) received by PID 45619

However, training works fine when using only one GPU!
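
For reference, one generic way to turn a bare SIGSEGV like the one above into a Python-level traceback, assuming nothing about Amphion's launch scripts, is to enable the standard-library faulthandler near the top of the training entry point (the per-rank log file name below is only an illustration):

```python
# Hypothetical debugging snippet, not part of Amphion: dump a Python-level
# traceback when a worker process receives a fatal signal such as SIGSEGV.
import faulthandler
import os

# Write the dump to a per-rank file so output from different ranks does not
# get interleaved on stderr. torchrun sets the RANK environment variable.
rank = os.environ.get("RANK", "0")
crash_log = open(f"crash_rank{rank}.log", "w")

# On SIGSEGV (and other fatal signals) the tracebacks of all threads are
# written to the file before the process exits with -11.
faulthandler.enable(file=crash_log, all_threads=True)
```

Alternatively, launching the workers with `python -X faulthandler ...` enables the same behaviour without touching the code.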

lmxue commented 5 months ago

Hi @huangxu1991, the error you're encountering, Signal 11 (SIGSEGV), is a segmentation fault: the operating system kills a process that tries to access memory it is not allowed to access. This can be challenging to debug, especially in distributed training across multiple GPUs. Given that your training runs fine on a single GPU but fails when scaling to multiple GPUs, here are several steps and considerations to help you resolve the issue (a quick check script is sketched after the list):

  1. Ensure Environment Consistency Across GPUs
    • CUDA and cuDNN Versions: Make sure all GPUs are running the same versions of CUDA and cuDNN. Inconsistencies can lead to unexpected behaviors.
    • PyTorch Version: Confirm that your PyTorch version is compatible with your CUDA and cuDNN versions.
  2. Check GPU Resources
    • Memory Check: Ensure each GPU has enough free memory for the model and data. In multi-GPU training, every process holds its own model replica plus communication buffers, so the total memory requirement grows with the number of GPUs.
    • Compute Capability: Verify that all GPUs have the necessary compute capability for your specific training tasks.
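
As a rough sketch of these checks (plain PyTorch, not an Amphion utility), the script below prints the framework versions plus each visible GPU's compute capability and free memory; run it with the same CUDA_VISIBLE_DEVICES as the failing job to spot mismatches quickly:

```python
# env_check.py -- illustrative helper, not part of Amphion. Run it on each
# node with the same CUDA_VISIBLE_DEVICES as the training job.
import torch

print("PyTorch :", torch.__version__)
print("CUDA    :", torch.version.cuda)            # CUDA version PyTorch was built against
print("cuDNN   :", torch.backends.cudnn.version())

for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    free, total = torch.cuda.mem_get_info(i)      # bytes free / total on this device
    print(
        f"cuda:{i}: {props.name}, "
        f"compute capability {props.major}.{props.minor}, "
        f"{free / 2**30:.1f} GiB free of {total / 2**30:.1f} GiB"
    )
```

If the versions differ across the environment, or one of the GPUs in "1,2,3" is already occupied by another job, that is a good place to start looking.
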
RMSnow commented 4 months ago

Hi @huangxu1991, if you have any further questions, feel free to reopen this issue. We'd be glad to follow up!

huangxu1991 commented 4 months ago

ok