open-mmlab / Amphion

Amphion (/æmˈfaɪən/) is a toolkit for Audio, Music, and Speech Generation. Its purpose is to support reproducible research and help junior researchers and engineers get started in the field of audio, music, and speech generation research and development.
https://openhlt.github.io/amphion/
MIT License

[BUG]: FastSpeech2 training fails under cuda_devices="1,2,3" #167

Closed huangxu1991 closed 4 months ago

huangxu1991 commented 5 months ago

The errors are as follows:

torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

Failures:
  [1]:
    time       : 2024-03-27_14:52:55
    host       : iv-ycucodpkaowuxjsftsvp
    rank       : 1 (local_rank: 1)
    exitcode   : -11 (pid: 45620)
    error_file : <N/A>
    traceback  : Signal 11 (SIGSEGV) received by PID 45620
  [2]:
    time       : 2024-03-27_14:52:55
    host       : iv-ycucodpkaowuxjsftsvp
    rank       : 2 (local_rank: 2)
    exitcode   : -11 (pid: 45621)
    error_file : <N/A>
    traceback  : Signal 11 (SIGSEGV) received by PID 45621

Root Cause (first observed failure):
  [0]:
    time       : 2024-03-27_14:52:55
    host       : iv-ycucodpkaowuxjsftsvp
    rank       : 0 (local_rank: 0)
    exitcode   : -11 (pid: 45619)
    error_file : <N/A>
    traceback  : Signal 11 (SIGSEGV) received by PID 45619

However, training works fine when using only one GPU!
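
For reference, one generic way to turn a bare SIGSEGV like the one above into a Python-level traceback, assuming nothing about Amphion's launch scripts, is to enable the standard-library faulthandler near the top of the training entry point (the per-rank log file name below is only an illustration):

```python
# Hypothetical debugging snippet, not part of Amphion: dump a Python-level
# traceback when a worker process receives a fatal signal such as SIGSEGV.
import faulthandler
import os

# Write the dump to a per-rank file so output from different ranks does not
# get interleaved on stderr. torchrun sets the RANK environment variable.
rank = os.environ.get("RANK", "0")
crash_log = open(f"crash_rank{rank}.log", "w")

# On SIGSEGV (and other fatal signals) the tracebacks of all threads are
# written to the file before the process exits with -11.
faulthandler.enable(file=crash_log, all_threads=True)
```

Alternatively, launching the workers with `python -X faulthandler ...` enables the same behaviour without touching the code.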

lmxue commented 5 months ago

Hi @huangxu1991, the error you're encountering, Signal 11 (SIGSEGV), is a segmentation fault: the operating system kills a process that tries to access memory it is not allowed to access. This can be challenging to debug, especially in distributed training across multiple GPUs. Given that your training runs fine on a single GPU but fails when scaling to multiple GPUs, here are several steps and considerations to help you resolve the issue (a quick check script is sketched after the list):

  1. Ensure Environment Consistency Across GPUs
    • CUDA and cuDNN Versions: Make sure all GPUs are running the same versions of CUDA and cuDNN. Inconsistencies can lead to unexpected behaviors.
    • PyTorch Version: Confirm that your PyTorch version is compatible with your CUDA and cuDNN versions.
  2. Check GPU Resources
    • Memory Check: Ensure each GPU has enough free memory for the model and data. In multi-GPU training, every process holds its own model replica plus communication buffers, so the total memory requirement grows with the number of GPUs.
    • Compute Capability: Verify that all GPUs have the necessary compute capability for your specific training tasks.
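
As a rough sketch of these checks (plain PyTorch, not an Amphion utility), the script below prints the framework versions plus each visible GPU's compute capability and free memory; run it with the same CUDA_VISIBLE_DEVICES as the failing job to spot mismatches quickly:

```python
# env_check.py -- illustrative helper, not part of Amphion. Run it on each
# node with the same CUDA_VISIBLE_DEVICES as the training job.
import torch

print("PyTorch :", torch.__version__)
print("CUDA    :", torch.version.cuda)            # CUDA version PyTorch was built against
print("cuDNN   :", torch.backends.cudnn.version())

for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    free, total = torch.cuda.mem_get_info(i)      # bytes free / total on this device
    print(
        f"cuda:{i}: {props.name}, "
        f"compute capability {props.major}.{props.minor}, "
        f"{free / 2**30:.1f} GiB free of {total / 2**30:.1f} GiB"
    )
```

If the versions differ across the environment, or one of the GPUs in "1,2,3" is already occupied by another job, that is a good place to start looking.
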
RMSnow commented 4 months ago

Hi @huangxu1991, if you have any further questions, feel free to reopen this issue. We'd be glad to follow up!

huangxu1991 commented 4 months ago

ok