jaywalnut310 / vits

VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech
https://jaywalnut310.github.io/vits-demo/index.html
MIT License

Problems with multiprocess training #106

Open BrianWayland opened 1 year ago

BrianWayland commented 1 year ago

@jaywalnut310 My platform is an RTX 3060 with PyTorch 1.10.0+cu113. I encountered the following exception when executing train_ms.py:

THCudaCheck FAIL file=../aten/src/THC/THCCachingHostAllocator.cpp line=280 error=710 : device-side assert triggered
terminate called after throwing an instance of 'c10::Error'
  what():  NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:181, unhandled cuda error, NCCL version 21.0.3
Process Group destroyed on rank 0
Exception raised from ncclCommAbort at ../torch/csrc/distributed/c10d/NCCLUtils.hpp:181 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f854408dd62 in /environment/miniconda3/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x7f854408a68b in /environment/miniconda3/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #2: <unknown function> + 0x30a6c6e (0x7f85a136ec6e in /environment/miniconda3/lib/python3.7/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #3: c10d::ProcessGroupNCCL::~ProcessGroupNCCL() + 0x113 (0x7f85a1357813 in /environment/miniconda3/lib/python3.7/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #4: c10d::ProcessGroupNCCL::~ProcessGroupNCCL() + 0x9 (0x7f85a1357a39 in /environment/miniconda3/lib/python3.7/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #5: <unknown function> + 0xe97556 (0x7f860b65d556 in /environment/miniconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0xe7d085 (0x7f860b643085 in /environment/miniconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x2a35e8 (0x7f860aa695e8 in /environment/miniconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0x2a48ee (0x7f860aa6a8ee in /environment/miniconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #9: <unknown function> + 0x13be28 (0x555afdef1e28 in /environment/miniconda3/bin/python)
frame #10: PyDict_Clear + 0x133 (0x555afdef1c43 in /environment/miniconda3/bin/python)
frame #11: <unknown function> + 0x13bc89 (0x555afdef1c89 in /environment/miniconda3/bin/python)
frame #12: <unknown function> + 0x163141 (0x555afdf19141 in /environment/miniconda3/bin/python)
frame #13: _PyGC_CollectNoFail + 0x2a (0x555afdf8600a in /environment/miniconda3/bin/python)
frame #14: PyImport_Cleanup + 0x532 (0x555afdf61582 in /environment/miniconda3/bin/python)
frame #15: Py_FinalizeEx + 0x6e (0x555afdf9f0de in /environment/miniconda3/bin/python)
frame #16: Py_Exit + 0x8 (0x555afdf9f1f8 in /environment/miniconda3/bin/python)
frame #17: <unknown function> + 0x1e92ae (0x555afdf9f2ae in /environment/miniconda3/bin/python)
frame #18: PyErr_PrintEx + 0x2c (0x555afdf9f2fc in /environment/miniconda3/bin/python)
frame #19: PyRun_SimpleStringFlags + 0x62 (0x555afdfa4b72 in /environment/miniconda3/bin/python)
frame #20: <unknown function> + 0x1eec4a (0x555afdfa4c4a in /environment/miniconda3/bin/python)
frame #21: _Py_UnixMain + 0x3c (0x555afdfa500c in /environment/miniconda3/bin/python)
frame #22: __libc_start_main + 0xf3 (0x7f860dcdc0b3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #23: <unknown function> + 0x1c24db (0x555afdf784db in /environment/miniconda3/bin/python)

Traceback (most recent call last):
  File "train_ms.py", line 295, in <module>
    main()
  File "train_ms.py", line 50, in main
    mp.spawn(run, nprocs=n_gpus, args=(n_gpus, hps,))
  File "/environment/miniconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/environment/miniconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/environment/miniconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/environment/miniconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/featurize/data/vits-main/train_ms.py", line 119, in run
    train_and_evaluate(rank, epoch, hps, [net_g, net_d], [optim_g, optim_d], [scheduler_g, scheduler_d], scaler, [train_loader, eval_loader], logger, [writer, writer_eval])
  File "/home/featurize/data/vits-main/train_ms.py", line 147, in train_and_evaluate
    (z, z_p, m_p, logs_p, m_q, logs_q) = net_g(x, x_lengths, spec, spec_lengths, speakers)
  File "/environment/miniconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/environment/miniconda3/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 886, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/environment/miniconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/featurize/data/vits-main/models.py", line 467, in forward
    z, m_q, logs_q, y_mask = self.enc_q(y, y_lengths, g=g)
  File "/environment/miniconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/featurize/data/vits-main/models.py", line 237, in forward
    x = self.enc(x, x_mask, g=g)
  File "/environment/miniconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/featurize/data/vits-main/modules.py", line 166, in forward
    n_channels_tensor)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
  File "/home/featurize/data/vits-main/commons.py", line 103, in fused_add_tanh_sigmoid_multiply
def fused_add_tanh_sigmoid_multiply(input_a, input_b, n_channels):
  n_channels_int = n_channels[0]
  in_act = input_a + input_b
           ~~~~~~~~~~~~~~~~~ <--- HERE
  t_act = torch.tanh(in_act[:, :n_channels_int, :])
  s_act = torch.sigmoid(in_act[:, n_channels_int:, :])
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

It seems this problem happens because of training on multiple GPUs. Since my platform has only one GPU, I don't know how this happened. I look forward to your kind reply! Thank you!
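
As the error output itself notes, CUDA errors are reported asynchronously, so the traceback above may not point at the operation that actually failed. A minimal debugging sketch (an assumption, not part of the repo) is to force synchronous kernel launches before any CUDA work happens, then rerun train_ms.py and read the new traceback:

# Hypothetical debugging sketch: set CUDA_LAUNCH_BLOCKING before torch creates
# a CUDA context, so the device-side assert is raised at the op that caused it.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must be set before the first CUDA call

import torch  # imported only after the environment variable is in place

# ... then launch training as usual, e.g.
#   python train_ms.py -c configs/<your_config>.json -m <model_dir>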

v-nhandt21 commented 1 year ago

Try installing another torch version that matches your driver.

Check here: https://pytorch.org/get-started/previous-versions/
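
Before reinstalling, it may help to confirm which CUDA build of torch is in the environment and whether the driver actually exposes the GPU. A small, self-contained check (nothing here is specific to VITS):

# Environment sanity check: print the installed torch build, the CUDA toolkit
# it was compiled against, and the device the driver reports. A mismatch
# between the cuXXX build and the driver is a common source of CUDA/NCCL errors.
import torch

print("torch:", torch.__version__)            # e.g. 1.10.0+cu113
print("built with CUDA:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
    print("visible GPUs:", torch.cuda.device_count())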