zhufeijuanjuan opened 1 year ago
@zhufeijuanjuan did you solve this? I've encountered the same problem.
@elch10 Not handled yet. Decreasing the batch size can avoid this issue.
Maybe the problem is with mp.spawn or the PyTorch installation.
I've created a new environment with Python 3.11 and PyTorch 2, because Python 3.10 has many SIGSEGV issues reported in the PyTorch repository. Then I replaced this line with
children = []
for i in range(n_gpus):
    # Launch one worker process per GPU, passing the rank explicitly.
    subproc = mp.Process(target=run, args=(i, n_gpus, hps))
    children.append(subproc)
    subproc.start()
for i in range(n_gpus):
    # Wait for all workers to finish.
    children[i].join()
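For context, this replaces the usual torch.multiprocessing spawn-based launch, which in VITS-style training scripts looks roughly like the line below; the exact argument order in this repo may differ, so treat it as an illustration. mp.spawn passes the process index as the first argument automatically, which is why the mp.Process version passes i explicitly.

# Original spawn-based launch shown for comparison; the argument order is an
# assumption based on typical VITS-style training scripts.
mp.spawn(run, nprocs=n_gpus, args=(n_gpus, hps,))

If CUDA gets initialized in the parent process before the workers are created, mp.set_start_method('spawn') may also be needed before starting them.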
And now there are no errors.
I also found with python -X faulthandler that the error is in the Cython monotonic_align. I just rewrote it in Numba. It's slightly slower, but it works without errors.
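For anyone who wants to try the same workaround, here is a minimal Numba sketch of the monotonic alignment search, written against the [t_y, t_x] per-item layout used by the common VITS-style Cython kernel; the function names and the exact array layout in this repo are assumptions, so treat this as an illustration rather than a drop-in replacement.

import numba

@numba.njit(cache=True)
def maximum_path_each(path, value, t_y, t_x):
    # Forward pass: accumulate the best monotonic score in-place in `value`.
    max_neg_val = -1e9
    index = t_x - 1
    for y in range(t_y):
        for x in range(max(0, t_x + y - t_y), min(t_x, y + 1)):
            v_cur = max_neg_val if x == y else value[y - 1, x]
            if x == 0:
                v_prev = 0.0 if y == 0 else max_neg_val
            else:
                v_prev = value[y - 1, x - 1]
            value[y, x] += max(v_prev, v_cur)
    # Backtracking: mark the chosen monotonic path with ones.
    for y in range(t_y - 1, -1, -1):
        path[y, index] = 1
        if index != 0 and (index == y or value[y - 1, index] < value[y - 1, index - 1]):
            index -= 1

@numba.njit(parallel=True, cache=True)
def maximum_path_batch(paths, values, t_ys, t_xs):
    # One independent alignment per batch element, run in parallel.
    for b in numba.prange(values.shape[0]):
        maximum_path_each(paths[b], values[b], t_ys[b], t_xs[b])

A wrapper would then copy the score tensor to a float32 NumPy array, allocate an int32 path array of the same shape, call maximum_path_batch with the per-item lengths taken from the mask, and convert the path back to a torch tensor.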
@elch10 Respect!!! So you mean updating Python and PyTorch and rewriting monotonic_align can solve this issue?
Yeah, of course, no errors
A segmentation fault appears after training a few steps when batch size > 16; everything is OK when batch size <= 16. The same issue exists when using DP training.
I used faulthandler to track which step the core dump happens in (see the note after this message for enabling it in-script); the log shows the following:

Current thread 0x00007fba40fe94c0 (most recent call first):
  File "xx/TTS/monotonic_align/__init__.py", line 40 in maximum_path
  File "xx/TTS//models/models.py", line 822 in forward
  File "xx/miniconda3/envs/pytorch2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501 in _call_impl
  File "train_multilang_speaker_1gpu.py", line 175 in train_and_evaluate
  File "train_multilang_speaker_1gpu.py", line 142 in run
  File "train_multilang_speaker_1gpu.py", line 50 in main
  File "train_multilang_speaker_1gpu.py", line 407 in <module>
It seems related to monotonic_align. Can anyone help solve it? Thanks.
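For reference, faulthandler can also be enabled from inside the training script instead of via python -X faulthandler on the command line; a minimal sketch, with nothing specific to this repo:

import faulthandler
faulthandler.enable()  # dump the Python-level traceback on fatal signals such as SIGSEGV

This produces the same "Current thread ... (most recent call first)" dump shown above.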