segmentation fault after train a few steps

zhufeijuanjuan commented 1 year ago

A segmentation fault appears after a few training steps when the batch size is greater than 16. Everything is fine when the batch size is 16 or less. The same issue occurs with DP training.

I used faulthandler to find the step where the core dump happens. The log shows:

  Current thread 0x00007fba40fe94c0 (most recent call first):
    File "xx/TTS/monotonic_align/init.py", line 40 in maximum_path
    File "xx/TTS//models/models.py", line 822 in forward
    File "xx/miniconda3/envs/pytorch2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501 in _call_impl
    File "train_multilang_speaker_1gpu.py", line 175 in train_and_evaluate
    File "train_multilang_speaker_1gpu.py", line 142 in run
    File "train_multilang_speaker_1gpu.py", line 50 in main
    File "train_multilang_speaker_1gpu.py", line 407 in
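For anyone who wants to capture the same kind of trace, here is a minimal sketch using Python's standard faulthandler module. The helper name enable_crash_traces and the log path are made up for illustration; only faulthandler.enable and faulthandler.is_enabled come from the standard library.

```python
import faulthandler


def enable_crash_traces(log_path):
    """Write a Python traceback to log_path if the process receives a fatal signal.

    The returned file object must stay open for as long as faulthandler is
    enabled, so the caller should keep a reference to it.
    """
    log_file = open(log_path, "w")
    # all_threads=True dumps every thread, not only the one that crashed.
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file
```

Calling this once at the top of the training script, before any worker processes are started, should be enough. The same handler can also be turned on without code changes by running the script with python -X faulthandler or by setting PYTHONFAULTHANDLER=1 in the environment.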

It seems to be related to monotonic_align. Can anyone help solve it? Thanks.

elch10 commented 8 months ago

@zhufeijuanjuan Did you solve the problem? I've run into the same issue.

zhufeijuanjuan commented 8 months ago

@elch10 Not solved yet. Decreasing the batch size avoids the issue.

elch10 commented 8 months ago

It may be a problem with mp.spawn or with the PyTorch installation. I created a new environment with Python 3.11 and PyTorch 2, because there are many SIGSEGV reports for Python 3.10 in the PyTorch repository. Then I replaced this line with:

  # Start one training process per GPU.
  children = []
  for i in range(n_gpus):
      subproc = mp.Process(target=run, args=(i, n_gpus, hps))
      children.append(subproc)
      subproc.start()

  # Wait for every worker to finish.
  for child in children:
      child.join()

And now there are no errors.
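One thing to keep in mind with plain mp.Process workers is that a crashed child does not raise anything in the parent, so it can help to look at the exit codes after joining. The sketch below assumes the standard library multiprocessing module (torch.multiprocessing is a wrapper around it with the same Process API). The run body, the hps value, and the helper names make_workers and start_and_check are placeholders, not code from the training script.

```python
import multiprocessing as mp


def run(rank, n_gpus, hps):
    # Placeholder for the real per-GPU training function (hypothetical body).
    pass


def make_workers(n_gpus, hps, target=run):
    """Create one unstarted process per GPU rank."""
    return [
        mp.Process(target=target, args=(rank, n_gpus, hps))
        for rank in range(n_gpus)
    ]


def start_and_check(children):
    """Start all workers, wait for them, and return their exit codes.

    An exit code of 0 means a clean exit. A negative value -N means the
    child was terminated by signal N, for example -11 for SIGSEGV on Linux.
    """
    for child in children:
        child.start()
    for child in children:
        child.join()
    return [child.exitcode for child in children]
```

In a script this would be called as start_and_check(make_workers(n_gpus, hps)) under an if __name__ == "__main__": guard, and any non-zero entry in the result points at the rank that died.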

elch10 commented 8 months ago

I also found with python -X faulthandler that the error is in the Cython monotonic_align code. I just rewrote it in numba. It's slightly slower, but it works without errors.
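The numba version itself was not posted in this thread. As a rough illustration of what such a rewrite has to compute, here is a plain-Python sketch of the usual dynamic program for a monotonic alignment path over a single sample. It assumes value[y][x] is the score for aligning frame y to token x and that there are at least as many frames as tokens. The function name maximum_path_single and the NEG_INF constant are made up for this example.

```python
NEG_INF = -1e9  # stand-in score for cells the path cannot come from


def maximum_path_single(value, t_y, t_x):
    """Return a 0/1 path matrix over the first t_y rows and t_x columns of value."""
    if not 0 < t_x <= t_y:
        raise ValueError("need 0 < t_x <= t_y")
    # Work on a copy so the caller's scores are not modified.
    score = [list(row[:t_x]) for row in value[:t_y]]
    path = [[0] * t_x for _ in range(t_y)]

    # Forward pass: each cell accumulates the best of "stay on the same
    # token" (from the row above) and "advance one token" (from the
    # upper-left neighbour).
    for y in range(t_y):
        for x in range(max(0, t_x + y - t_y), min(t_x, y + 1)):
            stay = NEG_INF if x == y else score[y - 1][x]
            if x == 0:
                advance = 0.0 if y == 0 else NEG_INF
            else:
                advance = score[y - 1][x - 1]
            score[y][x] += max(stay, advance)

    # Backward pass: start at the last token and walk up row by row.
    index = t_x - 1
    for y in range(t_y - 1, -1, -1):
        path[y][index] = 1
        if index != 0 and (
            index == y or score[y - 1][index] < score[y - 1][index - 1]
        ):
            index -= 1
    return path
```

The explicit length check is there because, with more tokens than frames, the backward pass would index a row before the first one. With NumPy arrays in place of the nested lists, loops of this shape are the kind of code numba's njit decorator is designed to compile, though that variant is not shown here.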

zhufeijuanjuan commented 8 months ago

@elch10 Respect!!! So you mean that updating Python and PyTorch and rewriting monotonic_align solves this issue?

elch10 commented 8 months ago

Yeah, of course, no errors

jaywalnut310 / vits

segmentation fault after train a few steps #181