p0p4k / pflowtts_pytorch

Unofficial implementation of NVIDIA P-Flow TTS paper
https://neurips.cc/virtual/2023/poster/69899
MIT License

Segmentation fault while training on a new language #44

Closed bharathraj-v closed 2 months ago

bharathraj-v commented 2 months ago

Hi,

I modified the pflow/text/symbols.py, pflow/text/cleaners.py and configs/model/pflow.yaml similarly to what I had done in MatchaTTS to make the repo work with an Indian language.

But the training keeps crashing with a Segmentation fault (core dumped) error after a certain number of epochs. The epoch at which the segfault happens varies with the batch size; higher batch sizes crash sooner. I'm training with 20 num_workers on an NVIDIA A100 80GB. The batch size I last used was 14, which crashed at epoch 23; before that, batch size 17 crashed at epoch 8, and anything higher than that went OOM.

Any guidance regarding this issue would be of great help.

Thank you!

bharathraj-v commented 2 months ago

The tensorboard for the training that crashed at epoch 23: https://github.com/p0p4k/pflowtts_pytorch/assets/69118968/c6ab1c7d-4f91-442b-841f-7f15971e7e81

bharathraj-v commented 2 months ago

My CUDA version/drivers

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
p0p4k commented 2 months ago

What happens when num workers is 0 or 1 in the dataloader?

bharathraj-v commented 2 months ago

Doesn't seem to make any difference. The same segfault occurred with 0 num_workers.

lexkoro commented 2 months ago

Might be a wild guess, but you could try using the numpy version of maximum_path search: https://github.com/coqui-ai/TTS/blob/dbf1a08a0d4e47fdad6172e433eeb34bc6b13b4e/TTS/tts/utils/helpers.py#L197

Long time ago I also had problems with seg faults, running the training with gdb showed that it was related to maximum_path and using the numpy version fixed it.
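For reference, here is a minimal sketch of what such a NumPy-based monotonic alignment search looks like, in the spirit of the linked helper. It assumes neg_cent and the mask have shape [batch, t_text, t_mel]; check that against pflow's actual layout before dropping it in.

import numpy as np
import torch


def maximum_path_numpy(value, mask, max_neg_val=-np.inf):
    # Monotonic alignment search on CPU/NumPy.
    # value, mask: [b, t_x, t_y] with t_x = text tokens, t_y = mel frames (assumed layout).
    device, dtype = value.device, value.dtype
    value = (value * mask).detach().cpu().numpy()
    mask = mask.detach().cpu().numpy().astype(bool)

    b, t_x, t_y = value.shape
    direction = np.zeros(value.shape, dtype=np.int64)
    v = np.zeros((b, t_x), dtype=np.float32)
    x_range = np.arange(t_x, dtype=np.float32).reshape(1, -1)

    for j in range(t_y):
        # Best score if we advanced from the previous text token (shift right by one).
        v0 = np.pad(v, [[0, 0], [1, 0]], mode="constant", constant_values=max_neg_val)[:, :-1]
        v1 = v  # best score if we stayed on the same text token
        max_mask = v1 >= v0
        v_max = np.where(max_mask, v1, v0)
        direction[:, :, j] = max_mask  # 1 = stayed, 0 = advanced from the previous token
        index_mask = x_range <= j      # token i is unreachable before frame i
        v = np.where(index_mask, v_max + value[:, :, j], max_neg_val)

    direction = np.where(mask, direction, 1)

    # Backtrack from the last valid text token of each item.
    path = np.zeros(value.shape, dtype=np.float32)
    index = mask[:, :, 0].sum(1).astype(np.int64) - 1
    index_range = np.arange(b)
    for j in reversed(range(t_y)):
        path[index_range, index, j] = 1
        index = index + direction[index_range, index, j] - 1
    path = path * mask.astype(np.float32)
    return torch.from_numpy(path).to(device=device, dtype=dtype)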

lexkoro commented 2 months ago

That's not what I suggested...

bharathraj-v commented 2 months ago

Apologies, I totally misunderstood what you were suggesting. I modified the maximum_path() function in pflow/models/components/aligner.py, replacing its logic with that of maximum_path_numpy() from coqui-ai/TTS, and tried again; the error still persists.

I'm not sure if that's the change I was supposed to make; if it's incorrect, please let me know. Also, can you point me to how I can run this training with gdb so I can try that?
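On the gdb question: one common approach is gdb --args python <your train command>, then run, and bt once it segfaults (py-bt also works if the python-gdb extensions are installed). A lighter-weight option is Python's built-in faulthandler, which prints a Python-level traceback on a segfault; a minimal sketch, assuming you can add it near the top of the training entry point:

import faulthandler

# Print the Python-level traceback of all threads if the process receives
# SIGSEGV, SIGFPE, SIGABRT, SIGBUS or SIGILL, instead of dying silently.
faulthandler.enable()

# Optionally also dump all stack traces every 10 minutes, which helps when
# the process hangs rather than crashes.
# faulthandler.dump_traceback_later(600, repeat=True)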

p0p4k commented 2 months ago

Loop through your files and check if any audio file is empty.
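A rough sketch of such a check, assuming a pipe-separated filelist with the wav path in the first column (the filelist path here is a placeholder; adjust the parsing to your config):

import torchaudio

filelist = "data/filelists/train.txt"  # hypothetical path, adjust to your setup

with open(filelist, encoding="utf-8") as f:
    for line in f:
        wav_path = line.strip().split("|")[0]
        info = torchaudio.info(wav_path)
        duration = info.num_frames / info.sample_rate
        if info.num_frames == 0 or duration < 0.05:
            print(f"Suspicious file: {wav_path} ({info.num_frames} frames, {duration:.3f}s)")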

p0p4k commented 2 months ago

Also try LJSpeech; if that works, it means the problem is in the dataset.

Tera2Space commented 2 months ago

> Apologies, I totally misunderstood what you were suggesting. I modified the maximum_path() function in pflow/models/components/aligner.py, replacing its logic with that of maximum_path_numpy() from coqui-ai/TTS, and tried again; the error still persists.
>
> I'm not sure if that's the change I was supposed to make; if it's incorrect, please let me know. Also, can you point me to how I can run this training with gdb so I can try that?

You need to use maximum_path_numpy in pflow/models/pflow_tts.py at line 152.

patriotyk commented 2 months ago

Looks like this issue is a duplicate of #24. Also, I just noticed that this repository has two implementations of MAS: the first one is in pflow/models/components/aligner.py, and the second one is written in Cython. It is the Cython one that crashes.

So you should replace it with the PyTorch or NumPy implementation at this line: https://github.com/p0p4k/pflowtts_pytorch/blob/e700677ddef9f2ff3895c342aa0698e44baaec04/pflow/models/pflow_tts.py#L152

bharathraj-v commented 2 months ago

Hi, apologies for the late response. I changed

from pflow.utils.monotonic_align import maximum_path
attn = (
        maximum_path(neg_cent, attn_mask.squeeze(1)).unsqueeze(1).detach()
)

to

from pflow.models.components.aligner import maximum_path
attn = (
        maximum_path(neg_cent, attn_mask.squeeze(1)).unsqueeze(1).detach()
)

The crashes stopped, but the model, even at epoch 200, is synthesizing indecipherable noise. I tried this with both LJSpeech and the custom dataset I used to train Matcha before, and in both cases the modification made the model synthesize gibberish. Also, the Cython crashes are not happening with LJSpeech, just the custom dataset.

Tensorboard for LJSpeech with the modified maximum_path(): https://github.com/p0p4k/pflowtts_pytorch/assets/69118968/229343af-33ae-4d25-9a5c-aa6296f71c3d

p0p4k commented 2 months ago

If the Cython crashes do not happen with LJSpeech, make sure none of the audio in your custom dataset is of zero length. Take a very small subset of the custom dataset (50 samples) and train the model on that. See if Cython still crashes.
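For the subset step, carving out a tiny debug filelist can be as simple as the following (the paths are assumptions; adjust to your config):

# Take the first 50 lines of the training filelist as a tiny debug set.
src = "data/filelists/train.txt"             # hypothetical paths
dst = "data/filelists/train_subset50.txt"

with open(src, encoding="utf-8") as fin, open(dst, "w", encoding="utf-8") as fout:
    fout.writelines(fin.readlines()[:50])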

patriotyk commented 2 months ago

This will not fix the problem; we tried that. Also, it doesn't crash every epoch, so if there were zero-length files causing the crash, it should crash whenever pflow processes them. We also tried logging the crashing batch to the terminal and removing it from the dataset for the next run, and that didn't help. It looks like it crashes randomly, and only with big datasets. I was able to train without modifications and without crashes on a 45-hour dataset, but on 300 hours it started to crash. @bharathraj-v What size is your dataset? Also, it is very strange that the implementation in pflow/models/components/aligner.py doesn't work. Maybe it returns the result in a different format? But what I know definitely works is our Numba JIT implementation: aligner_jit.txt
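The attached aligner_jit.txt is not reproduced here, but for readers of this thread, a Numba-JIT MAS typically looks roughly like the sketch below (same assumed [b, t_text, t_mel] layout as the NumPy sketch above; the actual attachment may differ):

import numba
import numpy as np
import torch


@numba.njit(cache=True)
def _mas_single(path, value, t_x, t_y, max_neg_val=-1e9):
    # Forward pass: value[i, j] becomes the best cumulative score of any
    # monotonic path that ends at text token i on frame j.
    for j in range(t_y):
        for i in range(max(0, t_x + j - t_y), min(t_x, j + 1)):
            v_cur = max_neg_val if i == j else value[i, j - 1]
            if i == 0:
                v_prev = 0.0 if j == 0 else max_neg_val
            else:
                v_prev = value[i - 1, j - 1]
            value[i, j] += max(v_prev, v_cur)
    # Backtracking from the last text token at the last frame.
    index = t_x - 1
    for j in range(t_y - 1, -1, -1):
        path[index, j] = 1
        if index != 0 and (index == j or value[index, j - 1] < value[index - 1, j - 1]):
            index -= 1


@numba.njit(parallel=True, cache=True)
def _mas_batch(paths, values, t_xs, t_ys):
    for b in numba.prange(paths.shape[0]):
        _mas_single(paths[b], values[b], t_xs[b], t_ys[b])


def maximum_path_jit(neg_cent, mask):
    # neg_cent, mask: [b, t_text, t_mel] torch tensors (assumed layout).
    device, dtype = neg_cent.device, neg_cent.dtype
    values = (neg_cent * mask).detach().cpu().numpy().astype(np.float32)
    paths = np.zeros_like(values, dtype=np.int32)
    t_xs = mask[:, :, 0].sum(1).detach().cpu().numpy().astype(np.int32)
    t_ys = mask[:, 0, :].sum(1).detach().cpu().numpy().astype(np.int32)
    _mas_batch(paths, values, t_xs, t_ys)
    return torch.from_numpy(paths).to(device=device, dtype=dtype)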

bharathraj-v commented 2 months ago

@p0p4k, with the Cython implementation of maximum_path, it's not crashing with a small number of samples (60 training and 20 validation). The dataset I'm using is around 25 hours; all of the data is > 2 s in duration.

@patriotyk, with maximum_path_jit, the training stopped crashing and the model is learning. It's taking 3.5 min per epoch and is currently at 57 epochs. I will update the thread again if it crashes; otherwise I'll close the issue.

May I ask what the recommended number of epochs is for this data size? Also, unrelated to the issue, what are the key takeaways for VITS2 vs P-Flow?

Thanks!

p0p4k commented 2 months ago

P-Flow is diffusion generalized. I think it's better than VITS.

bharathraj-v commented 2 months ago

The training is not going past 1730 epochs; the validation step after that epoch crashes with this error:

  File "/home/azureuser/users/bharath/pflowtts_pytorch/pflow/models/baselightningmodule.py", line 224, in on_validation_end
    output = self.synthesise(x[:, :x_lengths], x_lengths, prompt=prompt_slice, n_timesteps=10, guidance_scale=0.0)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/anaconda/envs/pflow_tts/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/azureuser/users/bharath/pflowtts_pytorch/pflow/models/pflow_tts.py", line 93, in synthesise
    y_mask = sequence_mask(y_lengths, y_max_length_).unsqueeze(1).to(x_mask.dtype)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/azureuser/users/bharath/pflowtts_pytorch/pflow/utils/model.py", line 10, in sequence_mask
    x = torch.arange(max_length, dtype=length.dtype, device=length.device)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: upper bound and larger bound inconsistent with step sign

I think y_max_length is being assigned a negative value. The training so far seems to have gone well, but it's crashing now and I'm looking to train for more epochs. Any help regarding this error?
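For context on the error itself: torch.arange(n) with a negative n raises exactly this "upper bound and larger bound inconsistent with step sign" message, so a negative (or NaN-derived) predicted length reaching sequence_mask is consistent with that reading. A small defensive check one could add before building the mask; the names here are illustrative, not pflow's actual API:

import torch

def check_predicted_lengths(y_lengths: torch.Tensor) -> torch.Tensor:
    # Fail loudly if the predicted mel lengths are non-positive, which in
    # practice usually means the duration predictor produced NaN/Inf values
    # that were then cast to integer garbage.
    if (y_lengths <= 0).any():
        raise ValueError(
            f"Non-positive predicted length(s): {y_lengths.tolist()} - "
            "the duration predictor has likely diverged (NaN losses upstream)."
        )
    return y_lengths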

lexkoro commented 2 months ago

@bharathraj-v did you check tensorboard?

bharathraj-v commented 2 months ago

> @bharathraj-v did you check tensorboard?

Yes, here's the tensorboard: https://github.com/p0p4k/pflowtts_pytorch/assets/69118968/543831fd-f737-455e-9f8c-c7c9a8b45229 Not sure why the epoch logs only go up to 400, whereas the rest of the logs cover all 1730 epochs.

lexkoro commented 2 months ago

@bharathraj-v Your training collapsed since you have NaN values.
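If you want to catch the NaN earlier next time, two trainer-level knobs commonly used with a Lightning setup like this one are anomaly detection and gradient clipping. A rough sketch (whether you set these in the hydra trainer config or in code is up to you; values are illustrative):

from pytorch_lightning import Trainer  # or `from lightning import Trainer`, depending on version

trainer = Trainer(
    detect_anomaly=True,    # raise as soon as NaN/Inf shows up in the backward pass
    gradient_clip_val=1.0,  # clipping reduces the chance of loss blow-ups
)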