The tensorboard for the training that crashed at epoch 23: https://github.com/p0p4k/pflowtts_pytorch/assets/69118968/c6ab1c7d-4f91-442b-841f-7f15971e7e81
My CUDA version/drivers: NVIDIA-SMI 545.23.08, Driver Version 545.23.08, CUDA Version 12.3
What happens when num_workers is 0 or 1 in the dataloader?
It doesn't seem to make any difference; the same segfault occurred with num_workers set to 0.
Might be a wild guess, but you could try using the numpy version of the maximum_path search: https://github.com/coqui-ai/TTS/blob/dbf1a08a0d4e47fdad6172e433eeb34bc6b13b4e/TTS/tts/utils/helpers.py#L197
A long time ago I also had problems with segfaults; running the training with gdb showed that it was related to maximum_path, and using the numpy version fixed it.
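For readers following along, the numpy-style MAS referenced above is essentially the glow-tts dynamic program. A minimal sketch in that spirit (an illustrative re-implementation, not the exact coqui-ai helper, assuming `value` and `mask` of shape `[b, t_x, t_y]`):

```python
import numpy as np
import torch


def maximum_path_numpy(value, mask, max_neg_val=-1e9):
    """Monotonic alignment search, numpy sketch.

    value: [b, t_x, t_y] score tensor, mask: [b, t_x, t_y] binary mask.
    Returns a hard alignment path of the same shape as `value`.
    """
    device, dtype = value.device, value.dtype
    value = (value * mask).detach().cpu().numpy().astype(np.float32)
    mask = mask.detach().cpu().numpy().astype(bool)

    b, t_x, t_y = value.shape
    direction = np.zeros(value.shape, dtype=np.int64)
    v = np.zeros((b, t_x), dtype=np.float32)
    x_range = np.arange(t_x, dtype=np.float32).reshape(1, -1)

    # Forward pass: accumulate best scores column by column.
    for j in range(t_y):
        v0 = np.pad(v, [[0, 0], [1, 0]], mode="constant", constant_values=max_neg_val)[:, :-1]
        v1 = v
        max_mask = v1 >= v0
        v_max = np.where(max_mask, v1, v0)
        direction[:, :, j] = max_mask
        index_mask = x_range <= j
        v = np.where(index_mask, v_max + value[:, :, j], max_neg_val)
    direction = np.where(mask, direction, 1)

    # Backtracking: recover the monotonic path from the last frame.
    path = np.zeros(value.shape, dtype=np.float32)
    index = mask[:, :, 0].sum(1).astype(np.int64) - 1
    index_range = np.arange(b)
    for j in reversed(range(t_y)):
        path[index_range, index, j] = 1
        index = index + direction[index_range, index, j] - 1
    path = path * mask.astype(np.float32)
    return torch.from_numpy(path).to(device=device, dtype=dtype)
```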
That's not what I suggested...
Apologies, I totally misunderstood what you were suggesting. I modified the maximum_path() function in pflow/models/components/aligner.py, replacing its logic with that of maximum_path_numpy() from coqui-ai/TTS, and tried again, but the error still persists.
I'm not sure if that's the change I was supposed to make; if it's incorrect, please let me know. Also, can you point me to how I can run this training with gdb so I can try that?
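As an aside, the usual way to get a native backtrace is to launch the same training command under gdb (e.g. `gdb --args python <your training entry point>`, then `run`, and `bt` after the crash). Short of that, Python's built-in faulthandler often narrows down which Python call triggered the segfault. A minimal sketch, to be added near the top of whatever script launches training:

```python
# Dump the Python-level stack of every thread if the process receives SIGSEGV.
# This won't show the C/Cython frame that actually crashed (gdb is needed for that),
# but it usually reveals which Python call was in flight.
import faulthandler
import sys

faulthandler.enable(file=sys.stderr, all_threads=True)
```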
Loop through your files and check if any audio file is empty (a sketch of such a check follows below).
Also try LJSpeech; if that works, the problem is in your dataset.
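A minimal sketch of such a check, assuming LJSpeech-style filelists with `path|text` rows (the filelist path and duration threshold are illustrative):

```python
import soundfile as sf


def find_bad_audio(filelist_path, min_seconds=0.05):
    """Return audio paths that are unreadable, empty, or shorter than min_seconds."""
    bad = []
    with open(filelist_path, encoding="utf-8") as f:
        for line in f:
            wav_path = line.strip().split("|")[0]
            try:
                info = sf.info(wav_path)
                if info.frames == 0 or info.frames / info.samplerate < min_seconds:
                    bad.append(wav_path)
            except RuntimeError:
                bad.append(wav_path)  # file could not be opened at all
    return bad


print(find_bad_audio("data/filelists/train.txt"))
```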
You need to use maximum_path_numpy in pflow/models/pflow_tts.py at line 152.
Looks like this issue is a duplicate of #24. Also, I just noticed that this repository has two implementations of MAS: the first one is in pflow/models/components/aligner.py, and the second one is written in Cython. The Cython one is the one that crashes.
So you should replace it with the PyTorch or numpy implementation at this line: https://github.com/p0p4k/pflowtts_pytorch/blob/e700677ddef9f2ff3895c342aa0698e44baaec04/pflow/models/pflow_tts.py#L152
Hi, apologies for the late response. I changed
```python
from pflow.utils.monotonic_align import maximum_path

attn = (
    maximum_path(neg_cent, attn_mask.squeeze(1)).unsqueeze(1).detach()
)
```
to
```python
from pflow.models.components.aligner import maximum_path

attn = (
    maximum_path(neg_cent, attn_mask.squeeze(1)).unsqueeze(1).detach()
)
```
The crashes stopped, but the model, even at epoch 200, is synthesizing indecipherable noise. I tried this with both LJSpeech and the custom dataset I used to train Matcha before, and in both cases the modification made the model synthesize gibberish. Also, the Cython crashes are not happening with LJSpeech, just with the custom dataset.
Tensorboard for LJSpeech with the modified maximum_path(): https://github.com/p0p4k/pflowtts_pytorch/assets/69118968/229343af-33ae-4d25-9a5c-aa6296f71c3d
If the Cython crashes do not happen with LJSpeech, make sure none of the audio in your custom dataset is of 0 length. Take a very small subset of the custom dataset (50 samples) and train the model on that; see if Cython still crashes.
This will not fix the problem; we tried that. Also, it doesn't crash every epoch, so if zero-length files were causing the crash, it should crash whenever pflow processes them. We also tried logging the crashing batch to the terminal and removing it from the dataset for the next run, and that didn't help. It looks like it crashes randomly, and only with big datasets: I was able to train without modifications and without crashes on a 45-hour dataset, but on 300 hours it started to crash. @bharathraj-v What size is your dataset? Also, it is very strange that the implementation in pflow/models/components/aligner.py doesn't work. Maybe it returns the result in a different format? But what I know definitely works is our numba JIT implementation:
aligner_jit.txt
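For readers without the attachment, a numba-jitted MAS typically looks like the sketch below: an illustrative port of the standard VITS-style kernel, not necessarily identical to aligner_jit.txt, assuming `neg_cent` and `mask` of shape `[b, t_y, t_x]`:

```python
import numba
import numpy as np
import torch


@numba.jit(nopython=True, nogil=True)
def maximum_path_jit(paths, values, t_ys, t_xs):
    """Batched monotonic alignment search on CPU; fills `paths` in place."""
    b = paths.shape[0]
    max_neg_val = -1e9
    for i in range(b):
        path = paths[i]
        value = values[i]
        t_y = t_ys[i]
        t_x = t_xs[i]

        index = t_x - 1

        # Forward pass: accumulate best scores under the monotonic constraint.
        for y in range(t_y):
            for x in range(max(0, t_x + y - t_y), min(t_x, y + 1)):
                v_cur = max_neg_val if x == y else value[y - 1, x]
                if x == 0:
                    v_prev = 0.0 if y == 0 else max_neg_val
                else:
                    v_prev = value[y - 1, x - 1]
                value[y, x] += max(v_prev, v_cur)

        # Backtracking: walk from the last frame back to the first.
        for y in range(t_y - 1, -1, -1):
            path[y, index] = 1
            if index != 0 and (index == y or value[y - 1, index] < value[y - 1, index - 1]):
                index = index - 1


def maximum_path(neg_cent, mask):
    """neg_cent, mask: [b, t_y, t_x] torch tensors; returns a hard path of the same shape."""
    device, dtype = neg_cent.device, neg_cent.dtype
    neg_cent = neg_cent.detach().cpu().numpy().astype(np.float32)
    path = np.zeros(neg_cent.shape, dtype=np.int32)
    t_ys = mask.sum(1)[:, 0].detach().cpu().numpy().astype(np.int32)
    t_xs = mask.sum(2)[:, 0].detach().cpu().numpy().astype(np.int32)
    maximum_path_jit(path, neg_cent, t_ys, t_xs)
    return torch.from_numpy(path).to(device=device, dtype=dtype)
```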
@p0p4k, with the Cython implementation of maximum_path, it's not crashing on a small subset (60 training and 20 val samples). The dataset I'm using is around 25 hours, and all of the data is > 2 s in duration.
@patriotyk, with maximum_path_jit, the training stopped crashing and the model is learning. It's taking 3.5 min per epoch and is currently at 57 epochs. I will update the thread again in case it crashes; otherwise I'll close the issue.
May I ask what the recommended number of epochs is for this data size? Also, unrelated to the issue, what are the key takeaways for VITS2 vs. pflow?
Thanks!
Pflow is diffusion-generalized; I think it's better than VITS.
The training is not going past 1730 epochs; the validation step after that epoch crashes with this error:
File "/home/azureuser/users/bharath/pflowtts_pytorch/pflow/models/baselightningmodule.py", line 224, in on_validation_end
output = self.synthesise(x[:, :x_lengths], x_lengths, prompt=prompt_slice, n_timesteps=10, guidance_scale=0.0)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/anaconda/envs/pflow_tts/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/azureuser/users/bharath/pflowtts_pytorch/pflow/models/pflow_tts.py", line 93, in synthesise
y_mask = sequence_mask(y_lengths, y_max_length_).unsqueeze(1).to(x_mask.dtype)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/azureuser/users/bharath/pflowtts_pytorch/pflow/utils/model.py", line 10, in sequence_mask
x = torch.arange(max_length, dtype=length.dtype, device=length.device)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: upper bound and larger bound inconsistent with step sign
I think y_max_length is being assigned a negative value. The training so far seems to have gone well, but it's crashing now and I'm looking to train for more epochs. Any help regarding this error?
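For context, that RuntimeError is what torch.arange raises when its end value is negative, which is consistent with y_lengths (and hence y_max_length) going negative, e.g. after the duration prediction blows up. A minimal repro sketch; the clamp at the end is only an illustrative guard, not the repo's fix:

```python
import torch

y_lengths = torch.tensor([-3])          # e.g. a blown-up duration prediction
try:
    torch.arange(int(y_lengths.max()))  # end = -3 with the default step of +1
except RuntimeError as e:
    print(e)                            # "upper bound and larger bound inconsistent with step sign"

# Illustrative guard: clamp predicted lengths to at least one frame before building the mask.
y_lengths = torch.clamp(y_lengths, min=1)
mask = torch.arange(int(y_lengths.max())).unsqueeze(0) < y_lengths.unsqueeze(1)
print(mask)
```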
@bharathraj-v did you check tensorboard?
Yes, here's the tensorboard: https://github.com/p0p4k/pflowtts_pytorch/assets/69118968/543831fd-f737-455e-9f8c-c7c9a8b45229 Not sure why the epoch logs only go up to 400 whereas the rest of the logs cover all 1730 epochs.
@bharathraj-v Your training collapsed since you have NaN values.
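A small, generic way to catch this earlier is to assert that the losses stay finite each step; a minimal sketch with illustrative names (dur_loss, prior_loss, diff_loss are placeholders, not necessarily the repo's loss names):

```python
import torch


def assert_finite(name: str, value: torch.Tensor) -> torch.Tensor:
    """Fail fast with a readable message instead of silently propagating NaN/Inf."""
    if not torch.isfinite(value).all():
        raise RuntimeError(f"{name} is NaN/Inf; stopping before the model collapses further")
    return value


# Example usage inside a (hypothetical) training_step:
# dur_loss = assert_finite("dur_loss", dur_loss)
# prior_loss = assert_finite("prior_loss", prior_loss)
# diff_loss = assert_finite("diff_loss", diff_loss)
```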
Hi,
I modified pflow/text/symbols.py, pflow/text/cleaners.py, and configs/model/pflow.yaml, similarly to what I had done in Matcha-TTS, to make the repo work with an Indian language.
But the training keeps crashing with a
Segmentation fault (core dumped)
error after a certain number of epochs. The epoch at which the segfault happens varies with the batch size, with higher batch sizes crashing sooner. I'm training with 20 num_workers on an NVIDIA A100 80GB. The batch size I last used was 14, which crashed at epoch 23; previously a batch size of 17 crashed at epoch 8, and anything higher than that went OOM. Any guidance regarding this issue would be of great help.
Thank you!
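For readers adapting the text front end the same way, the change usually amounts to extending the symbol inventory and the cleaner; a rough sketch of the pattern (the Devanagari range and the exact file layout are illustrative, not the actual edits from this issue):

```python
# pflow/text/symbols.py (illustrative): extend the symbol inventory with the
# characters produced by your cleaner, e.g. a Devanagari block for an Indic language.
_pad = "_"
_punctuation = ';:,.!?¡¿—…"«»“” '
_letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
_indic = "".join(chr(c) for c in range(0x0900, 0x0980))  # Devanagari code points

symbols = [_pad] + list(_punctuation) + list(_letters) + list(_indic)
assert len(symbols) == len(set(symbols)), "duplicate symbols break the embedding lookup"
```

Whatever field carries the vocabulary size in configs/model/pflow.yaml (if there is one) then has to match len(symbols), and checkpoints trained with the old symbol table become incompatible.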