p0p4k / pflowtts_pytorch

Unofficial implementation of NVIDIA P-Flow TTS paper
https://neurips.cc/virtual/2023/poster/69899
MIT License

Crash in MAS #24

Open patriotyk opened 5 months ago

patriotyk commented 5 months ago

We are experiencing a strange issue. With one of our big datasets (about 300 hours), MAS randomly crashes. The core dump shows the following line:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007feaf9884576 in __pyx_f_5pflow_5utils_15monotonic_align_4core_maximum_path_each (__pyx_v_path=..., __pyx_v_value=..., __pyx_v_path=..., __pyx_v_value=..., __pyx_optional_args=0x0, __pyx_v_t_x=289, __pyx_v_t_y=279)
    at pflow/utils/monotonic_align/core.c:17615
17615       if (__pyx_t_7) {

We have tried everything, but nothing helped. The only thing that worked was replacing MAS with AlignerNet, but that introduced another issue: a crash at inference. Maybe the synthesis method requires some changes too?

I have successfully trained pflowtts on a single-speaker dataset that is a subset of this bigger dataset, and it sounds great. A demo is here: https://tts.patriotyk.name

I have also built and pushed to a registry a Docker image that can be used to reproduce this issue; you just need to pull and run it. I can share the URL in a private message if you need it.

p0p4k commented 5 months ago

For the MAS debug: I am not good at C++ yet. I can suggest one thing: just run MAS (encoder + spectrogram) and delete everything else in the batch. At some batch it will fail; open that batch and run one sample at a time, and you will find the sample that causes the problem.
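Since a segfault kills the whole Python process, the per-sample search above is easier if each candidate runs in a fresh interpreter: the parent survives the crash and records which sample died. A minimal sketch (the per-sample runner snippet and its names are assumptions; wire it to your own data loading):

```python
import subprocess
import sys

def run_isolated(code: str) -> int:
    """Run a Python snippet in a child interpreter and return its
    exit code. A segfault in a C extension (like the compiled MAS
    core) shows up as a negative exit code (-SIGSEGV) in the parent
    instead of taking down the whole search loop."""
    return subprocess.run([sys.executable, "-c", code]).returncode

# Sketch of the search loop: `run_mas_on_sample` is a hypothetical
# helper of yours that loads sample i and calls maximum_path on it.
# for i in range(num_samples):
#     snippet = f"from my_debug import run_mas_on_sample; run_mas_on_sample({i})"
#     if run_isolated(snippet) != 0:
#         print("sample", i, "crashes MAS")

print(run_isolated("pass"))                    # 0: clean exit
print(run_isolated("import os; os._exit(3)"))  # 3: simulated failure
```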

patriotyk commented 5 months ago

Thank you for your fast answer. We have found the files that cause the crash, but they look normal. After removing such a file we are able to run training, but then it crashes on another one. It can also run a few epochs without crashing, so all files have been successfully used for training, and then it still crashes. Maybe it would be easier to switch to AlignerNet? I uncommented the code that you commented out in the constructor and in the forward method, and commented out the call to MAS. This works fine: it trains without crashes, but inference crashes. Maybe you could help us with this? Do I need to change something in the synthesise method to make it work properly?

p0p4k commented 5 months ago

Sure, I'll fix AlignerNet synthesis this week. What is the error during inference?

patriotyk commented 5 months ago

Oh, you have edited your answer. It was a crash. If you need more info I can try to run it again and tell you. But I think I may have made some mistake. Maybe you could push your changes to a separate branch somewhere and I will compare.

p0p4k commented 5 months ago

Someone else tried AlignerNet and it worked OK for them. So I am not sure how to debug without an error; if it's a crash, maybe it's a dataset issue? Does AlignerNet train on the small subset?

Tera2Space commented 5 months ago

> I have uncommented code that you commented in constructor and in forward method and commented call to MAS. This works fine, it trains without crashes but inference crashes.

But AlignerNet isn't used during inference

p0p4k commented 5 months ago

True. If it crashes during training, I think it is the dataset issue.

patriotyk commented 5 months ago

No, it doesn't crash during training. I got the crash after I loaded the trained checkpoint and called the synthesis method.

p0p4k commented 5 months ago

Send some random input to the duration predictor, does it predict something?

Tera2Space commented 5 months ago

> No it doesn't crash during the training. I got crash after I loaded trained checkpoint on synthesis method.

Is the error something like "out of memory"?

patriotyk commented 5 months ago

> > No it doesn't crash during the training. I got crash after I loaded trained checkpoint on synthesis method.
>
> Is the error something like "out of memory"?

This evening I will try to run it again and tell you more details.

Tera2Space commented 5 months ago

> > > No it doesn't crash during the training. I got crash after I loaded trained checkpoint on synthesis method.
> >
> > Is the error something like "out of memory"?
>
> This evening I will try to run it again and tell you more details.

If you want, we can talk on Telegram (https://t.me/TeraSpace). I speak Ukrainian and Russian.

patriotyk commented 5 months ago

@p0p4k I have tried again, and yes, on inference I got an out-of-memory error, same as @Tera2Space mentioned.

p0p4k commented 5 months ago

So, maybe try inference on a small sentence? Does that work? If it does, then it is just a memory issue and not a code issue.

patriotyk commented 5 months ago

It is a small sentence, and no, it doesn't work. It runs for a very long time and then crashes. https://drive.google.com/file/d/1WaIYiloaf3oDVtkWb5LH8YN0XWW2KXbR/view?usp=drivesdk

Tera2Space commented 5 months ago

Yep, same. I believe that AlignerNet didn't converge, so the duration predictor learned wrong alignments; at inference the audio becomes very long and causes the out-of-memory error.
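To see why bad durations show up as OOM rather than just bad audio: models in this family typically predict durations in log scale, so a duration predictor trained on wrong alignments can output values whose exponential is enormous. A rough sketch of the arithmetic (the numbers are illustrative, not from the repo):

```python
import numpy as np

# Hypothetical log-durations for a 20-phoneme sentence. A healthy
# predictor outputs roughly log(5..10) frames per phoneme.
healthy_logw = np.full(20, np.log(8.0))
broken_logw = np.full(20, 9.0)  # a predictor that never converged

def total_frames(logw):
    # Durations are recovered as round(exp(logw)), one per phoneme.
    return int(np.round(np.exp(logw)).sum())

print(total_frames(healthy_logw))  # 160 frames: a short utterance
print(total_frames(broken_logw))   # 162060 frames

# Decoder activations grow linearly with the frame count, so the
# broken case asks for roughly 1000x the memory of the healthy one.
```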

Tera2Space commented 5 months ago

> So, maybe try a small sentence inference? Does that work? If it does, then just memory issue and not code issue.

1500 GiB of VRAM... I think it's a code issue, because it happens at evaluation during training, while with MAS it works fine.

p0p4k commented 5 months ago

Give the model some random durations instead of using the duration predictor and try to see the output. (One duration integer per phoneme)
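A minimal sketch of what feeding the model fixed durations means, assuming the usual length-regulator-style expansion of encoder outputs (the function name and shapes here are illustrative, not the repo's API):

```python
import numpy as np

def expand_by_durations(enc_out, durations):
    """Repeat each phoneme's encoder vector durations[i] times,
    mimicking a hard alignment without the duration predictor."""
    return np.repeat(enc_out, durations, axis=0)

# 3 phonemes, each with a 2-dim hidden state
enc_out = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
fixed = np.array([2, 3, 1])  # hand-picked frames per phoneme
frames = expand_by_durations(enc_out, fixed)
print(frames.shape)  # (6, 2): 2 + 3 + 1 frames
```

If the decoder produces reasonable audio from hand-picked durations like these, the problem is isolated to the duration predictor rather than the rest of the synthesis path.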

patriotyk commented 5 months ago

Sorry, but I don't know how to do that.

Tera2Space commented 5 months ago

I will try later to clamp the output of the aligner.
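Clamping here presumably means capping the predicted per-phoneme durations so a bad prediction cannot request an absurd number of frames. A hedged sketch (the cap value and function name are assumptions, not code from the repo):

```python
import numpy as np

def clamp_durations(logw, min_frames=1, max_frames=30):
    """Convert log-durations to integer frame counts, capped so a
    single phoneme can never demand more than max_frames frames."""
    dur = np.round(np.exp(logw)).astype(np.int64)
    return np.clip(dur, min_frames, max_frames)

print(clamp_durations(np.array([np.log(8.0), 9.0, -5.0])))
# a sane value passes through as 8, a blown-up one is capped at 30,
# and a collapsed one is floored at 1
```

This only masks the symptom (the OOM), but it makes the model usable enough to inspect what the predictor actually learned.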

Tera2Space commented 4 months ago

Now I'm wondering if the problem might be that we use the text encoder outputs as input to AlignerNet, and those outputs are passed through a convolution (to get dimensions matching the mel frames). While I was testing a pitch predictor, it didn't work when conditioned on the output of the text encoder, but when I tried to use x_emb directly, it worked.

I will test, and if it works I will create a PR.

p0p4k commented 4 months ago

> Now I’m wondering if the problem might be that we use text encoder outputs as input to alignernet, which(text encoder outputs) are passed through convolution (to get dimensions like mel frame)? Because while I was testing pitch predictor it didn't work when conditioned on output of text encoder, but when i tried to use x_emb directly it worked.
>
> I will test and if work I will create PR

Very interesting 🤔

lexkoro commented 2 months ago

Also, a wild guess: do you mind trying the numpy version of the maximum_path search? https://github.com/coqui-ai/TTS/blob/dbf1a08a0d4e47fdad6172e433eeb34bc6b13b4e/TTS/tts/utils/helpers.py#L197

A long time ago I also had problems with seg faults; running the training under gdb showed it was related to maximum_path, and using the numpy version fixed it.
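For reference, here is a self-contained numpy sketch of monotonic alignment search in the shape the linked helper uses (`value` and `mask` are `[batch, t_x, t_y]`, returning a 0/1 alignment path). It follows the Glow-TTS-style dynamic program; check it against the linked coqui helper before swapping it in:

```python
import numpy as np

def maximum_path_numpy(value, mask, max_neg_val=-np.inf):
    """Monotonic alignment search over log-likelihoods.
    value, mask: [b, t_x, t_y]; returns a 0/1 path of the same shape."""
    value = value * mask
    b, t_x, t_y = value.shape
    direction = np.zeros(value.shape, dtype=np.int64)
    v = np.zeros((b, t_x), dtype=np.float64)
    x_range = np.arange(t_x, dtype=np.float64).reshape(1, -1)
    # Forward pass: best cumulative score for each (phoneme, frame).
    for j in range(t_y):
        # v0: score if we advance from the previous phoneme; v1: stay.
        v0 = np.pad(v, [[0, 0], [1, 0]], mode="constant",
                    constant_values=max_neg_val)[:, :-1]
        v1 = v
        max_mask = v1 >= v0
        v_max = np.where(max_mask, v1, v0)
        direction[:, :, j] = max_mask
        # Phoneme i is only reachable once at least i frames have passed.
        index_mask = x_range <= j
        v = np.where(index_mask, v_max + value[:, :, j], max_neg_val)
    direction = np.where(mask, direction, 1)
    # Backward pass: trace the argmax decisions from the last frame.
    path = np.zeros(value.shape, dtype=np.float64)
    index = mask[:, :, 0].sum(1).astype(np.int64) - 1  # last valid phoneme
    index_range = np.arange(b)
    for j in reversed(range(t_y)):
        path[index_range, index, j] = 1
        index = index + direction[index_range, index, j] - 1
    return path * mask
```

As a sanity check, a [1, 2, 3] likelihood matrix where the second phoneme scores highest on the last two frames should yield the path `[[1, 0, 0], [0, 1, 1]]`.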