Open patriotyk opened 5 months ago
For MAS debugging, I am not good at C++ yet, but I can suggest one thing: run only MAS (encoder + spectrogram) and delete everything else from the batch step. At some batch it will fail; open that batch and run one sample at a time, and you will find the sample that causes the problem.
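The per-sample search described above can be sketched generically. Since a segfault in the Cython MAS kills the whole interpreter, running each sample in a child process is one way to survive the crash and keep scanning. This is only a sketch: `run_one_sample.py` is a hypothetical helper (not part of the repo) that would load sample `i` and run just encoder + spectrogram + MAS on it.

```python
import subprocess
import sys

def find_crashing_samples(n_samples, cmd_for):
    """Run one sample per child process; a segfault (or any non-zero
    exit) then kills only the child, and we record which sample
    index caused it."""
    bad = []
    for i in range(n_samples):
        result = subprocess.run(cmd_for(i))
        if result.returncode != 0:
            bad.append(i)
    return bad

# Hypothetical usage with a helper script that runs MAS on sample <i>:
# bad = find_crashing_samples(
#     n_samples=1000,
#     cmd_for=lambda i: [sys.executable, "run_one_sample.py", str(i)],
# )
```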
Thank you for your fast answer.
We have found files that cause the crash, but they look normal. After removing such a file we are able to train, but then it crashes on another one. It also may not crash for a few epochs, so all files have been used successfully for training before it crashes. Maybe it would be easier to switch to AlignerNet? I have uncommented the code that you commented out in the constructor and in the forward method, and commented out the call to MAS. This works fine: it trains without crashes, but inference crashes. Maybe you could help us with this? Do I need to change something in the synthesise method for it to work properly?
Sure, I'll fix AlignerNet synthesis this week. What is the error during inference?
Oh, you have edited your answer. It was a crash. If you need more info I can try to run it again and tell you. But I think I may have made some mistake. Maybe you could push your changes to a separate branch and I will compare.
Someone else tried AlignerNet and it worked fine for them, so I am not sure how to debug this without an error message. If it's a crash, maybe it's a dataset issue? Does AlignerNet train on the small subset?
I have uncommented code that you commented in constructor and in forward method and commented call to MAS. This works fine, it trains without crashes but inference crashes.
But AlignerNet isn't used during inference
True. If it crashes during training, I think it is a dataset issue.
No, it doesn't crash during training. I got the crash after loading the trained checkpoint, in the synthesis method.
Send some random input to the duration predictor; does it predict something?
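A minimal sketch of that sanity check, with NumPy standing in for the real PyTorch module (the projection weights and shapes here are made up; in the actual model you would feed a random tensor to the trained duration predictor instead):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the duration predictor: a 1x1-conv style projection
# from encoder channels down to one log-duration per phoneme.
x = rng.normal(size=(64, 50))        # [channels, phonemes] random "encoder output"
w = rng.normal(size=(1, 64)) / 8.0   # toy projection weights
log_dur = w @ x                      # [1, phonemes] predicted log-durations
dur = np.ceil(np.exp(log_dur)).clip(min=1)

# If this total is astronomically large for ~50 phonemes, the duration
# predictor (not the decoder) is what blows up memory at inference.
print("total predicted frames:", int(dur.sum()))
```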
Is the error something like "out of memory"?
This evening I will try to run it again and tell you more details.
If you want we can talk in telegram (https://t.me/TeraSpace) I speak Ukrainian and Russian.
@p0p4k I have tried again and yes, on inference I get an out-of-memory error. Same as @Tera2Space mentioned.
So, maybe try inference on a small sentence? Does that work? If it does, then it's just a memory issue and not a code issue.
It is a small sentence, and no, it doesn't work. It runs for a very long time, then crashes. https://drive.google.com/file/d/1WaIYiloaf3oDVtkWb5LH8YN0XWW2KXbR/view?usp=drivesdk
Yep, same. I believe AlignerNet didn't converge, so the duration predictor learned wrong alignments; at inference the audio becomes very long and causes the out of memory.
1500 GiB of VRAM... I think it's a code issue, because it happens at evaluation during training, while with MAS it works fine.
Give the model some random durations instead of using the duration predictor and check the output (one duration integer per phoneme).
Sorry, but I don't know how to do that.
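For reference, the "random durations" experiment could look something like this sketch: replace the predictor's output with random integers and length-regulate the encoder output by hand. All names and shapes here are assumptions, not the repo's API.

```python
import numpy as np

rng = np.random.default_rng(0)
n_phonemes, channels = 10, 4

# Stand-in for the text encoder output: one feature vector per phoneme.
phoneme_features = rng.normal(size=(n_phonemes, channels))

# Random durations, one integer (number of mel frames) per phoneme,
# bypassing the duration predictor entirely.
durations = rng.integers(3, 9, size=n_phonemes)  # 3-8 frames each

# Length regulation: repeat each phoneme vector durations[i] times so the
# decoder receives a frame-level sequence. If synthesis now produces a
# sanely sized (even if garbled) output, the decoder is fine and the
# duration predictor is the culprit.
frame_features = np.repeat(phoneme_features, durations, axis=0)
print(frame_features.shape)  # (sum of durations, channels)
```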
I will try later to clamp the output of the aligner.
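Clamping the predicted durations is a cheap guard against the runaway-length OOM described above; a sketch (the cap of 20 frames per phoneme is an arbitrary assumption, and the function name is made up):

```python
import numpy as np

def clamp_durations(durations, max_frames_per_phoneme=20):
    """Cap per-phoneme durations so a badly converged duration predictor
    cannot request an absurd number of mel frames and exhaust VRAM."""
    return np.clip(np.asarray(durations), 1, max_frames_per_phoneme)

# Example: a runaway prediction of 40000 frames gets capped to 20,
# and a zero-length prediction is raised to 1.
print(clamp_durations([0, 5, 40000, 12]).tolist())  # [1, 5, 20, 12]
```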
Now I'm wondering if the problem might be that we use the text encoder outputs as input to AlignerNet, and those outputs are passed through a convolution (to get dimensions like the mel frames). While I was testing the pitch predictor, it didn't work when conditioned on the output of the text encoder, but when I tried to use x_emb directly it worked.
I will test, and if it works I will create a PR.
Very interesting 🤔
Also, here's a wild guess: do you mind trying the numpy version of the maximum_path search? https://github.com/coqui-ai/TTS/blob/dbf1a08a0d4e47fdad6172e433eeb34bc6b13b4e/TTS/tts/utils/helpers.py#L197
A long time ago I also had problems with segfaults; running the training with gdb showed that it was related to maximum_path, and using the numpy version fixed it.
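For context, a numpy-only maximum path search is roughly the following dynamic program. This is a simplified, unbatched sketch of the idea behind the linked helper, not a copy of it:

```python
import numpy as np

NEG_INF = -1e9

def maximum_path_numpy(value):
    """Monotonic alignment search in pure NumPy (no Cython, so no
    hard-to-debug segfaults). `value` is [t_x, t_y]: log-likelihood of
    phoneme x aligning to mel frame y (t_y >= t_x). Returns a binary
    path of the same shape assigning every frame to exactly one
    phoneme, monotonically."""
    t_x, t_y = value.shape
    dp = np.full((t_x, t_y), NEG_INF, dtype=np.float64)
    # Forward pass: best monotonic score of any path ending at (x, y).
    for y in range(t_y):
        for x in range(max(0, t_x + y - t_y), min(t_x, y + 1)):
            v_stay = NEG_INF if x == y else dp[x, y - 1]   # stay on phoneme x
            if x == 0:
                v_step = 0.0 if y == 0 else NEG_INF        # path start
            else:
                v_step = dp[x - 1, y - 1]                  # advance a phoneme
            dp[x, y] = value[x, y] + max(v_stay, v_step)
    # Backtrack from the last phoneme at the last frame.
    path = np.zeros((t_x, t_y), dtype=np.int64)
    x = t_x - 1
    for y in range(t_y - 1, -1, -1):
        path[x, y] = 1
        if x != 0 and (x == y or dp[x, y - 1] < dp[x - 1, y - 1]):
            x -= 1
    return path

value = np.array([[5., 1., 0., 0., 0.],
                  [0., 5., 5., 1., 0.],
                  [0., 0., 0., 5., 5.]])
print(maximum_path_numpy(value).sum(axis=1))  # per-phoneme durations: [1 2 2]
```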
We are experiencing a strange issue. With one of our big datasets (about 300 hours), MAS randomly crashes. The core dump shows the following line:
We have tried everything, but nothing helped. The only thing that did was replacing MAS with AlignerNet, but then there was another issue: a crash at inference. Maybe the synthesis method requires some changes too? I have successfully trained pflowtts on a single-speaker dataset which is a subset of this bigger dataset, and it sounds great. Demo is here: https://tts.patriotyk.name
Also, I have built and pushed to a registry a Docker image that can be used to reproduce this issue; you just need to pull and run it. I can share the URL in a private message if you need it.