p0p4k / pflowtts_pytorch

Unofficial implementation of NVIDIA P-Flow TTS paper
https://neurips.cc/virtual/2023/poster/69899
MIT License
198 stars · 28 forks

High duration loss #40

Open · w11wo opened 3 months ago

w11wo commented 3 months ago

Hi @p0p4k, thanks for making this repo!

I am currently trying to train a 44.1kHz English model, but my model is struggling with a rather high duration loss when compared against your TensorBoard logs. It currently looks as follows:

[screenshot: TensorBoard duration loss curves]

It seems like the other loss terms are correct.

Also, when the generated mel-spectrogram is passed to a vocoder, the audio is very much wrong in pronunciation -- maybe only half right.

My P-Flow config can be found here, and the corresponding HiFi-GAN vocoder config can be found here.

Could you please let me know where I might be wrong? Thanks in advance!

Oleksandr2505 commented 3 months ago

Hello. Make sure the cleaner in synthesise.py matches the one in your JSON config when you generate English speech. With the default setup, the line in synthesise.py looks like this:

```python
sequence = torch.tensor(
    intersperse(text_to_sequence(stressed_text, ['english_cleaners2']), 0),
    dtype=torch.long,
)  # remaining arguments as in the original call
```
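So if you set "english_cleaners3" in your JSON config, the same line should read:

```python
sequence = torch.tensor(
    intersperse(text_to_sequence(stressed_text, ['english_cleaners3']), 0),
    dtype=torch.long,
)
```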

w11wo commented 3 months ago

Hi @Oleksandr2505.

Yes, I have changed both the training and inference phonemizer to english_cleaners3, so it's definitely not that issue. I've also checked the phonemization output, and it is correct.

It's also unrelated to the issue I raised, which is more on the training duration loss, not inference.

Oleksandr2505 commented 3 months ago

I am training my model on a 4-hour English dataset at 22.05kHz; I recently passed 1000 epochs and it sounds good. Maybe you should try switching the vocoder: at the start I had huge artifacts, and switching to a vocoder trained on the VCTK dataset (v1) made it better. Maybe you should look for another vocoder that matches your 44.1kHz sample rate. You also mentioned the training duration loss; is that something different from the regular loss? Btw, these are my logs:

[screenshot: TensorBoard logs]

w11wo commented 3 months ago

Yeah, your loss curves look very much like mine.

I'd say I probably don't have an issue with the vocoder, since it's not an issue of artifacts, but pronunciation quality. And yes, the HiFi-GAN I'm using has been trained on 44.1kHz, which isn't a problem.

My main suspect is my rather high duration loss (~2), versus the TensorBoard graphs posted in the README, where it easily drops below 1.0 early on; I assumed that was the expected trend. It's likely that this affects the pronunciation quality too.

I just wanted to make sure that my 44.1kHz P-Flow config is correct, in case I'm missing something. But thanks for your suggestions nonetheless.

Oleksandr2505 commented 3 months ago

Yeah, I see that the author got below 1, but we don't know his exact config, so we can only test and try. My duration loss is also very close to 1; it may eventually drop below that, but I can't say for sure. One more tip: when I trained a Ukrainian model and then ran inference with the English model, I also got half-right, half-wrong words. It was mainly caused by the cleaners, but I had also changed files in the pflow/text folder, such as "symbols". So make sure you didn't replace any of the English symbols or numbers with something else; that's how I fixed it.

[screenshot]
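For reference, a VITS-style symbols table looks roughly like this (an illustrative sketch, not the exact contents of pflow/text/symbols.py in this repo):

```python
# Illustrative sketch of a VITS-style symbols.py; not the exact file in this repo.
_pad = "_"
_punctuation = ";:,.!?'\" "  # trimmed for illustration
_letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
_letters_ipa = "..."  # full IPA inventory elided

# Each symbol's ID is its index in this list, so inserting or removing a
# character silently re-maps every symbol after it, and a model trained with
# one mapping will mispronounce text phonemized with another.
symbols = [_pad] + list(_punctuation) + list(_letters) + list(_letters_ipa)
```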

Oleksandr2505 commented 3 months ago

Btw, I would like to know: does your model synthesize decent, sufficiently long pauses between sentences? I could not find how to control that.

w11wo commented 3 months ago

@Oleksandr2505 I can't tell, to be honest. The pronunciation isn't good enough to judge the pauses. But from my experiments with other TTS models (VITS, VITS2), you just have to train on audio that contains pauses as long as the ones you want.

If you want more controllable pauses, I'd recommend something like FastSpeech2, which uses deterministic duration prediction, I think; see the sketch below. VITS/VITS2/P-Flow are better for more expressive speech.
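To illustrate what I mean by deterministic duration control, here is a generic FastSpeech2-style sketch (names and values are illustrative, not this repo's API):

```python
import torch

# A deterministic duration predictor emits one duration per input token, so
# pauses can be stretched by simply scaling the predictions before expansion.
log_durations = torch.tensor([0.3, 1.2, 2.0])  # predicted per-token log-durations
pause_scale = 1.5                              # > 1.0 lengthens every segment
durations = torch.clamp(torch.round(torch.exp(log_durations) * pause_scale), min=1).long()
# `durations` then drives the length regulator that repeats each encoder state.
```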

Oleksandr2505 commented 3 months ago

@w11wo

Thank you for the info!

w11wo commented 3 months ago

Update: I tried many different setups, but found an interesting change that finally led to decent performance.

The duration loss (which contributes to the overall loss) converges to a lower value faster if the model's vocab size is 178, the value derived from the default LJSpeech + espeak setup. In the logs above, where I used gruut as the phonemizer, I had set a smaller vocab size of 78 in the model config. Simply increasing the vocab size back to 178 led to better convergence of the duration loss, for some reason. However, I'm still unable to get a loss below 1, unlike the graphs posted in the README.
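For anyone debugging something similar, a hypothetical sanity check (import paths assumed from this repo's layout) is to confirm that the largest symbol ID your cleaner emits fits within the configured vocab size:

```python
# Hypothetical sanity check; import paths are assumptions based on this repo's layout.
from pflow.text import text_to_sequence
from pflow.utils.utils import intersperse

n_vocab = 178  # the value set in configs/model/pflow.yaml
ids = intersperse(text_to_sequence("A quick test sentence.", ["english_cleaners3"]), 0)
assert max(ids) < n_vocab, f"symbol id {max(ids)} is out of range for n_vocab={n_vocab}"
```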

patriotyk commented 3 months ago

But do you hear a difference in audio output between the models with vocab sizes 178 and 78? Or are only the metrics different?

w11wo commented 3 months ago

@patriotyk Yes, I do hear a massive difference in audio output. I think the duration loss values reflect the overall performance of the audio as well, given that the validation loss also decreases with the increased vocab size.

Eventually I could get the duration loss to ~1.0, which is much better than the initial experiments with the smaller vocab size.

The odd thing is why vocab size even impacts the duration loss -- I'm clueless about this.

patriotyk commented 3 months ago

@w11wo I found that configs/model/pflow.yaml sets n_vocab to 178. I think it should be 78 in your case, and maybe you will get an even better (below 1.0) duration loss, because all 78 symbols would be real and you would have no unused embeddings.

w11wo commented 3 months ago

@patriotyk Not really. What I meant in my comments above is that I started with a vocab size of 78, which led to the high loss graph attached at the top of this issue. Only by increasing it to 178 could I achieve a low duration loss of about 1.

patriotyk commented 3 months ago

Ah, I misunderstood you. By changing the vocab you meant also changing this config value; I thought you were changing the real symbol count without updating the config.

w11wo commented 3 months ago

Ah yeah. Whenever I changed the symbol list, I also made sure to update the vocab size in the config. It's weird that a model with only 78 actual symbols benefits from a larger configured vocab size.

Oleksandr2505 commented 1 month ago

@w11wo Hi, I just recalled you mentioned training your model with a 44.1kHz HiFi-GAN vocoder. Where did you get it, or how did you train it?

w11wo commented 1 month ago

Hi @Oleksandr2505. I ended up training a 44.1kHz HiFi-GAN from scratch, which took about 6 days on a 1xH100 GPU. You can find our fork of the HiFi-GAN training code/repo here.
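For anyone else training one: the main constraints in a 44.1kHz HiFi-GAN config are that the product of upsample_rates must equal the hop size, and that the mel parameters must match the acoustic model exactly. A hypothetical excerpt (values are illustrative, not the ones from our fork):

```json
{
  "sampling_rate": 44100,
  "num_mels": 80,
  "n_fft": 2048,
  "hop_size": 512,
  "win_size": 2048,
  "upsample_rates": [8, 8, 4, 2],
  "upsample_kernel_sizes": [16, 16, 8, 4]
}
```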