jaywalnut310 / glow-tts

A Generative Flow for Text-to-Speech via Monotonic Alignment Search
MIT License

Question regarding fine-tuning #54

Closed. debasish-mihup closed this 2 years ago

debasish-mihup commented 3 years ago

If I train the HiFi-GAN vocoder using the fine-tuning approach, which uses Tacotron 2 to generate the mels in the first place, can I use regular Glow-TTS-generated mels with the HiFi-GAN vocoder trained this way and achieve the same quality during inference, or do I have to switch to Tacotron 2 altogether during inference too?

debasish-mihup commented 3 years ago

@jaywalnut310 Any input regarding this query?

debasish-mihup commented 2 years ago

Did this on my own. If anyone needs help, I can share my Colab and instructions.

michaellin99999 commented 2 years ago

Hi, would you be able to share your Colab on how you did this? We are trying to do the same thing, using Glow-TTS mels to train HiFi-GAN, and are running into trouble.

michaellin99999 commented 2 years ago

> Did this on my own. If anyone needs help, I can share my Colab and instructions.

Would you be able to share your Colab and instructions? We've been stuck on this for weeks.

debasish-mihup commented 2 years ago

@michaellin99999 It has been a long time; I don't recall the process exactly, and I currently don't have access to the infra to test out the script.

But hopefully the steps below will work.

1. Clone the official NVIDIA Tacotron 2 repo and copy the two attached files (finetunning.zip) into its main directory.
2. Run the extract_teacher_force.py script. A sample command with the various arguments is in the last line of the file. (A rough sketch of what the script does follows below.)

Let me know if this works out for you.
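
For illustration, here is roughly what the extraction loop inside that script does. I'm going from memory, so treat the exact NVIDIA tacotron2 entry points (create_hparams, load_model, TextMelLoader, parse_batch) as assumptions to verify against the repo:

```python
# Rough sketch of teacher-forced mel extraction against the NVIDIA/tacotron2
# repo layout; the imports below assume you run this from that repo's root.
import os
import torch
from hparams import create_hparams
from train import load_model
from data_utils import TextMelLoader, TextMelCollate

hparams = create_hparams()
model = load_model(hparams)
model.load_state_dict(torch.load("tacotron2_statedict.pt")["state_dict"])
model.cuda().eval()

loader = torch.utils.data.DataLoader(
    TextMelLoader("filelists/ljs_audio_text_train_filelist.txt", hparams),
    batch_size=1, collate_fn=TextMelCollate(hparams.n_frames_per_step))

os.makedirs("teacher_forced_mels", exist_ok=True)
with torch.no_grad():
    for i, batch in enumerate(loader):
        x, _ = model.parse_batch(batch)
        # Teacher forcing: the decoder is fed the ground-truth mel, so the
        # predicted mel has exactly as many frames as the original audio.
        _, mel_postnet, _, _ = model(x)
        torch.save(mel_postnet[0].cpu(), f"teacher_forced_mels/{i}.pt")
```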

michaellin99999 commented 2 years ago

> @michaellin99999 It has been a long time; I don't recall the process exactly, and I currently don't have access to the infra to test out the script.
>
> But hopefully the steps below will work.
>
> 1. Clone the official NVIDIA Tacotron 2 repo and copy the two attached files (finetunning.zip) into its main directory.
> 2. Run the extract_teacher_force.py script. A sample command with the various arguments is in the last line of the file.
>
> Let me know if this works out for you.

Is it the same method even with Glow-TTS?

And the official NVIDIA repo, is it this one? https://github.com/NVIDIA/tacotron2

We are trying to use Glow-TTS-generated mels to train/fine-tune HiFi-GAN, so I am a bit confused about why we use the Tacotron repo. Is the logic that the extract_teacher_force.py script is run to automatically generate mel spectrograms that can be fed into HiFi-GAN?

debasish-mihup commented 2 years ago

Yes, that is the correct repo.

To explain the fine-tuning in simple terms: there are two ways to generate the mels used during the training phase. You can take the actual audio files and use a deterministic algorithm to get the true mels, which can then be used for HiFi-GAN training. But this is the ideal scenario; during inference the predicted mels usually won't be that perfect. So instead we predict the mels from a pretrained Tacotron 2 model. Given its auto-regressive nature, with teacher forcing we can create mels that match the exact length of the input audio file, and since these are predicted mels, using them during the training phase helps HiFi-GAN learn to be a little more resilient to noisy (less than perfect) mels, and thus perform a little better during inference.

Once you generate the mels from the fine-tuning step, you can store them and use these stored mels during the HiFi-GAN training phase instead. You probably don't want to use these mels during the Glow-TTS training phase, though; there you are better off using the actual mels computed from the underlying audio files, since the goal of Glow-TTS training is to generate mels as close as possible to the originals.

Hope I was able to explain it to you. Let me know if you have any doubts.
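
For reference, the deterministic "true" mels are just an STFT plus a mel filterbank over the raw audio, no model involved. A minimal librosa sketch; the parameter values assume the usual 22050 Hz LJSpeech-style config, so adjust them to yours:

```python
import numpy as np
import librosa

# Deterministic ground-truth mel straight from the waveform.
wav, sr = librosa.load("sample.wav", sr=22050)
mel = librosa.feature.melspectrogram(
    y=wav, sr=sr, n_fft=1024, hop_length=256, win_length=1024,
    n_mels=80, fmin=0, fmax=8000, power=1.0)  # magnitude, Tacotron-style
log_mel = np.log(np.clip(mel, 1e-5, None))    # log compression
```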

michaellin99999 commented 2 years ago

> To explain the fine-tuning in simple terms: there are two ways to generate the mels used during the training phase. […]

Thank you for the explanation. Just to confirm I understand correctly: basically it's best to obtain mels from Tacotron and use those mels to train HiFi-GAN. Later, when I connect Glow-TTS with HiFi-GAN, other than the config and audio parameters, they should work in sequence to synthesize audio?

debasish-mihup commented 2 years ago

> Thank you for the explanation. Just to confirm I understand correctly: basically it's best to obtain mels from Tacotron and use those mels to train HiFi-GAN. Later, when I connect Glow-TTS with HiFi-GAN, other than the config and audio parameters, they should work in sequence to synthesize audio?

Yes. These stored mels are to be used only during the HiFi-GAN training phase. During inference there is no need for Tacotron.
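
At inference the chain is just Glow-TTS (text to mel) followed by the HiFi-GAN generator (mel to waveform). A hedged sketch of the chaining, assuming you run from the glow-tts directory and have copied HiFi-GAN's models.py in as hifigan_models.py (plus its env.py); the entry points follow the two repos' inference notebooks, but double-check them against your versions:

```python
import json
import torch
import models as glow_models            # glow-tts models.py
import utils                            # glow-tts utils.py
from text import text_to_sequence       # glow-tts text frontend
from text.symbols import symbols
from hifigan_models import Generator    # hifi-gan models.py, renamed
from env import AttrDict                # hifi-gan env.py

device = "cuda"

# Glow-TTS: text -> mel.
hps = utils.get_hparams_from_file("configs/base.json")
glow = glow_models.FlowGenerator(
    len(symbols), out_channels=hps.data.n_mel_channels, **hps.model).to(device)
utils.load_checkpoint("pretrained.pth", glow)
glow.decoder.store_inverse()
glow.eval()

seq = text_to_sequence("Hello world.", ["english_cleaners"])
x = torch.LongTensor(seq).unsqueeze(0).to(device)
x_lengths = torch.LongTensor([x.shape[1]]).to(device)
with torch.no_grad():
    (mel, *_), *_ = glow(x, x_lengths, gen=True,
                         noise_scale=0.667, length_scale=1.0)

# HiFi-GAN: mel -> waveform (no Tacotron anywhere at inference time).
h = AttrDict(json.load(open("config_v1.json")))
gen = Generator(h).to(device)
gen.load_state_dict(torch.load("generator_v1", map_location=device)["generator"])
gen.eval()
gen.remove_weight_norm()
with torch.no_grad():
    audio = gen(mel).squeeze().cpu().numpy()
```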

michaellin99999 commented 2 years ago

> Yes. These stored mels are to be used only during the HiFi-GAN training phase. During inference there is no need for Tacotron.

Thanks so much, will try this out and see.

michaellin99999 commented 2 years ago

One more question regarding the audio parameters: other than fmin and fmax, what other parameters are critical that I must keep the same across Glow-TTS and HiFi-GAN? Or should everything be the same?

debasish-mihup commented 2 years ago

> One more question regarding the audio parameters: other than fmin and fmax, what other parameters are critical that I must keep the same across Glow-TTS and HiFi-GAN? Or should everything be the same?

Given that my training audio data was 22050 Hz, I kept fmin and fmax unchanged. I don't remember any other critical parameter. Although, if I recall correctly, I did turn the add-noise flag off, as the output seemed marginally better to me without adding noise during the training phase. But this difference was very small and can be a matter of personal judgement. Basically, the default parameters seemed to work OK.
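
If you want to be extra safe, you can diff the audio sections of the two configs programmatically. A small sketch; the file paths and key names here are my assumptions based on the stock glow-tts configs/base.json and hifi-gan config_v1.json:

```python
import json

# Map between the two repos' naming for the same audio parameters
# (glow-tts keeps them under the "data" section of its config).
with open("glow-tts/configs/base.json") as f:
    glow = json.load(f)["data"]
with open("hifi-gan/config_v1.json") as f:
    hifi = json.load(f)

pairs = [
    ("sampling_rate",  "sampling_rate"),
    ("filter_length",  "n_fft"),
    ("hop_length",     "hop_size"),
    ("win_length",     "win_size"),
    ("n_mel_channels", "num_mels"),
    ("mel_fmin",       "fmin"),
    ("mel_fmax",       "fmax"),
]
for g_key, h_key in pairs:
    assert glow[g_key] == hifi[h_key], (
        f"mismatch: {g_key}={glow[g_key]} vs {h_key}={hifi[h_key]}")
print("audio parameters match")
```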

Wayne-wonderai commented 2 years ago

> @michaellin99999 It has been a long time; I don't recall the process exactly, and I currently don't have access to the infra to test out the script.
>
> But hopefully the steps below will work.
>
> 1. Clone the official NVIDIA Tacotron 2 repo and copy the two attached files (finetunning.zip) into its main directory.
> 2. Run the extract_teacher_force.py script. A sample command with the various arguments is in the last line of the file.
>
> Let me know if this works out for you.

Hello, I have tried these steps and generated mel-spectrograms by running your extract_teacher_force.py script. But when I put them in HiFi-GAN's folder and follow its fine-tuning steps (screenshot from 2022-01-06 17-36-19), I get this error (screenshot from 2022-01-06 14-33-42). What could be the possible reason for this? Thanks!

kafan1986 commented 2 years ago

There is a slight difference between the padding logic of Tacotron 2 and Glow-TTS. You can copy a few lines from Glow-TTS and replace the corresponding lines in the Tacotron 2 code to make the padding logic the same.
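
To illustrate the kind of mismatch (this is not the exact code from either repo): a centered STFT and a pre-padded, uncentered STFT in the HiFi-GAN style give frame counts that differ by one for the same audio, which is enough to trip the length assertions during HiFi-GAN fine-tuning:

```python
import torch

n_fft, hop = 1024, 256
wav = torch.randn(1, 22050)  # one second of dummy audio at 22050 Hz
window = torch.hann_window(n_fft)

# Centered STFT (implicit n_fft//2 reflect padding): len//hop + 1 frames.
centered = torch.stft(wav, n_fft, hop_length=hop, window=window,
                      center=True, return_complex=True)

# HiFi-GAN style: reflect-pad by (n_fft - hop)//2 on each side, then run
# with center=False, which gives exactly len//hop frames.
padded = torch.nn.functional.pad(
    wav.unsqueeze(1), ((n_fft - hop) // 2, (n_fft - hop) // 2),
    mode="reflect").squeeze(1)
uncentered = torch.stft(padded, n_fft, hop_length=hop, window=window,
                        center=False, return_complex=True)

print(centered.shape[-1], uncentered.shape[-1])  # 87 vs 86 frames
```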

Wayne-wonderai commented 2 years ago

Thanks for replying. Another question: do you have any idea how to verify the mel-spectrograms generated by your extract_teacher_force.py script? I tried using librosa to invert one back to a wav, but it sounds like nothing.
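
Edit: for reference, this is roughly the inversion I'm attempting. I'm assuming the saved mels are Tacotron-style log-compressed magnitude mels on the 22050 Hz config, in which case maybe a missing np.exp is the problem?

```python
import numpy as np
import librosa
import soundfile as sf
import torch

# Load one saved mel of shape (n_mels, frames) and undo the log compression
# first; otherwise Griffin-Lim sees near-zero magnitudes and the result
# sounds like nothing.
mel = torch.load("teacher_forced_mels/0.pt").numpy()
mel = np.exp(mel)

# Invert the mel filterbank and run Griffin-Lim for a rough waveform.
wav = librosa.feature.inverse.mel_to_audio(
    mel, sr=22050, n_fft=1024, hop_length=256, win_length=1024,
    fmin=0, fmax=8000, power=1.0)
sf.write("mel_check.wav", wav, 22050)
```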