seungwonpark / melgan

MelGAN vocoder (compatible with NVIDIA/tacotron2)
http://swpark.me/melgan/
BSD 3-Clause "New" or "Revised" License
632 stars 116 forks

Use this implementation for TTS engine #10

Open rishikksh20 opened 4 years ago

rishikksh20 commented 4 years ago

Can you create a separate branch for a TTS implementation? That's the ultimate goal for every neural vocoder. I will try to use this implementation with NVIDIA's Tacotron2, as the preprocessing for both networks is the same.

Note: I am already working on it, and will post the output samples here by tomorrow.

seungwonpark commented 4 years ago

How about creating a new repository for this, and adding this repo as a submodule in that repo?

For example, https://github.com/NVIDIA/tacotron2 uses the waveglow repo as a submodule.
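The submodule workflow being suggested can be sketched as below. This runs against local scratch repos so it works offline; in practice the submodule URL would be https://github.com/seungwonpark/melgan, the parent would be the new TTS repo, and the `vocoder` path name is just an example:

```shell
set -e
scratch=$(mktemp -d)

# stand-in for the melgan repo (in practice: the GitHub URL above)
git init -q "$scratch/melgan"
git -C "$scratch/melgan" -c user.email=me@example.com -c user.name=me \
    commit -q --allow-empty -m "vocoder code"

# parent TTS repo that vendors the vocoder as a submodule
git init -q "$scratch/tts"
cd "$scratch/tts"
git -c protocol.file.allow=always submodule add -q "$scratch/melgan" vocoder

# collaborators who clone the parent repo then populate the submodule with:
git submodule update --init --recursive
cat .gitmodules   # records the submodule path and URL
```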

rishikksh20 commented 4 years ago

@seungwonpark yeah sure, I will train MelGAN on GTA. I am also planning to train it on multiple voices, as I have a large (> 40 hrs) custom voice dataset.

seungwonpark commented 4 years ago

Thanks, that will be awesome. Let me know whenever you make that new repo public so that I can add a link to README.md here.

rishikksh20 commented 4 years ago

@seungwonpark check out the first sample: https://drive.google.com/drive/folders/1fPjLwMORsfilwPS9EAXUR_5ZKjUWbaIA?usp=sharing

Note: This is a non-GTA sample using NVIDIA's Tacotron2.

rishikksh20 commented 4 years ago

@seungwonpark official repo melgan

seungwonpark commented 4 years ago

Added the official repo to README.md, thanks.

rishikksh20 commented 4 years ago

Just for information: there's a new paper on Parallel WaveGAN. The best part is that it's fast, lightweight (1.4 M parameters only), and specially designed for TTS engines.

We propose Parallel WaveGAN, a distillation-free, fast, and small footprint waveform generation method using a generative adversarial network. In the proposed method, a non-autoregressive WaveNet is trained by jointly optimizing multi-resolution spectrogram and adversarial loss functions, which can effectively capture the time-frequency distribution of the realistic speech waveform. As our method does not require density distillation used in the conventional teacher-student framework, the entire model can be easily trained even with a small number of parameters. In particular, the proposed Parallel WaveGAN has only 1.44 M parameters and can generate 24 kHz speech waveform 28.68 times faster than real-time on a single GPU environment. Perceptual listening test results verify that our proposed method achieves 4.16 mean opinion score within a Transformer-based text-to-speech framework, which is comparative to the best distillation-based Parallel WaveNet system.

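The multi-resolution spectrogram loss the abstract describes can be sketched in PyTorch. The FFT/hop/window sizes below are typical values for this kind of loss, not taken from any particular implementation:

```python
import torch
import torch.nn.functional as F

def stft_magnitude(x, fft_size, hop_length, win_length):
    """Magnitude spectrogram of a batch of waveforms, shape (batch, samples)."""
    spec = torch.stft(x, fft_size, hop_length, win_length,
                      window=torch.hann_window(win_length),
                      return_complex=True)
    return spec.abs().clamp(min=1e-7)  # floor avoids log(0) below

def multi_resolution_stft_loss(pred, target,
                               resolutions=((512, 128, 512),
                                            (1024, 256, 1024),
                                            (2048, 512, 2048))):
    """Average of spectral-convergence + log-magnitude L1 terms over
    several STFT resolutions, as in the abstract above."""
    total = pred.new_zeros(())
    for fft_size, hop_length, win_length in resolutions:
        p = stft_magnitude(pred, fft_size, hop_length, win_length)
        t = stft_magnitude(target, fft_size, hop_length, win_length)
        sc = torch.norm(t - p) / torch.norm(t)       # spectral convergence
        mag = F.l1_loss(torch.log(p), torch.log(t))  # log STFT magnitude L1
        total = total + sc + mag
    return total / len(resolutions)

# identical waveforms give zero loss; different ones give a positive loss
x = torch.randn(2, 8192)
print(multi_resolution_stft_loss(x, x).item())  # 0.0
```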
binarythinktank commented 4 years ago

After doing the training, how do I use the result to do TTS?
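In general terms: the text model (e.g. Tacotron2) turns text into an (n_mels, frames) mel spectrogram, and the trained MelGAN generator upsamples each mel frame into hop-length (256) waveform samples. A self-contained toy stand-in for that shape contract is sketched below; this is NOT this repo's actual Generator class or checkpoint format, it only mimics the 8 * 8 * 2 * 2 = 256 transposed-convolution upsampling:

```python
import torch
import torch.nn as nn

class ToyGenerator(nn.Module):
    """Toy mel-to-waveform generator: strides multiply to the 256-sample hop.
    A real MelGAN generator also has residual stacks and weight norm."""
    def __init__(self, mel_channels=80):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose1d(mel_channels, 32, kernel_size=16, stride=8, padding=4),
            nn.LeakyReLU(0.2),
            nn.ConvTranspose1d(32, 16, kernel_size=16, stride=8, padding=4),
            nn.LeakyReLU(0.2),
            nn.ConvTranspose1d(16, 8, kernel_size=4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.ConvTranspose1d(8, 1, kernel_size=4, stride=2, padding=1),
            nn.Tanh(),  # waveform samples in [-1, 1]
        )

    def forward(self, mel):          # mel: (batch, 80, frames)
        return self.net(mel)         # audio: (batch, 1, frames * 256)

vocoder = ToyGenerator().eval()      # real usage: load trained weights here
mel = torch.randn(1, 80, 100)        # e.g. Tacotron2's mel output
with torch.no_grad():
    audio = vocoder(mel)
print(audio.shape)                   # torch.Size([1, 1, 25600])
```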

rishikksh20 commented 4 years ago

@seungwonpark give this paper a look: https://arxiv.org/pdf/2005.05106.pdf. It seems very promising; it uses two different model strategies.

seungwonpark commented 4 years ago

@rishikksh20 Thanks for letting me know. It's a bit disappointing to see that the proposed model was only tested at 16 kHz. However, their ablation studies are persuasive and their approaches look promising to me, too. Let's wait for them to open-source the code.

seungwonpark commented 4 years ago

By the way, I've used MelGAN for our new VC paper: https://arxiv.org/abs/2005.03295. Please have a look if you're interested in voice conversion.

rishikksh20 commented 4 years ago

@seungwonpark Hope you are doing well! I have implemented the Multi-band MelGAN paper, using this repo as a base. Please give it a look: https://github.com/rishikksh20/melgan. Though I have just finished coding and haven't tested it yet.

xuexidi commented 3 years ago

@rishikksh20 Hi! What is the final result of training melgan with GTA?

I would like to train melgan in GTA mode (combined with SV2TTS).

But I'm not sure whether this is right: ignore preprocess.py in melgan, and just feed the mel spectrograms from the TTS (SV2TTS) model to retrain melgan (GTA mode).

Could you please give me some advice? Thanks a lot!

rishikksh20 commented 3 years ago

@xuexidi It doesn't give good results. I trained the model for around 1.5 million steps, but normal mel training gives better results than GTA. Though one option which I haven't tried is to train MelGAN for the first 200k steps on GTA and then continue training with normal mels; this technique is reported to work in the HiFi-GAN paper.

xuexidi commented 3 years ago

@rishikksh20

Thanks for your quick reply!!!

And thanks for your technical advice!!!

But I'm still confused about one thing: before I start training melgan as a vocoder for TTS, there are two options: 1) change the mel extraction function of melgan to match the TTS, or 2) change the mel extraction function of the TTS to match melgan.

From your experience, which performs better? And which option did you choose in your experiments? I'm pretty confused about that :(

rishikksh20 commented 3 years ago

@xuexidi It doesn't matter; pick either pre-processing (mel extraction) and use the same one for both TTS and melgan. The mels the TTS is trained on must always be the same as the ones melgan is trained on.
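The point above can be made concrete: both pipelines should call one shared extractor rather than keeping two copies of the parameters. A toy sketch, where the config values (1024 FFT, 256 hop, 80 mels) are common defaults rather than necessarily this repo's, and the "mel" is just STFT magnitude pooled into bands instead of a real mel filterbank:

```python
import torch

# ONE shared mel-extraction config, imported by both the TTS and the vocoder
# training code -- hypothetical values, check your repos' hparams.
MEL = dict(n_fft=1024, hop_length=256, win_length=1024, n_mels=80)

def mel_spectrogram(wav, n_fft, hop_length, win_length, n_mels):
    """Toy mel: STFT magnitude averaged into n_mels linear bands.
    A real pipeline applies a mel filterbank and log compression."""
    spec = torch.stft(wav, n_fft, hop_length, win_length,
                      window=torch.hann_window(win_length),
                      return_complex=True).abs()          # (freq, frames)
    freq = spec.size(0)
    trimmed = spec[: (freq // n_mels) * n_mels]           # drop leftover bins
    # average groups of adjacent frequency bins down to n_mels bands
    return trimmed.reshape(n_mels, -1, trimmed.size(-1)).mean(dim=1)

wav = torch.randn(22050)                        # one second of fake audio
mel_for_tts = mel_spectrogram(wav, **MEL)       # what Tacotron2 trains on
mel_for_vocoder = mel_spectrogram(wav, **MEL)   # what MelGAN trains on
assert torch.equal(mel_for_tts, mel_for_vocoder)
print(mel_for_tts.shape)                        # (80, frames)
```

If the two models instead each keep their own copy of these parameters, any drift between the copies silently degrades synthesis quality, which is exactly the mismatch being warned about here.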

xuexidi commented 3 years ago

@rishikksh20 Thank you very much! I got it! I'll keep the mel format of the TTS the same as melgan's. Thank you!