tugstugi / pytorch-dc-tts

Text to Speech with PyTorch (English and Mongolian)
MIT License

Text2Mel input to WaveGlow outputs noisy audio file without any speech #10

Closed deepconsc closed 4 years ago

deepconsc commented 4 years ago

Hey!

I've trained the text2mel part for mel generation for a couple hundred epochs. The model seems to be learning, and it gives reasonably good results on a different-language dataset when fed to the SSRN (without fine-tuning the SSRN part). I'm trying to feed the text2mel output to a trained WaveGlow model, but it outputs just low-frequency noise, without any speech.

Any advice on how to post-process the generated mels to feed them to WaveGlow?

tugstugi commented 4 years ago

This shouldn't work because the output of text2mel is not a full mel spectrogram but a 4-times-downsampled one. See this parameter: https://github.com/tugstugi/pytorch-dc-tts/blob/master/hparams.py#L13
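For illustration only, a small sketch of what the reduction means for the shapes (this assumes the usual DC-TTS convention of keeping every reduction_rate-th frame; the exact preprocessing code is in the repo):

import numpy as np

reduction_rate = 4                             # hparams.py#L13
full_mel = np.zeros((80, 800))                 # hypothetical full mel: (n_mels, T)
reduced_mel = full_mel[:, ::reduction_rate]    # what text2mel predicts: (n_mels, T // 4)
print(reduced_mel.shape)                       # (80, 200) -- WaveGlow expects all T frames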

deepconsc commented 4 years ago

So basically changing the reduction rate should work, right? Should I change the parameter to 0?

tugstugi commented 4 years ago

I think it should be 1. What you can also try is to implement an SSRN variant which upsamples only in the time dimension.

deepconsc commented 4 years ago

Thank you! Could you elaborate on the SSRN time-dimension upsampling suggestion?

tugstugi commented 4 years ago

Here is where the upsampled linear-spectrogram feature size is set: https://github.com/tugstugi/pytorch-dc-tts/blob/master/models/ssrn.py#L67 Maybe replace f_prime with f?
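For illustration only, a minimal PyTorch sketch of that idea (hypothetical layer sizes, not the repo's actual SSRN code): keep the output feature size at the mel dimension f instead of expanding to the linear-spectrogram size f_prime, while keeping the two stride-2 deconvolutions that upsample 4x in time.

import torch
import torch.nn as nn

class TimeOnlySSRN(nn.Module):
    # Hypothetical sketch: upsample a reduced mel (n_mels, T/4) to (n_mels, T)
    # in time only, instead of also growing the feature axis to f_prime.
    def __init__(self, n_mels=80, c=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, c, kernel_size=3, padding=1), nn.ReLU(),
            nn.ConvTranspose1d(c, c, kernel_size=4, stride=2, padding=1), nn.ReLU(),  # T/4 -> T/2
            nn.ConvTranspose1d(c, c, kernel_size=4, stride=2, padding=1), nn.ReLU(),  # T/2 -> T
            nn.Conv1d(c, n_mels, kernel_size=1),
        )

    def forward(self, x):                   # x: (batch, n_mels, T/4)
        return torch.sigmoid(self.net(x))   # (batch, n_mels, T), still normalized to [0, 1]

x = torch.randn(1, 80, 50)
print(TimeOnlySSRN()(x).shape)              # torch.Size([1, 80, 200])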

deepconsc commented 4 years ago

@tugstugi Thanks mate, very helpful!

deepconsc commented 4 years ago

@tugstugi Hey! I couldn't generate any promising result; basically the output is a honking sound. I've retrained the text2mel model, cutting out the mel reduction part in the preprocessor and changing the hparams to:

hop_length = 256
win_length = 1024
max_N = 180  # Maximum number of characters.
max_T = 210  # Maximum number of mel frames.
e = 512  # embedding dimension
d = 256  # Text2Mel hidden unit dimension

Any ideas?

tugstugi commented 4 years ago

You can check the generated alignments and spectrograms in TensorBoard to see whether it has learnt any alignment.

deepconsc commented 4 years ago

They're absolutely perfect. The attention is really good, and the mels are almost identical as well. I've trained it for almost 24 hours on a Tesla V100; roughly 300k iterations are done.

tugstugi commented 4 years ago

What about the mel generated by https://github.com/tugstugi/pytorch-dc-tts/blob/master/synthesize.py#L110 ? You can post the result here.

tugstugi commented 4 years ago

What is also important: the range etc. of the generated mel should match the WaveGlow mel normalization. I can give you an example for https://github.com/Rayhane-mamah/Tacotron-2:

def _fix_mel(mel):
    # undo the [0, 1]/dB normalization used by Tacotron-2
    mel = audio._denormalize(mel, hparams)
    # convert dB back to a linear amplitude scale
    mel = audio._db_to_amp(mel + hparams.ref_level_db)
    C = 1
    # WaveGlow expects natural-log amplitude mels
    mel = np.log(np.maximum(mel, 1e-5) * C)
    return mel

You should implement a similar method for this repo. Otherwise WaveGlow can't generate meaningful audio.

deepconsc commented 4 years ago

Wow, I didn't actually use the image export option, and this is the first time I've checked it. Basically it's a blank black image. But while watching TensorBoard, the mels actually look pretty good. (attached: exported mel image)

tugstugi commented 4 years ago

@deepconsc I have trained text2mel on the Mongolian dataset with reduction_rate=1 for only 10k steps and fed it into the LJSpeech WaveGlow:

mbspeech_text2mel_ljspeech_waveglow.zip

It seems to be working. Only the first word is intelligible, but for 10k steps it is OK. Use this method before feeding the mel spectrogram into WaveGlow:

def _fix_mel(mel):
    # undo this repo's [0, 1] normalization back to dB
    mel = (np.clip(mel, 0, 1) * hp.max_db) - hp.max_db + hp.ref_db
    # convert dB back to a linear amplitude scale
    mel = np.power(10.0, mel * 0.05)
    C = 1
    # WaveGlow expects natural-log amplitude mels
    mel = np.log(np.maximum(mel, 1e-5) * C)
    return mel

Before saving the audio from WaveGlow, also undo the pre-emphasis with:

from scipy import signal
# inverse of the pre-emphasis filter applied during preprocessing
wav = signal.lfilter([1], [1, -hp.preemphasis], audio)

You can also train text2mel without the pre-emphasis.

Have you already trained a WaveGlow model? If not, you can use the NVIDIA LJSpeech one. It should work without any problem for any language and for any voice.
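Putting the pieces together, a minimal end-to-end sketch (assuming _fix_mel and hp from above, a loaded NVIDIA waveglow model, and a hypothetical full-resolution, [0, 1]-normalized mel Y_np of shape (n_mels, T) as a NumPy array):

import numpy as np
import torch
from scipy import signal
from scipy.io import wavfile

mel = _fix_mel(Y_np)                                    # back to log-amplitude mels
mel = torch.from_numpy(mel).float().unsqueeze(0)        # WaveGlow expects (batch, n_mels, T)
with torch.no_grad():
    audio = waveglow.infer(mel.cuda(), sigma=0.65)
audio = audio[0].cpu().numpy()
wav = signal.lfilter([1], [1, -hp.preemphasis], audio)  # undo the pre-emphasis
wavfile.write("sample.wav", hp.sr, (wav * 32767).astype(np.int16))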

Also, the above black image is just a bug in saving the PNG file. Fix the line https://github.com/tugstugi/pytorch-dc-tts/blob/master/utils.py#L72 with:

from skimage import img_as_ubyte
# convert the float array to uint8 before saving to avoid value-range issues
imsave(file_name, img_as_ubyte(array))

deepconsc commented 4 years ago

@tugstugi Thank you for your help, I really appreciate it. The thing is, I've tried almost everything, including your advice, but the result is exactly the same.

And after fixing the image-saving part, the same blank black image appears again.

I'm sharing the last part of my inference code; it's basically driving me crazy.

    Y = _fix_mel(Y)
    audio = waveglow.infer(Y.cuda(), sigma=0.65)
    audio = audio.data.cpu().numpy()[0]
    wav = signal.lfilter([1], [1, -hp.preemphasis], audio)
    librosa.output.write_wav("12.wav", wav, sr=hp.sr)

I've fine-tuned your pretrained model for more than 10k steps and used WaveGlow's official checkpoint from torch hub. Any thoughts?

tugstugi commented 4 years ago

Could you share your model? I can try to synthesize.

deepconsc commented 4 years ago

@tugstugi Here's the checkpoint: https://drive.google.com/open?id=1R50yR6Va6MJP0SO8GT0btwv15yhDyqNH

I changed the vocab; its size is 34 now, I just added the two characters , and !

Btw, how can I reach you for business inquiries?

tugstugi commented 4 years ago

I have plotted the mel generated from your model (image attached).

It can't generate anything. You can reach me at tugstugi AT gmail DOT com

tugstugi commented 4 years ago

@deepconsc

I have pushed to the waveglow branch a version with reduction_rate=4 and with SSRN upsampling only in the time direction.

You have to preprocess the audio files again. After that you can start fine-tuning from the old text2mel/SSRN models, which is faster.

After 5K steps, the generated audio is here: ljspeech_waveglow.zip

It seems the mel denormalization still has a problem, and the audio has really low frequencies.

tugstugi commented 4 years ago

@deepconsc After setting f_max=8000.0 as in WaveGlow, the generated speech sounds better: ljspeech_waveglow.zip If you change the audio normalization in this repo to NVIDIA Tacotron 2-style normalization, it should sound even better.
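For reference, a small sketch of the mel filter bank with that cutoff (the parameters are assumptions matching the NVIDIA Tacotron 2 / WaveGlow defaults; the key point is fmax=8000.0 instead of sr/2):

import librosa

mel_basis = librosa.filters.mel(sr=22050, n_fft=1024, n_mels=80, fmin=0.0, fmax=8000.0)
print(mel_basis.shape)  # (80, 513)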

ican24 commented 4 years ago

Hi @tugstugi,

Your waveglow branch is a wonderful idea! I am sure that pytorch-dc-tts is an unjustly undervalued project. During the last 4-5 months I have tested many TTS solutions from big players like NVIDIA, ESPnet, and Mozilla, but only your project allowed me to reach good results for my native language. Below is a YouTube link to an experimental educational video with 4 voices synthesized with the help of pytorch-dc-tts.
https://www.youtube.com/watch?v=AiHr3h0QZC4

In total I have synthesized 8 voices with pytorch-dc-tts. The single problem is the very low synthesis speed of SSRN, which prevents wide usage. I would ask you to continue further development of your project. This way you could provide a really quick and effective alternative in modern TTS development for less widespread languages. I am particularly interested in the possibility of integrating the Parallel WaveGAN vocoder with your project. A few days ago I asked the Parallel WaveGAN (https://github.com/kan-bayashi/ParallelWaveGAN) developer to help me feed the WaveGAN vocoder with your Text2Mel, but he declined, maybe considering that your Text2Mel is not widely used. Personally, I am ready to participate fully in the development and promotion of your project.

My greetings to @deepconsc from brotherly Georgia, who initiated this important process!

tugstugi commented 4 years ago

@ican24 the speed should be ok if you force the synthesize script to use the GPU. Currently it uses only the CPU, so it is much slower.

ican24 commented 4 years ago

Thank you for the hint! I remain a fan of this project. I'll try to feed text2mel into the WaveGlow and/or WaveGAN vocoders. The Tacotron 2 implementations are very hardware-costly and slow to train, sometimes with unpredictable results for less common languages.

ican24 commented 4 years ago

Dear @tugstugi

Speech synthesis with the GPU is about 5 times faster than with the CPU: 200 characters within 8-9 seconds on an NVIDIA RTX 2080 12GB (single GPU). I may decrease the execution time with some cosmetic optimization, but I would like to know: is there a chance to accelerate it further with basic code modifications? If yes, how much of a decrease in duration is possible?

Meanwhile I tried your waveglow branch, but train-text2mel.py with --warm-start failed with an
AttributeError: 'dict' object has no attribute 'state_dict' error. It seems my working text2mel model has been modified too much. Therefore I decided to train from scratch, but the training was interrupted with a "CUDA out of memory" error. Could you advise how to reduce GPU memory consumption? Is there a "batch_size"-like parameter in hparams.py?

My last question: are the following parameters sensitive for synthesis quality?

dropout_rate = 0.05  # dropout

# Text2Mel network options
text2mel_lr = 0.005

# SSRN network options
ssrn_lr = 0.0005

Thank you in advance!

tugstugi commented 4 years ago

text2mel is slow because it uses the generated mels from the previous steps in an autoregressive manner. Maybe batch synthesis would help you if you want to synthesize multiple sentences.

The batch size is hardcoded here: https://github.com/tugstugi/pytorch-dc-tts/blob/master/train-text2mel.py#L36
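A hypothetical sketch of the change at that line (the actual variable name in train-text2mel.py may differ):

batch_size = 32  # was 64; roughly halves the per-step GPU memory use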

Without dropout, it will overfit to the training set because there is no data augmentation.

ican24 commented 4 years ago

Thank you very much! Reducing batch_size from 64 to 32 helps. I'll train from scratch on the waveglow branch and compare with the old results.
I am really interested in learning all the capabilities of this project; it may be a good starting point for serious development of TTS solutions for less common languages.

ican24 commented 4 years ago

Hi @tugstugi ,

I am trying to feed the text2mel output to the trained WaveGlow model. Below is a description of my steps:

  1. Both pretrained models were trained on the waveglow branch from scratch up to 55K steps:

     python train-text2mel.py --dataset=ljspeech --warm-start=ljspeech-text2mel.pth
     python train-ssrn.py --model=SSRNv2 --dataset=ljspeech --warm-start=ljspeech-ssrn.pth

  2. waveglow_256channels_universal_v5.pt was downloaded.
  3. glow.py was copied from the NVIDIA waveglow project.
  4. WaveGlow usage code was added to synthesize.py:

     audio = waveglow.infer(Y, sigma=0.65)
     audio = audio.data.cpu().numpy()[0]
     wav = signal.lfilter([1], [1, -hp.preemphasis], audio)
     librosa.output.write_wav('samples/%d-wav.wav' % (i + 1), wav, sr=hp.sr)

and

the WaveGlow feeding and voice generation piece in the loop:

    Y = _fix_mel(Y)
    audio = waveglow.infer(Y, sigma=0.65)
    audio = audio.data.cpu().numpy()[0]
    wav = signal.lfilter([1], [1, -hp.preemphasis], audio)
    librosa.output.write_wav('samples/%d-wav.wav' % (i + 1), wav, sr=hp.sr)

The code works, but it generates a poor-quality and very quickly playing wav file. You can hear it here: https://arm.ican24.net/wavesurfer1.php

save_to_png('samples/%d-mel.png' % (i + 1), Y[0, :, :]) fails with the error ValueError: Images of type float must be between -1 and 1. The same happens for the "wav" variable.

Surely my code is not perfect (just a sketch), but before taking any further steps I need your advice to be sure that this approach can reach the result. I also remember that NVIDIA Tacotron 2-like normalization is required.

I am attaching synthesize.py and sample wav file. waveglow_synth.zip

If you need my checkpoints for a more serious investigation, you can download them from here: text2mel-step-55K.pth https://drive.google.com/open?id=1-3jSdrJunFozWntyZwTfE5MmaG7LHmHQ ssrn-step-55K.pth: a bit later, it is still uploading.

Thank you in advance!

P.S. WaveGlow was tested with a custom language and dataset too. The fast speech is understandable despite the artefacts. So it could be a great achievement for this project if you help to fix the issue in the sound generation algorithm. Then TTS developers could save a ton of time and nerves on speech synthesis without extremely costly hardware.

tugstugi commented 4 years ago

Your synthesize code looks wrong; do it this way:

_, Z = ssrn(Y.cuda())                                     # upsample the reduced mel in time with SSRN first
audio = waveglow.infer(_fix_mel(Z).cuda(), sigma=0.65)    # denormalize before feeding WaveGlow

You fed the reduced mel directly to WaveGlow, so the generated WAV plays 4 times too fast.

ican24 commented 4 years ago

Thank you very much! It works. Below is a generated sample: https://arm.ican24.net/wavesurfer1.php

How can we tune the speech quality: by training for more than 55K steps, or are there possible tricks like normalization and others?

tugstugi commented 4 years ago

@ican24 your link still has the fast WAV file. Could you upload the newly generated wav?

ican24 commented 4 years ago

1-wav.wav.zip

tugstugi commented 4 years ago

Well, you can preprocess the wav files like WaveGlow does and train again. But it needs many code changes, i.e. you have to use the logits instead of the outputs after the sigmoid layer, because the Tacotron/WaveGlow mels are not normalized between 0 and 1.
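For illustration only, a conceptual sketch of that change with hypothetical names (not the repo's actual classes): expose the raw logits from the last layer so the network can regress unnormalized log-mel targets instead of squashing everything into [0, 1] with the sigmoid.

import torch
import torch.nn as nn

class MelHead(nn.Module):
    # Hypothetical final projection layer of an SSRN-like network.
    def __init__(self, c=256, n_mels=80):
        super().__init__()
        self.proj = nn.Conv1d(c, n_mels, kernel_size=1)

    def forward(self, x):
        logits = self.proj(x)           # unbounded; could be trained on log-amplitude mels
        mel_01 = torch.sigmoid(logits)  # current repo-style target, normalized to [0, 1]
        return logits, mel_01

With Tacotron 2/WaveGlow-style targets you would train the logits directly against log-amplitude mels (e.g. the output of NVIDIA's TacotronSTFT) rather than training the sigmoid output against [0, 1]-normalized mels.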

ican24 commented 4 years ago

This sounds difficult to me. Frankly, my experience in ML is not very long, but I will try to move toward this aim, because language projects are so important for our initiative, which is entirely public (neither commercial nor state supported). This project could be a breakthrough in contemporary TTS development, where big players like Google, NVIDIA, Amazon, and Facebook dominate, trying to push back others with costly hardware requirements. I would appreciate it if you shared your ideas from time to time. Thank you

ghost commented 4 years ago

Hello @ican24, I am working on another language too, but I cannot make a good vocab file. Can you share your file so I can see how to make it? I am working on Chinese and some other languages.