This shouldn't work, because the output of text2mel is not a full mel spectrogram but one downsampled 4 times in the time dimension. See this parameter: https://github.com/tugstugi/pytorch-dc-tts/blob/master/hparams.py#L13
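Roughly, the preprocessing keeps only every reduction_rate-th mel frame, so text2mel predicts a shortened mel and SSRN restores the time resolution. A minimal sketch of the idea (the exact shape/axis used in this repo may differ):

```python
import numpy as np

reduction_rate = 4                      # hparams.py: reduction_rate
mel = np.random.randn(800, 80)          # (frames, n_mels) full-resolution mel, dummy values
reduced = mel[::reduction_rate, :]      # what text2mel is trained to predict: every 4th frame
print(mel.shape, reduced.shape)         # (800, 80) (200, 80)
# SSRN maps the reduced mel back to full time resolution (and to linear-spectrogram bins),
# which is why the raw text2mel output can't go straight into WaveGlow.
```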
So basically, changing the reduction rate should work, right? Should I change the parameter to 0?
I think it should be 1. What you could also try is implementing an SSRN variant which upsamples only in the time dimension.
Thank you! Could you elaborate on SSRN time dimension upsampling suggestion?
Here is where SSRN upsamples to the linear spectrogram feature size: https://github.com/tugstugi/pytorch-dc-tts/blob/master/models/ssrn.py#L67 Maybe replace `f_prime` with `f`?
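If it helps, here is a rough sketch of what I mean by an SSRN variant that upsamples only in time but outputs `f` mel channels instead of `f_prime` linear-spectrogram channels (a toy module, not the repo's actual layer definitions):

```python
import torch
import torch.nn as nn

class MelUpsampler(nn.Module):
    """Toy SSRN-like module: upsample a reduced mel 4x in time,
    keeping the mel channel count instead of mapping to linear bins."""
    def __init__(self, f=80, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(f, hidden, kernel_size=1),
            nn.ConvTranspose1d(hidden, hidden, kernel_size=4, stride=2, padding=1),  # T/4 -> T/2
            nn.ReLU(),
            nn.ConvTranspose1d(hidden, hidden, kernel_size=4, stride=2, padding=1),  # T/2 -> T
            nn.ReLU(),
            nn.Conv1d(hidden, f, kernel_size=1),  # output f mel bins, not f_prime linear bins
        )

    def forward(self, reduced_mel):        # (B, f, T/4)
        return self.net(reduced_mel)       # (B, f, T)

mel = torch.randn(1, 80, 50)
print(MelUpsampler()(mel).shape)           # torch.Size([1, 80, 200])
```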
@tugstugi Thanks mate, very helpful!
@tugstugi Hey! Couldn't generate any promising result. Basically the output is a honking sound. I've retrained the text2mel model, cutting out the mel reduction part in the preprocessor and changing the hparams to:
hop_length = 256
win_length = 1024
max_N = 180 # Maximum number of characters.
max_T = 210 # Maximum number of mel frames.
e = 512 # embedding dimension
d = 256 # Text2Mel hidden unit dimension
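For context, at 22050 Hz these values translate roughly as follows (quick arithmetic, assuming LJSpeech-like audio; max_T now counts full-resolution frames since the reduction was removed):

```python
sr, hop_length, max_T = 22050, 256, 210
print(max_T * hop_length / sr)   # ~2.44 s of audio covered by max_T frames without reduction
```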
Any ideas?
You can check the generated alignments and spectrograms in TensorBoard to see whether it has learnt any alignment.
They're absolutely perfect. Attention is really good, and the mels are almost identical as well. And I've trained it for almost 24 hours on a Tesla V100; roughly 300k iterations are done.
What about the mel generated by https://github.com/tugstugi/pytorch-dc-tts/blob/master/synthesize.py#L110 ? You can post the result here.
What is also important: the generated mel range etc. should match the WaveGlow mel normalization. I can provide you an example for https://github.com/Rayhane-mamah/Tacotron-2:
def _fix_mel(mel):
    # undo the Tacotron-2 normalization back to dB, then convert dB to linear amplitude
    mel = audio._denormalize(mel, hparams)
    mel = audio._db_to_amp(mel + hparams.ref_level_db)
    # WaveGlow expects natural-log mel amplitudes
    C = 1
    mel = np.log(np.maximum(mel, 1e-5) * C)
    return mel
You should implement a similar method for this repo; otherwise WaveGlow can't generate meaningful audio.
Wow, I hadn't actually used the image export option, and this is the first time I've checked it. It's basically a blank black image. But while monitoring in TensorBoard, the mels actually look pretty good.
@deepconsc I have trained text2mel on the Mongolian dataset with reduction_rate=1 for only 10k steps and fed the output into the LJSpeech WaveGlow:
mbspeech_text2mel_ljspeech_waveglow.zip
It seems to be working. Only the first word is intelligible, but for 10k steps that is OK. Use this method before feeding the mel spectrogram into WaveGlow:
def _fix_mel(mel):
    # map this repo's [0, 1]-normalized mel back to dB, then to linear amplitude
    mel = (np.clip(mel, 0, 1) * hp.max_db) - hp.max_db + hp.ref_db
    mel = np.power(10.0, mel * 0.05)
    # WaveGlow expects natural-log mel amplitudes
    C = 1
    mel = np.log(np.maximum(mel, 1e-5) * C)
    return mel
Before saving the audio from WaveGlow, also undo the preemphasis with:
from scipy import signal
wav = signal.lfilter([1], [1, -hp.preemphasis], audio)  # inverse of the preemphasis filter
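For context, the preprocessing applies the forward preemphasis filter and the line above is simply its inverse. A minimal sketch, assuming hp.preemphasis is the usual ~0.97:

```python
import numpy as np
from scipy import signal

preemphasis = 0.97                                             # check hp.preemphasis in hparams.py
wav = np.random.randn(22050).astype(np.float32)                # one second of dummy audio

emphasized = signal.lfilter([1, -preemphasis], [1], wav)       # forward filter, applied in preprocessing
restored = signal.lfilter([1], [1, -preemphasis], emphasized)  # inverse filter, as above
print(np.allclose(wav, restored, atol=1e-4))                   # True (up to numerical error)
```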
You can also train text2mel without the preemphasis.
Have you already trained a WaveGlow model? If not, you can use the NVIDIA LJSpeech one. It should work without any problem for any language and any voice.
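If you use the pretrained one, loading it from torch hub looks roughly like this (from memory of the NVIDIA hub page, so double-check the entry-point name and arguments):

```python
import torch

# NVIDIA's pretrained LJSpeech WaveGlow via torch hub (entry-point name as I remember it)
waveglow = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_waveglow')
waveglow = waveglow.remove_weightnorm(waveglow)
waveglow = waveglow.to('cuda').eval()

with torch.no_grad():
    mel = torch.randn(1, 80, 200, device='cuda')   # (batch, n_mels, frames) dummy mel
    audio = waveglow.infer(mel, sigma=0.65)
print(audio.shape)
```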
Also, the black image above is only a bug in the PNG saving. Fix the line https://github.com/tugstugi/pytorch-dc-tts/blob/master/utils.py#L72 with:
from skimage import img_as_ubyte
imsave(file_name, img_as_ubyte(array))
@tugstugi Thank you for your help, I really appreciate it. The thing is, I've tried almost everything, including your advice, but the result is exactly the same.
And after fixing the image saving part, the same blank black image appears again.
I'm sharing the last part of my inference; it's basically ripping my head apart.
Y = _fix_mel(Y)
audio = waveglow.infer(Y.cuda(), sigma=0.65)
audio = audio.data.cpu().numpy()[0]
wav = signal.lfilter([1], [1, -hp.preemphasis], audio)
librosa.output.write_wav("12.wav", wav, sr=hp.sr)
I've fine-tuned your pretrained model for more than 10k steps and used WaveGlow's official checkpoint from torch hub. Anything on your mind?
Could you share your model? I can try to synthesize.
@tugstugi Here's the checkpoint: https://drive.google.com/open?id=1R50yR6Va6MJP0SO8GT0btwv15yhDyqNH
I changed the vocab; its size is 34. Just add those two chars: `,` and `!`.
Btw, how can I reach you for business inquiries?
I have plotted the mel generated from your model:
It can't generate anything. You can reach me at tugstugi AT gmail DOT com
@deepconsc
I have pushed to the waveglow branch a version with reduction_rate=4 and with SSRN upsampling only in the time direction.
You have to preprocess the audio files again. After that you can start to fine-tune from the old text2mel/SSRN models, which is faster:
python train-text2mel.py --dataset=ljspeech --warm-start=ljspeech-text2mel.pth
python train-ssrn.py --model=SSRNv2 --dataset=ljspeech --warm-start=ljspeech-ssrn.pth
After 5K steps, the generated audio is here: ljspeech_waveglow.zip
It seems the mel denormalization still has a problem, and the audio has really low frequency.
@deepconsc After setting f_max=8000.0 like in WaveGlow, the WaveGlow-generated speech sounds better: ljspeech_waveglow.zip If you change the audio normalization in this repo to an NVIDIA Tacotron2-like normalization, it should sound better.
Hi @tugstugi,
Your waveglow branch is a wonderful idea!
I am sure that pytorch-dc-tts is an unjustly undervalued project.
During the last 4-5 months I have tested many TTS solutions from big players like NVIDIA, ESPnet, and Mozilla, but only your project has allowed me to reach good results for my native language.
Below is a YouTube link to an experimental educational video with 4 voices synthesized with the help of pytorch-dc-tts.
https://www.youtube.com/watch?v=AiHr3h0QZC4
In total I have synthesized 8 voices with pytorch-dc-tts. The single problem is the very slow SSRN synthesis, which prevents wide usage. I would ask you to continue further development of your project; that way you could provide a really fast and effective alternative in modern TTS development for less widespread languages. In particular, I am interested in the possibility of integrating the Parallel WaveGAN vocoder (https://github.com/kan-bayashi/ParallelWaveGAN) with your project. A few days ago I asked the Parallel WaveGAN developer to help me feed the WaveGAN vocoder with your Text2Mel, but he declined, maybe considering that your Text2Mel does not have a large user base. Personally, I am ready to take full part in this development and in the promotion of your project.
My greetings to @deepconsc, who initiated this important process, from brotherly Georgia!
@ican24 the speed should be ok if you force the synthesize script to use the GPU. Currently it uses only the CPU, so it is much slower.
Thank you for the hint! I remain a fan of this project. I'll try to feed text2mel into the WaveGlow and/or WaveGAN vocoders. The Tacotron2 implementations are so costly in hardware and training time, sometimes with unpredictable results for less common languages.
Dear @tugstugi
Speech synthesis on the GPU is about 5 times faster than on the CPU: 200 characters within 8-9 seconds on an NVIDIA RTX 2080 12GB (single GPU). I may decrease the execution time with some cosmetic optimization, but I would like to know: is there a chance to accelerate it further with basic code modification? If yes, how much of a decrease in duration is possible?
Meanwhile I tried your waveglow branch, but train-text2mel.py with --warm-start failed with an
AttributeError: 'dict' object has no attribute 'state_dict'
error.
It seems my working text2mel model is too heavily modified.
Therefore I decided to train from scratch, but the training was interrupted with a "CUDA out of memory" error. Could you advise how to reduce GPU memory consumption? Is there a batch_size-like parameter in hparams.py?
My last question: are the following parameters sensitive for synthesis quality?
dropout_rate = 0.05 # dropout
text2mel_lr = 0.005
ssrn_lr = 0.0005
Thank you in advance!
Text2mel is slow because it uses the mels generated in previous steps in an autoregressive manner. Maybe batch synthesis would help you if you want to synthesize multiple sentences.
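By batch synthesis I mean padding several encoded sentences into one batch and running the autoregressive mel loop once for all of them instead of once per sentence. A toy sketch of the idea (not the repo's actual synthesize.py API):

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# three "sentences" of different lengths, already encoded as character ids
encoded = [torch.randint(1, 30, (n,)) for n in (42, 57, 33)]
L = pad_sequence(encoded, batch_first=True)           # (B, max_len), zero-padded

B, n_mels, max_frames = L.size(0), 80, 50
Y = torch.zeros(B, n_mels, 1)                          # one silent start frame per sentence
for t in range(max_frames):
    # stand-in for one text2mel decoder step; the real model attends over L here
    Y_next = torch.tanh(torch.randn(B, n_mels, 1))
    Y = torch.cat([Y, Y_next], dim=-1)                 # all B mels grow together, one frame per step

print(Y.shape)                                         # torch.Size([3, 80, 51])
```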
The batch size is hardcoded here: https://github.com/tugstugi/pytorch-dc-tts/blob/master/train-text2mel.py#L36
Without dropout, it will overfit to the training set because there is no data augmentation.
Thank you very much!
Reducing the batch_size from 64 to 32 helped.
I'll train from scratch on the waveglow branch and compare with the old results.
I am really interested in learning all the capabilities of this project; I expect it may be a good starting point for serious development of TTS solutions for less common languages.
Hi @tugstugi ,
I am trying to feed the text2mel output into the trained WaveGlow model. Below is a description of my steps:
python train-text2mel.py --dataset=ljspeech --warm-start=ljspeech-text2mel.pth
python train-ssrn.py --model=SSRNv2 --dataset=ljspeech --warm-start=ljspeech-ssrn.pth

audio = waveglow.infer(Y, sigma=0.65)
audio = audio.data.cpu().numpy()[0]
wav = signal.lfilter([1], [1, -hp.preemphasis], audio)
librosa.output.write_wav('samples/%d-wav.wav' % (i + 1), wav, sr=hp.sr)

and the waveglow feeding and voice generation piece in the loop:

Y = _fix_mel(Y)
audio = waveglow.infer(Y, sigma=0.65)
audio = audio.data.cpu().numpy()[0]
wav = signal.lfilter([1], [1, -hp.preemphasis], audio)
librosa.output.write_wav('samples/%d-wav.wav' % (i + 1), wav, sr=hp.sr)
The code works, but it generates a poor and very quickly playing wav file. You can hear it here: https://arm.ican24.net/wavesurfer1.php
save_to_png('samples/%d-mel.png' % (i + 1), Y[0, :, :])
fails with the error
ValueError: Images of type float must be between -1 and 1.
The same happens for the wav variable.
Surely my code is not perfect (just a sketch), but before making any further steps I need your advice to be sure that it is possible to reach the result this way. I also remember that NVIDIA Tacotron2-like normalization is required.
I am attaching synthesize.py and sample wav file. waveglow_synth.zip
If you need my checkpoints for a more serious investigation, you can download them here: text2mel-step-55K.pth https://drive.google.com/open?id=1-3jSdrJunFozWntyZwTfE5MmaG7LHmHQ ssrn-step-55K.pth: a bit later, it is still uploading.
Thank you in advance!
P.S. WaveGlow was tested with a custom language and dataset too. The quick speech is understandable despite the artifacts. So it could be a great achievement for this project if you help fix the issue in the sound generation algorithm. Then TTS developers could save a ton of time and nerves on speech synthesis without extremely costly hardware.
Your synthesize code looks wrong; do it this way:
_, Z = ssrn(Y.cuda())
audio = waveglow.infer(_fix_mel(Z).cuda(), sigma=0.65)
You fed the reduced mel directly into WaveGlow, which is why the generated WAV plays 4 times too fast.
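Putting the pieces together, the loop body would look roughly like this (a sketch only; `_fix_mel` is redone with torch ops so the mel can stay on the GPU, and the max_db/ref_db/preemphasis values are from memory of hparams.py, so double-check them):

```python
import torch
from scipy import signal

def _fix_mel_torch(mel, max_db=100.0, ref_db=20.0):
    # same idea as _fix_mel above, but in torch:
    # [0, 1]-normalized mel -> dB -> linear amplitude -> natural-log amplitude
    mel = torch.clamp(mel, 0.0, 1.0) * max_db - max_db + ref_db
    mel = torch.pow(10.0, mel * 0.05)
    return torch.log(torch.clamp(mel, min=1e-5))

# dummy stand-ins so the sketch runs; in the real script Z comes from `_, Z = ssrn(Y.cuda())`
# and `audio` from `waveglow.infer(_fix_mel_torch(Z), sigma=0.65)`
Z = torch.rand(1, 80, 400)                       # (batch, n_mels, full-resolution frames)
mel_for_waveglow = _fix_mel_torch(Z)
audio = torch.randn(400 * 256).numpy()           # pretend WaveGlow output, hop_length = 256

wav = signal.lfilter([1], [1, -0.97], audio)     # undo preemphasis (0.97 assumed for hp.preemphasis)
print(mel_for_waveglow.shape, wav.shape)
```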
Thank you very much! It works. A generated sample is here: https://arm.ican24.net/wavesurfer1.php
How can we tune the speech quality: train for more than 55K steps, or are there possible tricks like normalization and others?
@ican24 your link still has the fast WAV file. Could you upload the newly generated wav?
Well, you can preprocess the wav files like WaveGlow does and train again. But it needs many code changes, i.e. you have to use the logits instead of the outputs after the sigmoid layer, because the Tacotron/WaveGlow mels are not normalized between 0 and 1.
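To make the difference concrete, here is roughly how the two normalizations compare (a sketch; the exact constants live in hparams.py here and in NVIDIA's Tacotron2/WaveGlow code):

```python
import numpy as np

mel_amp = np.abs(np.random.randn(80, 200)) + 1e-3        # dummy linear-amplitude mel

# this repo: amplitude -> dB -> scaled/clipped to [0, 1], which is why the model ends in a sigmoid
max_db, ref_db = 100.0, 20.0
mel_db = 20.0 * np.log10(np.maximum(mel_amp, 1e-5))
mel_repo = np.clip((mel_db - ref_db + max_db) / max_db, 0.0, 1.0)

# Tacotron2 / WaveGlow: natural-log amplitudes, unbounded below, so a sigmoid output can't express
# them; that is why the logits (pre-sigmoid) would have to be trained against this target instead
mel_taco = np.log(np.maximum(mel_amp, 1e-5))

print(mel_repo.min(), mel_repo.max())   # stays inside [0, 1]
print(mel_taco.min(), mel_taco.max())   # negative log-amplitudes
```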
That sounds difficult to me. Frankly, my experience in ML is not very long, but I will try to move toward this goal, because language projects are so important for our initiative, which is entirely public (neither commercial nor state supported). This project could be a breakthrough in contemporary TTS development, where big players like Google, NVIDIA, Amazon and Facebook dominate, trying to press and push back others with costly hardware tools and conditions. I would appreciate it if you shared your ideas from time to time. Thank you
Hello @ican24, I am working on another language too, but I cannot make a good vocab file. Could you share your file so I can see how to make it? I am working on Chinese and some other languages.
Hey!
I've trained the text2mel part for mel generation for a couple hundred epochs. The model seems to be learning, and it gives reasonably good results on a different-language dataset when fed into SSRN (without fine-tuning the SSRN part). I'm trying to feed the text2mel output into a trained WaveGlow model, but it outputs just low-frequency noise, without any speech.
Any advice on how to post-process the generated mels before feeding them to WaveGlow?