mozilla / TTS

:robot: :speech_balloon: Deep learning for Text to Speech (Discussion forum: https://discourse.mozilla.org/c/tts)

[Discussion] WaveGrad #518

Closed. george-roussos closed this issue 3 years ago.

george-roussos commented 3 years ago

This is not an issue and is more of a discussion. I read the WaveGrad paper today (which may be found here) and listened to the samples here, which sound very good. There seems to be an open source implementation already here with great progress. Has anyone read the paper or used this implementation?

erogol commented 3 years ago

In the repo it says 2 days of training is enough for convergence, which is a huge time reduction; however, it is unclear how fast it is on a CPU.

I skimmed the paper quickly. Diffusion probability models and Langevin dynamics are topics I am not really familiar with, but the main idea looks similar to normalizing flows, with its iterations having a resemblance to the flow steps.

george-roussos commented 3 years ago

I saw the same! 2 days, so it seems it should be even faster than the GANs, and no finetuning is needed if you want fewer iterations for faster inference. I don't think it can use a CPU though, because the repo mentions the grid search on CPU is slow, so I would imagine even 6 iterations (the lowest quality) would take longer on CPU. I also tried to see if it works with spectrograms synthesized by Mozilla TTS but didn't get far. But even the 6-iteration sample sounds nice.

erogol commented 3 years ago

CPU is a blocker for me to consider it, since MB-MelGAN is already faster than real-time on a CPU. But 2 days of training is really impressive for a vocoder. Now I'm waiting to see who will write the paper using a GAN with this model and report better results :)

george-roussos commented 3 years ago

I agree. They also mention a 2080ti for inference, so I guess it is just too expensive.

Edresson commented 3 years ago

@freds0 and I added support for the TTS audio configuration on this fork here. We adapted the upsample factors of the model to our hop_length. We are training the model on the train-clean-100 and train-clean-360 subsets of the LibriTTS dataset and hope to have good results in the coming days. We intend to provide the checkpoint. On the other hand, we are also trying to train WaveGlow with the TTS audio configuration, so we would have 2 SOTA vocoders.

george-roussos commented 3 years ago

Have you tried inference with 6 iterations and a number between 25 and 50? How does it sound? Is it fast?

Edresson commented 3 years ago

Have you tried inference with 6 iterations and a number between 25 and 50? How does it sound? Is it fast?

I tested an 11-second audio clip on Google Colab with 6 iterations:

GPU: 0.41608357429504395 seconds
CPU: 77.3947184085846 seconds
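(For an 11-second clip, that works out to a real-time factor of roughly 0.42 / 11 ≈ 0.04 on GPU and 77.4 / 11 ≈ 7.0 on CPU.)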

Here is a sample of the model trained for 1 day, in the TTS audio configuration

george-roussos commented 3 years ago

It sounds nice for 6 iterations. And the other repo has also been updated to show inference on CPU for 6 iterations.

Is it plugged into TTS the same way as WaveRNN?

Edresson commented 3 years ago

@george-roussos Yes, the integration is very simple. I used this implementation rather than the more active one because it provided a multispeaker model for transfer learning.

I synthesized some new samples with our best multispeaker model. These voices were generated with random vectors, that is, they are artificial voices. The other samples were from a model that was not very good.

wavegrad-english-sample-with-our-better-model.zip

george-roussos commented 3 years ago

Do you have a notebook somewhere that you run it?

Edresson commented 3 years ago

You can download the current WaveGrad checkpoint at the link: https://drive.google.com/uc?id=1Te-9YaUirTGOa2syq2uQwBNU1vbDlUN0 (gdown https://drive.google.com/uc?id=1Te-9YaUirTGOa2syq2uQwBNU1vbDlUN0)

You must install WaveGrad with the following command: pip install git+https://github.com/freds0/wavegrad.git

Then, just use the wavegrad_predict function by passing the transposed Mel spectrogram (spec.T) and the device.

import os
import time

import numpy as np
import torch
from tqdm import tqdm

from wavegrad.params import AttrDict, params as WAVEGRAD_PARAMS
from wavegrad.model import WaveGrad

# wavegrad checkpoint location
WAVEGRAD_VOCODER_PATH = './'

if os.path.exists(f'{WAVEGRAD_VOCODER_PATH}/weights.pt'):
  checkpoint = torch.load(f'{WAVEGRAD_VOCODER_PATH}/weights.pt')
else:
  checkpoint = torch.load(WAVEGRAD_VOCODER_PATH)

WAVEGRAD_MODEL = WaveGrad(AttrDict(WAVEGRAD_PARAMS))
WAVEGRAD_MODEL.load_state_dict(checkpoint['model'])
WAVEGRAD_MODEL.eval()
print("WaveGrad loaded!")

WAVEGRAD_PARAMS['noise_schedule'] = [4.47739327e-06, 4.47739327e-05, 9.49513587e-04, 9.49513587e-03, 9.49513587e-02, 4.47739327e-01]  # get this with the noise schedule search

def wavegrad_predict(spectrogram, device=torch.device('cuda')):
  start = time.time()
  # Lazy load model.
  model = WAVEGRAD_MODEL.to(device)
  model.params.override(WAVEGRAD_PARAMS)
  with torch.no_grad():
    beta = np.array(model.params.noise_schedule)
    alpha = 1 - beta
    alpha_cum = np.cumprod(alpha)

    # Expand rank 2 tensors by adding a batch dimension.
    if len(spectrogram.shape) == 2:
      spectrogram = spectrogram.unsqueeze(0)
    spectrogram = spectrogram.to(device)

    # Start from Gaussian noise, one audio sample per mel frame * hop_length.
    audio = torch.randn(spectrogram.shape[0], model.params.hop_length * spectrogram.shape[-1], device=device)
    noise_scale = torch.from_numpy(alpha_cum**0.5).float().unsqueeze(1).to(device)

    # Iteratively denoise, walking the noise schedule backwards.
    for n in tqdm(range(len(alpha) - 1, -1, -1)):
      c1 = 1 / alpha[n]**0.5
      c2 = (1 - alpha[n]) / (1 - alpha_cum[n])**0.5
      audio = c1 * (audio - c2 * model(audio, spectrogram, noise_scale[n]).squeeze(1))
      if n > 0:
        # Re-inject noise on every step except the last.
        noise = torch.randn_like(audio)
        sigma = ((1.0 - alpha_cum[n-1]) / (1.0 - alpha_cum[n]) * beta[n])**0.5
        audio += sigma * noise
      audio = torch.clamp(audio, -1.0, 1.0)
  print('Wavegrad Time', time.time() - start)
  return audio.cpu().numpy(), model.params.sample_rate
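For example, a hypothetical end-to-end usage sketch ('spectrogram.npy', the output path, and the soundfile writer are my own choices, not part of the original instructions):

import numpy as np
import soundfile as sf  # any wav writer works

# Vocode a mel spectrogram saved from Mozilla TTS (shape [num_mels, T],
# i.e. already transposed as described above).
spec = np.load('spectrogram.npy')
audio, sr = wavegrad_predict(torch.FloatTensor(spec), device=torch.device('cuda'))
sf.write('out.wav', audio.squeeze(0), sr)
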
george-roussos commented 3 years ago

Thanks, I will try it! But do you only extract the spectrogram from TTS?

spectrogram = torch.FloatTensor(mel_postnet_spec.T)
np.save("spectrogram.npy", spectrogram)

Edresson commented 3 years ago

That's right.

The model is compatible with the following configuration:

 // AUDIO PARAMETERS
    "audio":{
        // Audio processing parameters
        "num_mels": 80,         // size of the mel spec frame.
        "fft_size": 1024,       // number of stft frequency levels. Size of the linear spectrogram frame.
        "sample_rate": 22050,   // DATASET-RELATED: wav sample-rate. If different than the original data, it is resampled.
        "win_length": 1024,     // stft window length in samples.
        "hop_length": 256,      // stft window hop-length in samples.
        "frame_length_ms": null,  // stft window length in ms. If null, 'win_length' is used.
        "frame_shift_ms": null,   // stft window hop-length in ms. If null, 'hop_length' is used.
        "preemphasis": 0.98,    // pre-emphasis to reduce spec noise and make it more structured. If 0.0, no pre-emphasis.
        "min_level_db": -100,   // normalization range
        "ref_level_db": 20,     // reference level db, theoretically 20db is the sound of air.
        "power": 1.5,           // value to sharpen wav signals after GL algorithm.
        "griffin_lim_iters": 60,// #griffin-lim iterations. 30-60 is a good range. The larger the value, the slower the generation.
        "stft_pad_mode": "reflect",
        // Normalization parameters
        "signal_norm": true,    // normalize the spec values in range [0, 1]
        "symmetric_norm": true, // move normalization to range [-1, 1]
        "max_norm": 4.0,        // scale normalization to range [-max_norm, max_norm] or [0, max_norm]
        "clip_norm": true,      // clip normalized values into the range.
        "mel_fmin": 0.0,        // minimum freq level for mel-spec. ~50 for male and ~95 for female voices. Tune for dataset!!
        "mel_fmax": 8000.0,     // maximum freq level for mel-spec. Tune for dataset!!
        "spec_gain": 20.0,
        "do_trim_silence": false,  // enable trimming of silence of audio as you load it. LJspeech (false), TWEB (false), Nancy (true)
        "trim_db": 60           // threshold for trimming silence. Set this according to your dataset.
    },

george-roussos commented 3 years ago

Coolio, it worked yay! Although it gives me strong vocal fry in low frequencies, so I wonder if it is my TTS or WaveGrad. Oh well!

erogol commented 3 years ago

thanks @Edresson

WeberJulian commented 3 years ago

Coolio, it worked yay! Although it gives me strong vocal fry in low frequencies, so I wonder if it is my TTS or WaveGrad. Oh well!

You can see if the noise comes from your TTS or the vocoder by computing the mel spectrogram with the audio processor and then passing it through the vocoder.

george-roussos commented 3 years ago

Coolio, it worked yay! Although it gives me strong vocal fry in low frequencies, so I wonder if it is my TTS or WaveGrad. Oh well!

You can see if the noise comes from your TTS or the vocoder by computing the mel spectrogram with the audio processor and then passing it through the vocoder.

This is what I am doing right now. I extract the spectrogram from TTS with synthesize.py and then I pass it through WaveGrad. Are you on about something different?

WeberJulian commented 3 years ago

No, compute the mel directly from the wav sample, without using the TTS model. That way you're only testing the vocoder, independently of the TTS model.
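A minimal sketch of that test, assuming the Mozilla TTS AudioProcessor and the wavegrad_predict helper from earlier in this thread ('config' is the loaded TTS config and the wav path is a placeholder):

import torch
from TTS.utils.audio import AudioProcessor

# Build the processor from the same audio config the vocoder was trained with.
ap = AudioProcessor(**config['audio'])
wav = ap.load_wav('ground_truth.wav')
mel = ap.melspectrogram(wav)  # shape: [num_mels, T]
audio, sr = wavegrad_predict(torch.FloatTensor(mel))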

george-roussos commented 3 years ago

Oh you mean a ground truth sample? Good idea! I will try it. Thanks :)

Edresson commented 3 years ago

WaveGrad is still training; I will keep it going for some more time. Afterwards I will update the checkpoint here.

george-roussos commented 3 years ago

Thanks Edresson. What are your impressions? I think it is very promising quality-wise, but it was kind of slow when I tried to synthesize a 20-sec spectrogram on a V100. I also only trained for one day on one speaker and I used 1000 iterations.

I think I will try PWGAN with LibriTTS next week using the configuration you used for WaveGrad and maybe Eren's resnet changes.

Edresson commented 3 years ago

Did you see the Full-Band MelGAN (universal) that Eren released recently? It's really fast.

I'm only training WaveGrad because of the quality :).

george-roussos commented 3 years ago

I did! It's very good. But there are many differences to my TTS and I think ParallelWaveGAN sounds a bit more natural.

george-roussos commented 3 years ago

I ran some tests with FullBand MelGAN universal, WaveGrad universal (thanks @erogol and @Edresson) and a ParallelWaveGAN (single speaker) I trained. Setup is CPU and the same sentence:

FullBand Melgan:

Run-time: 13.553260803222656
Real-time factor: 0.5267899864418801
Time per step: 2.3890714941122688e-05

ParallelWaveGAN:

Run-time: 58.56965398788452
Real-time factor: 2.7854093622135445
Time per step: 0.00012632243959543506

WaveGrad:

Run-time: 131.25670504570007
Real-time factor: 2894210.104429722
Time per step: 131.25670289993286

WaveGrad ran with Edresson's final model and a custom noise schedule for my dataset, 6 iterations. In order of preference in terms of quality, I choose

WaveGrad > ParallelWaveGAN > FullBand MelGAN. However, I think PWGAN really is the sweet spot between quality and speed.
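(For reference, the real-time factor above is just run-time divided by the duration of the generated audio; a quick sketch:)

def real_time_factor(run_time_s, n_samples, sample_rate=22050):
    # RTF < 1 means faster than real time.
    return run_time_s / (n_samples / sample_rate)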

I ran this on a Mac and torchaudio kept throwing an error about a library. If you run into the same problem you can run

sudo install_name_tool -change @rpath/libc++.1.dylib /usr/lib/libc++.1.dylib /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/torchaudio/_torchaudio.so

Edresson commented 3 years ago

This model is universal, that is, it works for speakers not seen in training. It is a modified version of the WaveGrad vocoder. As our TTS model was trained using a hop length of 256, instead of 300 as reported in the original vocoder paper, we had to change the upsampling factors of WaveGrad's five upsampling blocks from 5, 5, 3, 2, 2 to 4, 4, 4, 2, 2. In addition, we trained WaveGrad with a sample rate of 22 kHz instead of 24 kHz. The model was trained for 832k steps, with a batch size of 200 and a learning rate of 0.0002, using the train-clean-100 and train-clean-360 subsets of the LibriTTS dataset.
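(The constraint here is that the product of the upsampling factors must equal the hop length, since WaveGrad upsamples each mel frame back to hop_length audio samples:)

import numpy as np

assert np.prod([4, 4, 4, 2, 2]) == 256  # new factors match hop_length 256
assert np.prod([5, 5, 3, 2, 2]) == 300  # original factors match hop_length 300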

The speed of waveform generation depends on the number of iterations in the noise schedule. In my initial tests, 6 iterations gives a good compromise between quality and speed. I adjusted the noise schedule on the VCTK dataset (I did this because our multi-speaker model is currently trained on this dataset). Depending on the differences in your dataset, you can adjust the noise schedule for it to obtain better quality and less noise in the synthesis.
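As a naive starting point before running the schedule search, one can build a simple schedule by hand (a sketch only; the searched schedules linked below will generally sound better):

import numpy as np

# A hand-made 6-step beta schedule between the usual training bounds.
num_iters = 6
noise_schedule = np.linspace(1e-6, 1e-2, num_iters).tolist()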

Below is the table with the noise schedule vectors:

Number of iters | Numpy array
4  | https://drive.google.com/file/d/1o6_2LFPObxTdCE9RlDfqJZ0XfYDwUWnd/view?usp=sharing
6  | https://drive.google.com/file/d/1ni_yWbNypPkl6a-rOwIADrqMmhBp6X9K/view?usp=sharing
10 | https://drive.google.com/file/d/1R6JlgoQ3SgyEXN-7DoKzN4NS7lnsmlTK/view?usp=sharing
20 | https://drive.google.com/file/d/1TCAdo-BDUmm9Gn6Gcy5sl-awvWoUrMKx/view?usp=sharing
50 | https://drive.google.com/file/d/1tFIUbpdK45VPDNAOEIhwpfrTP_xjYwnO/view?usp=sharing

Colab example notebook for use with Mozilla TTS: https://colab.research.google.com/drive/1oBFvgVFbR0g8f_dJ4eDW5dbT-Fk7H0kc?usp=sharing

Github repository: https://github.com/freds0/wavegrad

WaveGrad's checkpoint and tensorboard logs: https://drive.google.com/drive/folders/1ZfJ_Bb2y_VHZYBZof6HVbnPw4mlDOt6v?usp=sharing

thorstenMueller commented 3 years ago

Hello.

I'm experimenting with the WaveGrad implementation at https://github.com/ivanvovk/WaveGrad/. Thanks to the kind support of @ivanvovk, training is currently running on my public "thorsten" German dataset.

Parameters have been adjusted to match our taco2 training params. Is the model (once it's ready) compatible with Mozilla TTS, or do I need to use the repo from @freds0 for Mozilla TTS usage?

Here are the adjusted params of WaveGrad:

"factors": [4, 4, 4, 2, 2],
"data_config": {
        "sample_rate": 22050,
        "n_fft": 1024,
        "win_length": 1024,
        "hop_length": 256,
        "f_min": 0.0,
        "f_max": 8000,
        "n_mels": 80
    },
"batch_size": 48,
        "segment_length": 7168,
        "lr": 1e-3,
"use_fp16": false,

Training is currently at epoch 28 and the samples sound not too bad.

(Tensorboard screenshots: mels, scalars)

Edresson commented 3 years ago

@thorstenMueller That depends: we use some extra normalizations on the mel spectrograms, so without those normalizations it might not work well. I made a universal model available above; do you have any special reason to train your own?
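For context, the extra normalization in Mozilla TTS is roughly the following range normalization (a simplified sketch from memory; the exact code lives in the AudioProcessor in TTS/utils/audio.py):

import numpy as np

def normalize_mel(S, min_level_db=-100, max_norm=4.0, symmetric=True):
    # S: log-magnitude mel spectrogram in dB, already shifted by ref_level_db.
    S_norm = np.clip((S - min_level_db) / -min_level_db, 0, 1)  # map to [0, 1]
    if symmetric:
        S_norm = (2 * max_norm) * S_norm - max_norm  # map to [-max_norm, max_norm]
    else:
        S_norm = max_norm * S_norm
    return S_norm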

george-roussos commented 3 years ago

Edresson's model works nicely for unseen speakers. However, if you do have a spare GPU, I may also suggest checking out HiFiGAN. I have been getting the best results using a GAN. There's an initial implementation on my git and one on Edresson's too and they both support Mozilla TTS spectrograms.

thorstenMueller commented 3 years ago

Thanks @Edresson and @george-roussos for your quick feedback.

I started my own training because I thought all audio params must exactly match our taco2 model training. Maybe this was a misunderstanding. Is there an easy way to use your above checkpoints with Mozilla TTS and our dataset (I haven't found anything WaveGrad related here: https://github.com/mozilla/TTS/tree/dev/TTS/vocoder/configs)? Otherwise I'll be playing around with your notebook.

If I got @domcross right, he's planning to play around with HifiGAN on our German dataset.

thorstenMueller commented 3 years ago

Hi @Edresson. I copied your notebook and adjusted it (as far as I understood your logic) for usage with my taco2 single speaker model. But the results sound very scratchy. Would you mind taking a look at my adjusted notebook? Maybe you have an idea what the problem might be.

https://colab.research.google.com/drive/1uZHCcmLoNckU1dDgkKspg3qXct9oRW4W?usp=sharing

lexkoro commented 3 years ago

@thorstenMueller the WaveGrad model was trained with preemphasis=0.98, your tts model has to match those values
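(For anyone wondering, pre-emphasis is just a first-order high-pass filter applied to the waveform before the spectrogram is computed; a sketch of the usual formulation, which I believe matches what the audio processor does:)

from scipy import signal

def apply_preemphasis(wav, coef=0.98):
    # y[t] = x[t] - coef * x[t-1]; boosts high frequencies before the STFT.
    return signal.lfilter([1.0, -coef], [1.0], wav)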

thorstenMueller commented 3 years ago

Thanks @SanjaESC. That's the point 👍. Our taco2 model has preemphasis set to 0.0 in its config.

What would your recommendation be:
a) Train a new taco2 model with preemphasis=0.98
b) Train the WaveGrad vocoder with preemphasis=0.0

Hop length (256) and sample rate (22 kHz) are matching. If this is going offtopic I'd open a new issue or ask on Mozilla Discourse for help. I do not want to hijack this issue.

george-roussos commented 3 years ago

Try finetuning the TTS with 0.98 for a few steps. I did it with mine when it was 1100/275 and I wanted 1024/256, and it worked fine. :)

lexkoro commented 3 years ago

@thorstenMueller Try finetuning first as suggested by @george-roussos.

I think it's best to adjust the vocoder to match the tts settings as it is the "important" part of speech generation.

thorstenMueller commented 3 years ago

I've been in discussion with @olafthiele and @domcross from our TTS group and we will continue our taco2 model training with adjusted params. I checked the params from our model against the params from @Edresson (https://github.com/mozilla/TTS/issues/518#issuecomment-696122892) and found the following values to be different.

Can we/should we adjust all of these before continuing training?

Param           | thorsten model | Edresson model
stft_pad_mode   | not existing   | reflect
max_norm        | 1.0            | 4.0
do_trim_silence | true           | false
preemphasis     | 0.0            | 0.98

How many steps would you recommend, @george-roussos or @SanjaESC? The model has been trained for 460k steps.

george-roussos commented 3 years ago

I've been in discussion with @olafthiele and @domcross from our TTS group and we will continue our taco2 model training with adjusted params. I checked the params from our model against the params from @Edresson (#518 (comment)) and found the following values to be different.

Can we/should we adjust all of these before continuing training?

Param           | thorsten model | Edresson model
stft_pad_mode   | not existing   | reflect
max_norm        | 1.0            | 4.0
do_trim_silence | true           | false
preemphasis     | 0.0            | 0.98

How many steps would you recommend, @george-roussos or @SanjaESC? The model has been trained for 460k steps.

It should definitely work within 50k steps, if not less. When I did the finetuning from 1100/275 to 1024/256 it took 50k steps to be fully okay; before that the sound was too fast because of the smaller sizes. The trim_silence matters if you want to train with GTA features.

Good luck! 😀

erogol commented 3 years ago

Currently, we have a WaveGrad implementation in the Mozilla TTS dev branch for anyone who would like to give it a shot. I should say it is the best quality model we have trained so far.

nmstoker commented 3 years ago

Thanks @erogol. Having a go at training WaveGrad right now. I'll monitor progress and quality, but any general suggestions on how many steps I should expect to run it for?

oytunturk commented 3 years ago

Currently, we have a WaveGrad implementation in the Mozilla TTS dev branch for anyone who would like to give it a shot. I should say it is the best quality model we have trained so far.

Thanks @erogol! I ran some quick experiments and the quality is indeed much better than previous recipes I've tried here for multi-speaker TTS. It's definitely slower, ~5x slower on GPU compared to the fastest recipes I checked. I'm seeing RTFs in the 1.8-2.9 range when running two synthesis instances on a single Titan X.

thorstenMueller commented 3 years ago

Hi guys. @olafthiele continued our taco2 training with adjusted params for a further 50k steps (now 510k total taco2 steps). Just the following two params differ:

Sadly the quality didn't improve even though we use the new model, config and scale_stats. I tried running an adjusted notebook from @Edresson. See my test notebook here: https://colab.research.google.com/drive/1uZHCcmLoNckU1dDgkKspg3qXct9oRW4W?usp=sharing

I'm not sure what to do. Is the pretrained vocoder model incompatible with our taco2 training, or did I make a mistake adjusting the existing notebook from @Edresson?

> Setting up Audio Processor...
 | > sample_rate:22050
 | > num_mels:80
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:20
 | > fft_size:1024
 | > power:1.5
 | > preemphasis:0.98
 | > griffin_lim_iters:60
 | > signal_norm:True
 | > symmetric_norm:True
 | > mel_fmin:0
 | > mel_fmax:8000.0
 | > spec_gain:20.0
 | > stft_pad_mode:reflect
 | > max_norm:1.0
 | > clip_norm:True
 | > do_trim_silence:True
 | > trim_db:60
 | > do_sound_norm:True
 | > stats_path:./scale_stats.npy
 | > hop_length:256
 | > win_length:1024
 > Using model: Tacotron2
use_external_speaker_embedding_file set in config file

I've uploaded three audio samples and the 1_spec.pt files. Maybe someone has an idea what the problem might be.

WaveGradPretrainedThorstenTaco2.zip

george-roussos commented 3 years ago

If you want to train from scratch, it doesn't take longer than 2 days to reach full convergence 🙂

nmstoker commented 3 years ago

I ran into a minor issue when trying to use --continue_path in training, as it didn't seem to set the optimiser correctly. It was late at night so I commented out the resetting of the optimiser, which let it carry on. (A similar issue seems to have come up with the other vocoders but I couldn't quite infer what the exact problem was.)

Now I've gone to test the vocoder with server.py and it seems to have an issue with the noise_level not being set. I will update this with better details shortly (e.g. Sunday morning) but thought I'd mention it in case others had similar issues. It's possible the second problem is a side-effect of my optimiser fix, so I'll try digging into that tomorrow.

  File "/home/neil/main/Projects/TTSNov2020/TTS/TTS/vocoder/models/wavegrad.py", line 92, in inference
    sqrt_alpha_hat = self.noise_level.to(x)
AttributeError: 'NoneType' object has no attribute 'to'

nmstoker commented 3 years ago

I haven't quite figured out what I need to do to get the noise_level set (so the .to(x) part can work), but here is the additional info on my setup that I didn't have time to add last night.

In the end, to rule out the risk that my cutting out the optimiser was behind the noise_level issue, I simply restarted my training from scratch and waited till it got to 75k steps to try again. Before restarting I updated from the repo so I got the most recent changes, applied early yesterday (which had been added right after my first training run above).

The vocoder settings are below. Mostly as it is in the standard config file, except minor adjustments simply to make it match the TTS settings for the Taco2 TTS model I'm hoping to use with it.

Command line for server.py is included here but it and the output including the error are in the log file details below (if you expand the section)

python TTS/server/server.py --tts_config ../../TTSOct2020/models/neil18/neil18_2-September-27-2020_01+10AM-665f7ca/config.json --tts_checkpoint ../../TTSOct2020/models/neil18/neil18_2-September-27-2020_01+10AM-665f7ca/checkpoint_408000.pth.tar --vocoder_config /home/neil/main/Projects/TTSNov2020/models/wavegrad-neil18-November-14-2020_11+28PM-a2a142d/config.json --vocoder_checkpoint /home/neil/main/Projects/TTSNov2020/models/wavegrad-neil18-November-14-2020_11+28PM-a2a142d/best_model.pth.tar --debug False --use_cuda True

Details

Platform OS

Python Environment

Package Installation

Configuration

Click to see config file (lines: 117)
:page_facing_up: Contents from: /home/neil/main/Projects/TTSNov2020/models/wavegrad-neil18-November-14-2020_11+28PM-a2a142d/config.json

{
    "github_branch":"* dev",
    "run_name": "wavegrad-neil18",
    "run_description": "wavegrad neil18",

    "audio":{
        "fft_size": 1024,         // number of stft frequency levels. Size of the linear spectogram frame.
        "win_length": 1024,       // stft window length in ms.
        "hop_length": 256,        // stft window hop-lengh in ms.
        "frame_length_ms": null,  // stft window length in ms. If null, 'win_length' is used.
        "frame_shift_ms": null,   // stft window hop-lengh in ms. If null, 'hop_length' is used.
        // Audio processing parameters
        "sample_rate": 22050,     // DATASET-RELATED: wav sample-rate. If different than the original data, it is resampled.
        "preemphasis": 0.99,      // pre-emphasis to reduce spec noise and make it more structured. If 0.0, no -pre-emphasis.
        "ref_level_db": 0,        // reference level db, theoretically 20db is the sound of air.
        // Silence trimming
        "do_trim_silence": true,  // enable trimming of slience of audio as you load it. LJspeech (false), TWEB (false), Nancy (true)
        "trim_db": 60,            // threshold for timming silence. Set this according to your dataset.
        // MelSpectrogram parameters
        "num_mels": 80,           // size of the mel spec frame.
        "mel_fmin": 0.0,          // minimum freq level for mel-spec. ~50 for male and ~95 for female voices. Tune for dataset!!
        "mel_fmax": 8000.0,       // maximum freq level for mel-spec. Tune for dataset!!
        "spec_gain": 1.0,         // scaler value appplied after log transform of spectrogram.
        // Normalization parameters
        "signal_norm": true,      // normalize spec values. Mean-Var normalization if 'stats_path' is defined otherwise range normalization defined by the other params.
        "min_level_db": -10,      // lower bound for normalization
        "symmetric_norm": true,   // move normalization to range [-1, 1]
        "max_norm": 4.0,          // scale normalization to range [-max_norm, max_norm] or [0, max_norm]
        "clip_norm": true,        // clip normalized values into the range.
        "stats_path": null        //"stats_path": "/home/erogol/Data/libritts/LibriTTS/scale_stats_wavegrad.npy" // DO NOT USE WITH MULTI_SPEAKER MODEL. scaler stats file computed by 'compute_statistics.py'. If it is defined, mean-std based notmalization is used and other normalization params are ignored
    },

    // DISTRIBUTED TRAINING
    "mixed_precision": true,      // enable torch mixed precision training (true, false)
    "distributed":{
        "backend": "nccl",
        "url": "tcp:\/\/localhost:54322"
    },

    "target_loss": "avg_wavegrad_loss",  // loss value to pick the best model to save after each epoch

    // MODEL PARAMETERS
    "generator_model": "wavegrad",
    "model_params":{
        "use_weight_norm": true,
        "y_conv_channels": 32,
        "x_conv_channels": 768,
        "ublock_out_channels": [512, 512, 256, 128, 128],
        "dblock_out_channels": [128, 128, 256, 512],
        "upsample_factors": [4, 4, 4, 2, 2],
        "upsample_dilations": [[1, 2, 1, 2], [1, 2, 1, 2], [1, 2, 4, 8], [1, 2, 4, 8], [1, 2, 4, 8]]
    },

    // DATASET
    "data_path": "/home/neil/data/Projects/NeilTTS/neil18/wavs_wavegrad/",  // root data path. It finds all wav files recursively from there.
    "feature_path": null,         // if you use precomputed features
    "seq_len": 6144,              // 24 * hop_length
    "pad_short": 0,               // additional padding for short wavs
    "conv_pad": 0,                // additional padding against convolutions applied to spectrograms
    "use_noise_augment": false,   // add noise to the audio signal for augmentation
    "use_cache": false,           // use in memory cache to keep the computed features. This might cause OOM.
    "reinit_layers": [],          // give a list of layer names to restore from the given checkpoint. If not defined, it reloads all heuristically matching layers.

    // TRAINING
    "batch_size": 96,             // Batch size for training.
    "train_noise_schedule":{
        "min_val": 1e-6,
        "max_val": 1e-2,
        "num_steps": 1000
    },
    "test_noise_schedule":{
        "min_val": 1e-6,
        "max_val": 1e-2,
        "num_steps": 50
    },

    // VALIDATION
    "run_eval": true,             // enable/disable evaluation run

    // OPTIMIZER
    "epochs": 10000,              // total number of epochs to train.
    "clip_grad": 1.0,             // Generator gradient clipping threshold. Apply gradient clipping if > 0
    "lr_scheduler": "MultiStepLR",  // one of the schedulers from https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
    "lr_scheduler_params": {
        "gamma": 0.5,
        "milestones": [100000, 200000, 300000, 400000, 500000, 600000]
    },
    "lr": 1e-4,                   // Initial learning rate. If Noam decay is active, maximum learning rate.

    // TENSORBOARD and LOGGING
    "print_step": 50,             // Number of steps to log traning on console.
    "print_eval": false,          // If True, it prints loss values for each step in eval run.
    "save_step": 5000,            // Number of training steps expected to plot training stats on TB and save model checkpoints.
    "checkpoint": true,           // If true, it saves checkpoints per "save_step"
    "tb_model_param_stats": true, // true, plots param stats per layer on tensorboard. Might be memory consuming, but good for debugging.

    // DATA LOADING
    "num_loader_workers": 4,      // number of training data loader processes. Don't set it too big. 4-8 are good values.
    "num_val_loader_workers": 4,  // number of evaluation data loader processes.
    "eval_split_size": 256,

    // PATHS
    "output_path": "/home/neil/main/Projects/TTSNov2020/models/"
}

Logfile

Click to see log file (lines: 70)
:page_facing_up: Logfile: /home/neil/main/Projects/TTSNov2020/models/server_problem.log

(tts_nov2020) [neil@ramandu TTS]$ python TTS/server/server.py --tts_config ../../TTSOct2020/models/neil18/neil18_2-September-27-2020_01+10AM-665f7ca/config.json --tts_checkpoint ../../TTSOct2020/models/neil18/neil18_2-September-27-2020_01+10AM-665f7ca/checkpoint_408000.pth.tar --vocoder_config /home/neil/main/Projects/TTSNov2020/models/wavegrad-neil18-November-14-2020_11+28PM-a2a142d/config.json --vocoder_checkpoint /home/neil/main/Projects/TTSNov2020/models/wavegrad-neil18-November-14-2020_11+28PM-a2a142d/best_model.pth.tar --debug False --use_cuda True
2020-11-15 21:20:16.134403: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/opt/kaldi/tools/openfst/lib
2020-11-15 21:20:16.134427: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Namespace(debug=False, is_wavernn_batched=False, port=5002, tts_checkpoint='../../TTSOct2020/models/neil18/neil18_2-September-27-2020_01+10AM-665f7ca/checkpoint_408000.pth.tar', tts_config='../../TTSOct2020/models/neil18/neil18_2-September-27-2020_01+10AM-665f7ca/config.json', tts_speakers=None, use_cuda=True, vocoder_checkpoint='/home/neil/main/Projects/TTSNov2020/models/wavegrad-neil18-November-14-2020_11+28PM-a2a142d/best_model.pth.tar', vocoder_config='/home/neil/main/Projects/TTSNov2020/models/wavegrad-neil18-November-14-2020_11+28PM-a2a142d/config.json', wavernn_checkpoint=None, wavernn_config=None, wavernn_lib_path=None)
 > Loading TTS model ...
 | > model config: ../../TTSOct2020/models/neil18/neil18_2-September-27-2020_01+10AM-665f7ca/config.json
 | > checkpoint file: ../../TTSOct2020/models/neil18/neil18_2-September-27-2020_01+10AM-665f7ca/checkpoint_408000.pth.tar
 > Setting up Audio Processor...
 | > sample_rate:22050
 | > num_mels:80
 | > min_level_db:-10
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:0
 | > fft_size:1024
 | > power:1.8
 | > preemphasis:0.99
 | > griffin_lim_iters:60
 | > signal_norm:True
 | > symmetric_norm:True
 | > mel_fmin:0
 | > mel_fmax:8000.0
 | > spec_gain:1.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:True
 | > do_trim_silence:True
 | > trim_db:60
 | > do_sound_norm:False
 | > stats_path:None
 | > hop_length:256
 | > win_length:1024
 > Using model: Tacotron2
 > model reduction factor: 1
 > Generator Model: wavegrad
 * Serving Flask app "server" (lazy loading)
 * Environment: production
   WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
 * Debug mode: off
[INFO]  * Running on http://0.0.0.0:5002/ (Press CTRL+C to quit)
[INFO] 192.168.1.66 - - [15/Nov/2020 21:20:33] "GET / HTTP/1.1" 200 -
[INFO] 192.168.1.66 - - [15/Nov/2020 21:20:34] "GET /favicon.ico HTTP/1.1" 404 -
 > Model input: How are you today?
['How are you today?']
[ERROR] Exception on /api/tts [GET]
Traceback (most recent call last):
  File "/home/neil/.conda/envs/tts_nov2020/lib/python3.7/site-packages/Flask-1.1.2-py3.7.egg/flask/app.py", line 2447, in wsgi_app
    response = self.full_dispatch_request()
  File "/home/neil/.conda/envs/tts_nov2020/lib/python3.7/site-packages/Flask-1.1.2-py3.7.egg/flask/app.py", line 1952, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/home/neil/.conda/envs/tts_nov2020/lib/python3.7/site-packages/Flask-1.1.2-py3.7.egg/flask/app.py", line 1821, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/home/neil/.conda/envs/tts_nov2020/lib/python3.7/site-packages/Flask-1.1.2-py3.7.egg/flask/_compat.py", line 39, in reraise
    raise value
  File "/home/neil/.conda/envs/tts_nov2020/lib/python3.7/site-packages/Flask-1.1.2-py3.7.egg/flask/app.py", line 1950, in full_dispatch_request
    rv = self.dispatch_request()
  File "/home/neil/.conda/envs/tts_nov2020/lib/python3.7/site-packages/Flask-1.1.2-py3.7.egg/flask/app.py", line 1936, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "TTS/server/server.py", line 77, in tts
    data = synthesizer.tts(text)
  File "/home/neil/main/Projects/TTSNov2020/TTS/TTS/server/synthesizer.py", line 154, in tts
    wav = self.vocoder_model.inference(vocoder_input)
  File "/home/neil/.conda/envs/tts_nov2020/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
    return func(*args, **kwargs)
  File "/home/neil/main/Projects/TTSNov2020/TTS/TTS/vocoder/models/wavegrad.py", line 92, in inference
    sqrt_alpha_hat = self.noise_level.to(x)
AttributeError: 'NoneType' object has no attribute 'to'
[INFO] 192.168.1.66 - - [15/Nov/2020 21:20:39] "GET /api/tts?text=How%20are%20you%20today%3F HTTP/1.1" 500 -

- generated at 21:31 on Nov 15 2020 using Gather Up tool :gift:

lexkoro commented 3 years ago

I don't think WaveGrad functionality has been added to the server yet. To compute the noise schedule you could try adding self.vocoder_model.compute_noise_level(50, 1e-6, 1e-2) at line 92 in synthesizer.py.

def load_vocoder(self, model_file, model_config, use_cuda):
    self.vocoder_config = load_config(model_config)
    self.vocoder_model = setup_generator(self.vocoder_config)
    self.vocoder_model.load_state_dict(torch.load(model_file, map_location="cpu")["model"])
    self.vocoder_model.remove_weight_norm()
    self.vocoder_model.inference_padding = 0
    self.vocoder_config = load_config(model_config)

    if use_cuda:
        self.vocoder_model.cuda()
    self.vocoder_model.eval()
    self.vocoder_model.compute_noise_level(50, 1e-6, 1e-2)

There is also https://github.com/mozilla/TTS/blob/dev/TTS/bin/tune_wavegrad.py which will compute a noise schedule adapted to the trained model and your dataset. But the above should work for a start.

nmstoker commented 3 years ago

Thanks @SanjaESC - that's very kind of you, it has done the trick!

I reverted to c80225544e2fb43abbccd94148cc2045d95f8f63, as the most recent commit (which allows flexibility with the beta details in compute_noise_level) had changed the parameters it expected. (Edit: I realise now I could've gone to a newer commit, so long as it was before the ones from Saturday morning.)

The quality seems good. I experimented with the number of steps; 50 to 100 seems decent, 10 has a lot more noise, and above 100 wasn't noticeably doing anything for the quality. I suspect my vocoder needs to train for longer still, as there's a subtle hissing between words. I'd broken off training a little early (to test it at 75k steps), which is around 20-ish hours on my 1080 Ti.

Here's a sample: https://soundcloud.com/user-726556259/sherlock-wavegrad-sample

thorstenMueller commented 3 years ago

I've started a WaveGrad training on a taco2 model (taco2 "thorsten" training for 510k steps). After 27 hours (epoch 64) it stopped with the following error:

   --> STEP: 173/233 -- GLOBAL_STEP: 525150
     | > wavegrad_loss: 0.04379  (0.04557)
     | > step_time: 2.93
     | > loader_time: 1.4346
     | > current_lr: 0.0001
     | > grad_norm: 1.336453914642334

   --> STEP: 223/233 -- GLOBAL_STEP: 525200
     | > wavegrad_loss: 0.03929  (0.04581)
     | > step_time: 2.94
     | > loader_time: 0.0206
     | > current_lr: 0.0001
     | > grad_norm: 2.0447957515716553
[WARNING] NaN or Inf found in input tensor.

   --> TRAIN PERFORMACE -- EPOCH TIME: 688.11 sec -- GLOBAL_STEP: 525210
     | > avg_wavegrad_loss: 0.04597
     | > avg_loader_time: 0.47541
     | > avg_step_time: 2.94022

[WARNING] NaN or Inf found in input tensor.
 ! Run is kept in /home/thorsten/___prj/tts/models/thorsten-wavegrad/
Traceback (most recent call last):
  File "./TTS/bin/train_vocoder_wavegrad.py", line 502, in <module>
    main(args)
  File "./TTS/bin/train_vocoder_wavegrad.py", line 401, in main
    epoch)
  File "./TTS/bin/train_vocoder_wavegrad.py", line 223, in train
    tb_logger.tb_model_weights(model, global_step)
  File "/home/thorsten/___prj/tts/mozilla/TTS/TTS/utils/tensorboard_logger.py", line 35, in tb_model_weights
    "layer{}-{}/grad".format(layer_num, name), param.grad, step)
  File "/home/thorsten/___prj/tts/mozilla/lib/python3.6/site-packages/tensorboardX/writer.py", line 503, in add_histogram
    histogram(tag, values, bins, max_bins=max_bins), global_step, walltime)
  File "/home/thorsten/___prj/tts/mozilla/lib/python3.6/site-packages/tensorboardX/summary.py", line 210, in histogram
    hist = make_histogram(values.astype(float), bins, max_bins)
  File "/home/thorsten/___prj/tts/mozilla/lib/python3.6/site-packages/tensorboardX/summary.py", line 248, in make_histogram
    raise ValueError('The histogram is empty, please file a bug report.')
ValueError: The histogram is empty, please file a bug report.

I've changed the following two config values and started a new run. Hopefully this will help.

lr: (old) 1e-4 --> (new) 5e-4
scheduler gamma: (old) 0.5 --> (new) 0.9

Do you have any suggestions on that?

thorstenMueller commented 3 years ago

Training failed again (this time already in epoch 9).

 > EPOCH: 9/10000

 > TRAINING (2020-11-22 12:14:53) 

   --> STEP: 43/233 -- GLOBAL_STEP: 512150
     | > wavegrad_loss: 0.05503  (0.05862)
     | > step_time: 2.97
     | > loader_time: 0.0173
     | > current_lr: 0.0005
     | > grad_norm: 1.5453391075134277

   --> STEP: 93/233 -- GLOBAL_STEP: 512200
     | > wavegrad_loss: 0.07329  (0.06146)
     | > step_time: 2.99
     | > loader_time: 0.0398
     | > current_lr: 0.0005
     | > grad_norm: 3.6150529384613037
[WARNING] NaN or Inf found in input tensor.
 ! Run is kept in /home/thorsten/___prj/tts/models/thorsten-wavegrad/
Traceback (most recent call last):
  File "./TTS/bin/train_vocoder_wavegrad.py", line 502, in <module>
    main(args)
  File "./TTS/bin/train_vocoder_wavegrad.py", line 401, in main
    epoch)
  File "./TTS/bin/train_vocoder_wavegrad.py", line 128, in train
    raise RuntimeError(f'Detected NaN loss at step {global_step}.')
RuntimeError: Detected NaN loss at step 512217.

I encountered these NaN issues while trying another WaveGrad implementation too (see this issue: https://github.com/ivanvovk/WaveGrad/issues/8#issuecomment-706767562).

Should I upload my config.json for the vocoder training, or which parameters would be helpful to share for further analysis (batch size is set to 96 right now)?

@george-roussos or @erogol: do you have any ideas? How do I know which values for lr, etc. are working?

george-roussos commented 3 years ago

I never ran into this, but I have only trained using the fork here: https://github.com/freds0/wavegrad. Then I took the spectrogram from TTS and passed it there.

Can you check whether you are using the original LR etc. values?

thorstenMueller commented 3 years ago

Compared with the original config file (https://github.com/mozilla/TTS/blob/dev/TTS/vocoder/configs/wavegrad_libritts.json), I've changed the following config values so they match our taco2 training config:

"restore_path":"/home/thorsten/___prj/tts/models/thorsten-wavegrad/wavegrad510.pth.tar",
"github_branch":"* dev",
"sample_rate": 22050,  (ORIG: 24000)
"preemphasis": 0.98, (ORIG: 0.0)
"ref_level_db": 20, (ORIG: 0)
"do_sound_norm": true, (ORIG: Key in config not defined)
"mel_fmin": 0.0, (ORIG: 50.0)
"mel_fmax": 8000.0, (ORIG: 7600.0)
"spec_gain": 20.0, (ORIG: 1.0)
"max_norm": 1.0, (ORIG: 4.0)
"stats_path": "/home/thorsten/___prj/tts/models/thorsten-wavegrad/wg_scale_stats.npy",
"data_path": "/home/thorsten/___prj/tts/datasets/thorsten-de_v02/",
"gamma": 0.9, (ORIG: 0.5)
"lr": 5e-4, (ORIG: 1e-4)
"output_path": "/home/thorsten/___prj/tts/models/thorsten-wavegrad/output"

@george-roussos As far as I know, vocoder training should match the values from taco2 training, right? So resetting all values to the originals as defined in the Mozilla TTS repo might not be a good idea.

george-roussos commented 3 years ago

These definitely have to match. But a NaN error means a gradient problem (please someone correct me if I'm wrong). Can you check whether it happens with the original LR and gamma values? If everything fails, you can always use the fork I linked (it uses the Mozilla standardization), then grab the spectrogram from synthesize.py and feed it in. It will work the same.
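If it keeps failing, a generic PyTorch way to locate the operation that first produces the NaN (not something built into the training script) is to enable anomaly detection before the training loop:

import torch

# Makes the backward pass raise with a traceback pointing at the op
# whose gradient turned NaN/Inf (slow; use only for debugging).
torch.autograd.set_detect_anomaly(True)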