mozilla / TTS

:robot: :speech_balloon: Deep learning for Text to Speech (Discussion forum: https://discourse.mozilla.org/c/tts)

Tacotron 2 LJ run with current version and r=1 #134

Closed: m-toman closed this issue 4 years ago

m-toman commented 5 years ago

Hi,

just tried running the LJ dataset on characters (not phonemes, since I want a comparison with an existing model I have) with r=1 and your BatchNorm version (latest dev-tacotron2 branch), and after more than 70k steps I still get alignments like these:

[attention alignment plot]

Have you seen something like this before?

Accordingly, the audio is just babbling for non-teacher-forced inputs.

The spectrograms look like this: [spectrogram plots]

I assume the blank parts to the right are padding?
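For illustration, those blank regions are what zero-padding looks like: shorter utterances in a batch are padded to the longest one. A generic collate sketch (not this repo's actual dataloader code; shapes are assumptions):

```python
# Why batched spectrograms show blank columns on the right: shorter
# utterances are zero-padded to the longest one in the batch.
# Generic illustration, not this repo's actual collate function.
import torch

def collate_mels(mels):
    """mels: list of [num_mels, T_i] tensors with varying lengths T_i."""
    max_len = max(m.shape[1] for m in mels)
    batch = torch.zeros(len(mels), mels[0].shape[0], max_len)
    for i, m in enumerate(mels):
        batch[i, :, :m.shape[1]] = m  # frames beyond T_i stay zero: the blank part
    return batch

batch = collate_mels([torch.rand(80, 120), torch.rand(80, 300)])
print(batch.shape)  # torch.Size([2, 80, 300]); the first item gets 180 zero frames
```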

```json
{

"run_name": "LJ-r1",
"run_description": "training r=1",

"audio":{
    // Audio processing parameters
    "num_mels": 80,         // size of the mel spec frame. 
    "num_freq": 1025,       // number of stft frequency levels. Size of the linear spectogram frame.
    "sample_rate": 22050,   // wav sample-rate. If different than the original data, it is resampled.
    "frame_length_ms": 50,  // stft window length in ms.
    "frame_shift_ms": 12.5, // stft window hop-lengh in ms.
    "preemphasis": 0.98,    // pre-emphasis to reduce spec noise and make it more structured. If 0.0, no -pre-emphasis.
    "min_level_db": -100,   // normalization range
    "ref_level_db": 20,     // reference level db, theoretically 20db is the sound of air.
    "power": 1.5,           // value to sharpen wav signals after GL algorithm.
    "griffin_lim_iters": 60,// #griffin-lim iterations. 30-60 is a good range. Larger the value, slower the generation.
    // Normalization parameters
    "signal_norm": true,    // normalize the spec values in range [0, 1]
    "symmetric_norm": false, // move normalization to range [-1, 1]
    "max_norm": 1,          // scale normalization to range [-max_norm, max_norm] or [0, max_norm]
    "clip_norm": true,      // clip normalized values into the range.
    "mel_fmin": 95.0,         // minimum freq level for mel-spec. ~50 for male and ~95 for female voices. Tune for dataset!!
    "mel_fmax": 8000.0,        // maximum freq level for mel-spec. Tune for dataset!!
    "do_trim_silence": false  // enable trimming of slience of audio as you load it. LJspeech (false), TWEB (false), Nancy (true)
},

"distributed":{
    "backend": "nccl",
    "url": "tcp:\/\/localhost:54321"
},

"model": "Tacotron2",   // one of the model in models/    
"grad_clip": 0.02,      // upper limit for gradients for clipping.
"epochs": 1000,         // total number of epochs to train.
"lr": 0.0001,            // Initial learning rate. If Noam decay is active, maximum learning rate.
"lr_decay": false,      // if true, Noam learning rate decaying is applied through training.
"warmup_steps": 4000,   // Noam decay steps to increase the learning rate from 0 to "lr"
"windowing": false,      // Enables attention windowing. Used only in eval mode.
"memory_size": 5,       // TO BE IMPLEMENTED -- memory queue size used to queue network predictions to feed autoregressive connection. Useful if r < 5. 
"batch_size": 32,       // Batch size for training. Lower values than 32 might cause hard to learn attention.
"eval_batch_size":16,
"r": 1,                 // Number of frames to predict for step.
"wd": 0.000002,         // Weight decay weight.
"checkpoint": true,     // If true, it saves checkpoints per "save_step"
"save_step": 5000,      // Number of training steps expected to save traning stats and checkpoints.
"print_step": 10,       // Number of steps to log traning on console.
"tb_model_param_stats": false,     // true, plots param stats per layer on tensorboard. Might be memory consuming, but good for debugging. 
"batch_group_size": 8,  //Number of batches to shuffle after bucketing.

"run_eval": false,
"test_delay_epochs": 100,  //Until attention is aligned, testing only wastes computation time.
"data_path": "/media/erogol/data_ssd/Data/LJSpeech-1.1",  // DATASET-RELATED: can overwritten from command argument
"meta_file_train": "meta/metadata_train.csv",      // DATASET-RELATED: metafile for training dataloader.
"meta_file_val": "meta/metadata_val.csv",    // DATASET-RELATED: metafile for evaluation dataloader.
"dataset": "ljspeech",      // DATASET-RELATED: one of TTS.dataset.preprocessors depending on your target dataset. Use "tts_cache" for pre-computed dataset by extract_features.py
"min_seq_len": 0,       // DATASET-RELATED: minimum text length to use in training
"max_seq_len": 1000,     // DATASET-RELATED: maximum text length
"output_path": "/media/erogol/data_ssd/Data/models/ljspeech_models/",      // DATASET-RELATED: output path for all training outputs.
"num_loader_workers": 8,        // number of training data loader processes. Don't set it too big. 4-8 are good values.
"num_val_loader_workers": 4,    // number of evaluation data loader processes.
"phoneme_cache_path": "ljspeech_us_phonemes",  // phoneme computation is slow, therefore, it caches results in the given folder.
"use_phonemes": false,           // use phonemes instead of raw characters. It is suggested for better pronounciation.
"phoneme_language": "en-us",     // depending on your target language, pick one from  https://github.com/bootphon/phonemizer#languages
"text_cleaner": "phoneme_cleaners"

`

Or do I still have to start with r=5 and then later proceed with r=1?
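(For context: r is the reduction factor, i.e. how many spectrogram frames the decoder emits per decoder step, so r=1 gives attention five times as many decoder steps to align as r=5 for the same utterance. Plain arithmetic, not model code:)

```python
# Schematic: a higher reduction factor shortens the decoder sequence that
# attention must align. The utterance length is a made-up example value.
num_mel_frames = 500
for r in (5, 2, 1):
    print(f"r={r}: {num_mel_frames // r} decoder steps, {r} frames per step")
```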

erogol commented 5 years ago

@m-toman for T2 with LJSpeech, I trained first with the normal prenet and then fine-tuned it with the BN prenet, since training takes too much time if you start from zero.

I am also training with BN on the Nancy dataset from scratch, and attention had not aligned so far. Then I added dropout to the prenet output and reinitialized the attention layers, and after a while it started to align. You might like to try that as well.

And I've not tried character-level training for a long time, since phonemes give an easy improvement.
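For readers following along, here is a minimal sketch of the two prenet variants contrasted above: the "original" prenet (Linear + ReLU + dropout) versus a BatchNorm prenet (Linear + BatchNorm + ReLU, no dropout). The hidden sizes and exact layer order are assumptions, not the repo's exact code:

```python
# Minimal sketch of the two prenet flavors discussed above. The hidden
# sizes (256, 256) and the layer order are illustrative assumptions.
import torch.nn as nn

def make_prenet(in_dim, sizes=(256, 256), variant="original"):
    layers, d = [], in_dim
    for h in sizes:
        layers.append(nn.Linear(d, h))
        if variant == "bn":
            layers += [nn.BatchNorm1d(h), nn.ReLU()]  # BN prenet: no dropout
        else:
            layers += [nn.ReLU(), nn.Dropout(0.5)]    # original: ReLU + dropout
        d = h
    return nn.Sequential(*layers)

prenet = make_prenet(80, variant="bn")  # expects [batch, features] input
```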

m-toman commented 5 years ago

Thanks, interesting. But you're training with r=1 from scratch now, I assume?

Then I'll try as you said.

erogol commented 5 years ago

r=1 yes.

Also, if the network is able to generate good outputs before it aligns the attention, it becomes harder to learn attention after that point, since the loss is saturated and not enough gradient reaches the attention module. Therefore, it is sometimes useful to reinitialize attention.
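A minimal sketch of what reinitializing attention can look like in practice: restore a checkpoint but drop the attention weights, so that submodule keeps its fresh random initialization. The "model" checkpoint key and the "attention" substring filter are assumptions about naming, not the repo's exact restore code:

```python
# Sketch: load a checkpoint while leaving the attention layers at their
# fresh random init. The "model" key and the "attention" name filter are
# assumptions about this codebase's state-dict layout.
import torch

def restore_without_attention(model, checkpoint_path):
    state = torch.load(checkpoint_path, map_location="cpu")["model"]
    state = {k: v for k, v in state.items() if "attention" not in k}
    model.load_state_dict(state, strict=False)  # missing keys stay randomly initialized
    return model
```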

That being said, my guess is that adding dropout is useful because it degrades the model's training performance and thus produces more gradient.

You can also check your model's under-the-hood behavior by setting tb_model_param_stats=true.

m-toman commented 5 years ago

OK thanks, I'll close this for now and if I find something that might be useful for others I'll post it here.

m-toman commented 5 years ago

Hi,

I currently have some free GPU time and tried training LJ from scratch. I started out with the config from last week's commit (https://github.com/m-toman/TTS/blob/dev-tacotron2/config.json).

So: r=1, phoneme-based training, forward attention on, dropout in the prenet. But I never got successful alignment; at the moment it looks like this: [attention alignment plot]

Previously I also tried the same with forward attention off, which looks like this: [attention alignment plot]

Graphs for both (orange is with forward attention, blue without): [training curves]

I'm now also running one with forward attention on and r=5: [attention alignment plot] Those could potentially get better.

Any ideas why it's struggling so much? I use prenet dropout everywhere (`prenet_type: "original"` and `prenet_dropout: true`).
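For reference, forward attention (Zhang et al., 2018) constrains the alignment so probability mass can only stay in place or advance one encoder step per decoder step. A rough sketch of the recursion, not the repo's exact implementation:

```python
# Rough sketch of the forward-attention recursion: alignment mass may only
# stay put or move one encoder position forward per decoder step.
import torch
import torch.nn.functional as F

def forward_attention_step(alpha_prev, energies):
    """alpha_prev: [B, T_enc] previous alignment; energies: [B, T_enc] raw scores."""
    e = torch.softmax(energies, dim=-1)
    shifted = F.pad(alpha_prev, (1, 0))[:, :-1]  # mass arriving from position n-1
    alpha = (alpha_prev + shifted) * e
    return alpha / alpha.sum(dim=-1, keepdim=True).clamp(min=1e-8)
```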

erogol commented 5 years ago

I'd suggest using the config file of the latest released LJSpeech model. I've not tried forward attention with LJSpeech yet. However, for our internal dataset, the one that works best is config_cluster.json. I'll post here if I get a chance to try it on LJSpeech.

m-toman commented 5 years ago

Thanks

m-toman commented 5 years ago

Posting another experiment here, once again retraining LJ from scratch; config posted below. Training from the current master branch (fe38c26b86efaf1b3a1e0e85afc27344993b436e).

At > 30k steps, still no alignment: [attention alignment plot]

[spectrogram plot]

No idea why this is happening; I don't see any unreasonable settings in my config. Perhaps the forward attention? Or because I'm training on 2 GPUs with a batch size of 32 each?

Here is the config I'm using:

```json
{
    "model_name": "LJ",
    "model_description": "",

    "audio":{
        // Audio processing parameters
        "num_mels": 80,         // size of the mel spec frame. 
        "num_freq": 1025,       // number of stft frequency levels. Size of the linear spectogram frame.
        "sample_rate": 22050,   // wav sample-rate. If different than the original data, it is resampled.
        "frame_length_ms": 50,  // stft window length in ms.
        "frame_shift_ms": 12.5, // stft window hop-lengh in ms.
        "preemphasis": 0.98,    // pre-emphasis to reduce spec noise and make it more structured. If 0.0, no -pre-emphasis.
        "min_level_db": -100,   // normalization range
        "ref_level_db": 20,     // reference level db, theoretically 20db is the sound of air.
        "power": 1.5,           // value to sharpen wav signals after GL algorithm.
        "griffin_lim_iters": 60,// #griffin-lim iterations. 30-60 is a good range. Larger the value, slower the generation.
        // Normalization parameters
        "signal_norm": true,    // normalize the spec values in range [0, 1]
        "symmetric_norm": false, // move normalization to range [-1, 1]
        "max_norm": 1,          // scale normalization to range [-max_norm, max_norm] or [0, max_norm]
        "clip_norm": true,      // clip normalized values into the range.
        "mel_fmin": null,         // minimum freq level for mel-spec. ~50 for male and ~95 for female voices. Tune for dataset!!
        "mel_fmax": null,        // maximum freq level for mel-spec. Tune for dataset!!
        "do_trim_silence": true  // enable trimming of slience of audio as you load it. LJspeech (false), TWEB (false), Nancy (true)
    },

    "distributed":{
        "backend": "nccl",
        "url": "tcp:\/\/localhost:54321"
    },

    "embedding_size": 256,  // Character embedding vector length. You don't need to change it in general.
    "text_cleaner": "phoneme_cleaners",
    "epochs": 1000,         // total number of epochs to train.
    "lr": 0.0001,            // Initial learning rate. If Noam decay is active, maximum learning rate.
    "lr_decay": false,      // if true, Noam learning rate decaying is applied through training.
    "loss_weight": 0.0,     // loss weight to emphasize lower frequencies. Lower frequencies are in general more important for speech signals.
    "warmup_steps": 4000,   // Noam decay steps to increase the learning rate from 0 to "lr"
    "windowing": false,      // Enables attention windowing. Used only in eval mode.
    "memory_size": 5,       // memory queue size used to queue network predictions to feed autoregressive connection. Useful if r < 5. 
   "batch_size": 64,       // Batch size for training. Lower values than 32 might cause hard to learn attention.
    "eval_batch_size":32,   
    "r": 1,                 // Number of frames to predict for step.
    "wd": 0.00001,          // Weight decay weight.
    "checkpoint": true,     // If true, it saves checkpoints per "save_step"
    "save_step": 1000,      // Number of training steps expected to save traning stats and checkpoints.
    "print_step": 50,       // Number of steps to log traning on console.
    "tb_model_param_stats": false,     // true, plots param stats per layer on tensorboard. Might be memory consuming, but good for debugging. 
    "batch_group_size": 8,  //Number of batches to shuffle after bucketing.

    "run_eval": true,
    "test_delay_epochs": 100,  //Until attention is aligned, testing only wastes computation time.
        "data_path": "/home/markus/LJ-hvb/",  // DATASET-RELATED: can overwritten from command argument
        "meta_file_train": "/home/markus/LJ-hvb/etc/metadata_train.csv",      // DATASET-RELATED: metafile for training dataloader.
        "meta_file_val": "/home/markus/LJ-hvb/etc/metadata_val.csv",    // DATASET-RELATED: metafile for evaluation dataloader.
    "dataset": "ljspeech",      // DATASET-RELATED: one of TTS.dataset.preprocessors depending on your target dataset. Use "tts_cache" for pre-computed dataset by
 extract_features.py
    "min_seq_len": 0,       // DATASET-RELATED: minimum text length to use in training
    "max_seq_len": 300,     // DATASET-RELATED: maximum text length
        "output_path": "/home/markus/mozillalj/",      // DATASET-RELATED: output path for all training outputs.
    "num_loader_workers": 8,        // number of training data loader processes. Don't set it too big. 4-8 are good values.
    "num_val_loader_workers": 4,    // number of evaluation data loader processes.
    "phoneme_cache_path": "ljspeech_us_phonemes",  // phoneme computation is slow, therefore, it caches results in the given folder.
    "use_phonemes": true,           // use phonemes instead of raw characters. It is suggested for better pronounciation.
    "phoneme_language": "en-us"     // depending on your target language, pick one from  https://github.com/bootphon/phonemizer#languages
}
```

BTW, I saw you added RAdam. I recently retrained fatchord's WaveRNN (MoL) with RAdam + Lookahead (https://github.com/lessw2020/Ranger-Deep-Learning-Optimizer), and while I did not end up with a better model, the training path was much smoother.
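In case it's useful to anyone, a compact sketch of the Lookahead idea wrapped around torch.optim.RAdam (available in recent PyTorch). The Ranger repo linked above packages this far more carefully; this is illustrative only:

```python
# Minimal Lookahead sketch: keep slow weights and, every k steps, pull them
# toward the fast weights, then sync the fast weights back. Not Ranger's code.
import torch

class Lookahead:
    def __init__(self, base_opt, k=5, alpha=0.5):
        self.opt, self.k, self.alpha, self.steps = base_opt, k, alpha, 0
        self.params = [p for g in base_opt.param_groups for p in g["params"]]
        self.slow = [p.detach().clone() for p in self.params]

    def zero_grad(self):
        self.opt.zero_grad()

    def step(self):
        self.opt.step()
        self.steps += 1
        if self.steps % self.k == 0:
            for slow, fast in zip(self.slow, self.params):
                slow.add_(fast.detach() - slow, alpha=self.alpha)  # slow += a*(fast - slow)
                fast.data.copy_(slow)

model = torch.nn.Linear(4, 4)
opt = Lookahead(torch.optim.RAdam(model.parameters(), lr=1e-4))
```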

erogol commented 5 years ago

@m-toman thanks for the link, I'll check it out. RAdam converged more smoothly and slightly better in my case.

Forward attention might be the culprit. I have not run Tacotron2 for a long time; I've mostly been working on Tacotron in general. But let me know if it does not work after disabling forward_attn.

Do you use distribute.py? I see you set batch_size to 64. If you use it, that means you have 64 on each GPU.
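(That is, with one process per GPU, the configured batch_size applies per process; schematically:)

```python
# With one-process-per-GPU data parallelism, the config "batch_size" applies
# per process, so the effective global batch is batch_size * num_gpus.
num_gpus = 2       # two P100s in this case
batch_size = 64    # from the config above
print(batch_size * num_gpus)  # 128 samples per optimizer step
```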

m-toman commented 5 years ago

Yes, I do. Hmm, I'm surprised it even works with 64 then. I'm running on two P100s with 16 GB each. But I'll retry without forward attention and with batch size 32. Does it even make sense to use 2 GPUs then?

erogol commented 5 years ago

If 32 fits on one GPU, you don't need a second GPU, I guess.

kan-bayashi commented 5 years ago

Hi all. Is it still difficult to train from scratch with reduction factor = 1 and characters on LJSpeech? And is it easier if we use phonemes? In our implementation, there is no problem training with reduction factor = 1. I want to figure out the reason.

erogol commented 5 years ago

It works in general but different runs might give different results. Is your implementation open source?

kan-bayashi commented 5 years ago

Is it caused by a difference in random seeds? We are developing ESPnet, an end-to-end speech processing toolkit. We want to compare our samples with this implementation.

erogol commented 5 years ago

@kan-bayashi ohh yeah, I know ESPnet. I watched you at Interspeech. It is a great collection of models. Good job!

Not really. The random seed is constant, but if the dataset is noisy, attention sometimes behaves in weird ways. But I'd say that for a decent dataset, TTS should work without pain.

Regarding r=1, things work fine so far (at least for me) but take a long time. And for LJSpeech, attention is sometimes more fragile.

kan-bayashi commented 5 years ago

Thanks :) Recently we have been working hard not only on ASR but also on TTS.

I see. I have also encountered fragile attention learning when using speech with low SNR (e.g. some speakers in m-ailabs). First, I will compare with your released model!

erogol commented 4 years ago

With the current setup of TTS, I am able to train good models with r=1.