Multi-speaker Tacotron model training from scratch

oytunturk commented 3 years ago

Hi,

I'm trying to train a multi-speaker Tacotron model from scratch using VCTK + LibriTTS databases. The model trains fine until about 50K global steps but after that I start running into "CUDA out of memory", "NaN loss with key=decoder_coarse_loss", or "NaN loss with key=decoder_loss" errors. I tried reducing batch sizes, limiting input sequence lengths, and/or reducing learning rate but those didn't seem to help. I also tried training from scratch using VCTK only and ended up with similar errors. I'm training on a single Titan X GPU with 12GB memory. I didn't want to try multi-gpu training yet so I wonder if I should be setting some parameters differently in the config file. Any suggestions? Also, can someone explain the following parameters and how they should be set for single GPU training? Or, should I simply avoid single GPU training?

"num_loader_workers": 4,        // number of training data loader processes. Don't set it too big. 4-8 are good values.                                                                                 
"num_val_loader_workers": 4,    // number of evaluation data loader processes.                                                                                                                          
"batch_group_size": 4,  //Number of batches to shuffle after bucketing.

Thanks!

Additional info:

My branch is based on commit ea976b0543c7fa97628c41c4a936e3113896d18a
Config file attached
Tensorboard loss plots, attention alignments, output spectrograms, Griffin-Lim synthesized audio look/sound as expected before running into these errors
As far as I can tell, the errors occur pretty randomly. It could continue training a couple of thousands steps after 50K steps or fail after 500 steps. I don't also see any specific input files triggering these errors in a consistent manner.

albluc24 commented 3 years ago

it happened to me too at times. That's not an official solution nor a definitive one, but if you want a quick and dirty way, I just took the train command and put it in a bash while loop, so whenever it stopped training it would resume and eventually reach the point I wanted it to.

erogol commented 3 years ago

What is the problem exactly? NaN loss or OOM ?

oytunturk commented 3 years ago

I see both and it's pretty random. In the meanwhile, I tried the following but I'm still having similar issues:

1) Reducing all batch sizes to 32 and when that failed to 16.

2) Reducing max_seq_len to 120 and to even 80. Note that I'm using VCTK and LibriTTS databases with silence trimming on. I also checked the outputs of silence trimming to make sure I'm not getting any audio that's too short etc.

3) Tried adding forced memory clean-up in the training loop using torch.cuda.empty_cache(). This helped me get through more iterations without OOM errors but they eventually started showing up again.

4) I noticed that when I decrease save_step from 10K to a lower value in the config file, I see these errors more often. However, I can't say that I observe a consistent pattern. For instance, save_step=25000 trains fine until about 54K steps and then runs out of memory again.

5) I'm now trying fixed reduction rate. Should I try turning off double_decoder_consistency, too?

6) The OOM error can be handled as @albluc24 suggested but it eventually results in NaN loss.

7) I'll check the mel spectrogram values to see if there are all-zero or inf values. If that's the case and if there's any zero padding happening in the code, that might be causing some kind of underflow/overflow in mel spectrogram calculations.

8) I also tried using VCTK only assuming it's caused by adding LibriTTS but observed similar problems there, too.

I'll keep looking into it and updating you here but I wonder if you have other suggestions. Thanks!

erogol commented 3 years ago

Ok too much to cover here but,

OOM is likely to happen because some of the text is longer when converted to phonemes (e.g. sentences with numbers). So these samples are not filtered at the beginning but they are longer than the max seq len once they are converted. Dev has an option to compute phonemes at the beginning to solve this problem. Or remove such samples manually.

DDC also increases the memory usage you can try to disable it but the quality might get lower especially for multi-speaker models with harder datasets.

oytunturk commented 3 years ago

Thanks @erogol ! I'm using the phoneme cache so I assume that takes care of it but I'll check again. I'm training a non-DDC model to see if the problem persists there, too.

Can you please describe the steps needed to use multi-GPU training? What tools/libraries should be installed? After setting the distributed training options in the config file, setting the CUDA_VISIBLE_DEVICES environment variable to multiple GPUs, and running train_tacotron.py, it just seems to hang forever and do nothing. There are no errors but no progress or GPU activity as well.

erogol commented 3 years ago

https://github.com/mozilla/TTS#training-and-fine-tuning-lj-speech

oytunturk commented 3 years ago

Thanks! I'll give that a try.

oytunturk commented 3 years ago

Here's a brief update in case it helps others who might be experimenting with multi-speaker training:

Setting gradual training parameters as follows seems to resolve the OOM issue I've been seeing after 50K steps. "gradual_training": [[0, 7, 32], [1, 5, 32], [100000, 3, 16], [250000, 2, 16], [500000, 1, 16]] However, I'm not sure if it will happen again at 100K steps or whenever reduction factor and/or batch size gradually change.
On a separate training session, turning off DDC didn't cause any OOM/NaN loss issues so far. It's still struggling with attention alignment though.
Multi-gpu training seems to start fine after using the correct script. I didn't have spare GPUs to continue that training session but I will. Does anyone know if the training speed up is close to linear by adding more GPUs? Or, does loading the data, preprocessing etc become bottlenecks?

erogol commented 3 years ago

it is not linear due to GPU communication overhead.

oytunturk commented 3 years ago

Here's how attention alignments look like after 150K steps of Tacotron training using VCTK+LibriTTS databases:

Tensorboard outputs

Audio started to become somewhat intelligible but training alignments look concerning to me. Do you think I should continue training or do you have other suggestions to change hyper-parameters etc?

Thanks in advance!

erogol commented 3 years ago

your input mel specs look weird. Are you sure everything is alright. They are too bright but the mode outputs look alright.

oytunturk commented 3 years ago

At some point, I was experimenting with preemphasis=0.97. I think it's set to 0.0 for this one but I'll check again. I also noticed that I'm setting the sampling rate to 22KHz in the config file but max mel frequency is set as 7600Hz. I'll try setting it closer to the full bandwidth. BTW, I haven't resampled the original databases to 22KHz. I believe original VCTK is at 48KHz and LibriTTS is 24KHz. I made some changes in audio utils to handle resampling but maybe I did something wrong. Will be good to check again.

lexkoro commented 3 years ago

https://github.com/mozilla/TTS/issues/607#issuecomment-752310974 seems to have the same 'issue' with the mel specs. Seeing as you also use spec_gain: 1 but don't provide a stats_path I think it could be the problem. If I remember correctly I've encountered the same issue before. Try training with spec_gain: 20 for some steps and check tensorboard for the spectrograms if they seem 'normal'.

oytunturk commented 3 years ago

Thanks @SanjaESC ! I was actually using the stats_path initially but I decided to follow the comment in the config file to skip it. Anyway, I was running into other issues then. So, I'll check how spec_gain affects the output spectrograms, too.

oytunturk commented 3 years ago

Following changes and fixing an issue in how resampling is handled for my database results in better looking ground truth mel-spectrograms:

"sample_rate": 24000
"ref_level_db": 0
"spec_gain": 20

Too early to tell how training will progress. TensorBoard2

oytunturk commented 3 years ago

Multi-speaker training using VCTK+LibriTTS is going well. However, I'm noticing that multi-gpu training is much slower than training using just one GPU. I'm training very similar models using 10 GPUs and in a separate server using one GPU. Data sits at the same mounted location for both servers.

10 GPU server: a mix of TitanVs, RTX2080s, and GTX1080s / 56 CPU cores / 512GB RAM / training speed: ~3 seconds per step
Single GPU server: TitanX / 40 CPU cores / 256GB RAM / training speed: ~1 second/step

Has anyone experienced similar behavior? Any suggestions to speed up multi-gpu training?

Thanks!

oytunturk commented 3 years ago

After 150K steps on a single GPU, I ran into the 'NaN loss' issue again. It seems that the backward decoder alignments are failing for some reason. Do you have any insight on this? Thanks!

The error is:

Traceback (most recent call last): File "/dat/training/oturk/task_tts/mozilla_multispeaker_wavegrad_libritts/TTS_repo/TTS/bin/train_tacotron.py", line 698, in main(args) File "/dat/training/oturk/task_tts/mozilla_multispeaker_wavegrad_libritts/TTS_repo/TTS/bin/train_tacotron.py", line 609, in main global_step, epoch, scaler, scaler_st, speaker_mapping) File "/dat/training/oturk/task_tts/mozilla_multispeaker_wavegrad_libritts/TTS_repo/TTS/bin/train_tacotron.py", line 185, in train text_lengths) File "/home-shared/oytun/miniconda3/envs/mozilla-tts-multispeaker-wavegrad/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl result = self.forward(*input, **kwargs) File "/dat/training/oturk/task_tts/mozilla_multispeaker_wavegrad_libritts/TTS_repo/TTS/tts/layers/losses.py", line 402, in forward raise RuntimeError(" [!] NaN loss with key=" + str(key)) RuntimeError: [!] NaN loss with key=decoder_coarse_loss

Here's what my terminal looks like after introducing some debugging code to print out mel inputs, alignments etc. And, here's my tacotron config file. Tensorboard outputs: scalars , images

If I start it again from the 150K checkpoint, it goes fine for a while but it eventually ends up with the same error after a couple of thousand steps.

lexkoro commented 3 years ago

@oytunturk I am also running into the same issue as you are (using multi-speaker). I haven't looked into what could be the cause so far, so thanks for the logs and updates.

erogol commented 3 years ago

if r is not a multiple of ddc_r, it might happen but other than this I don't see any other reason. It'd be nice to debug.

oytunturk commented 3 years ago

Hi @erogol,

Do you mean ddc_r should be a multiple of r as we keep reducing r during training? I'll check if that helps.

I was also suspicious if silence trimming is not working well causing long silence segments in some files. If some of those segments are all zeros, that might be causing issues. Are there any safeguards against that in the code? It's usually common practice to add very low amplitude noise to each speech frame to guard against log zeros, divide by zeros, etc.

oytunturk commented 3 years ago

I have two training sessions going on. With ddc_r=8 and r progressively set to 8, 4, or 2 as training proceeds, there seems to be no issues. With ddc_r=7, I hit the OOM and NaN loss errors a couple of times when r was odd (7, 5, 3, 1). It didn't happen when r was even. Final stage of ddc_r=8 session will be using r=1. I wonder if I'll see similar errors there.

When I run into OOM or NaN loss issues when r is odd, I simply go back to a healthy checkpoint with even r, readjust r to be the closest even value and keep training. The models seem to work fine capturing different speakers and resulting in quite intelligible Griffin-Lim synthesis outputs so far but I'll try to train these models until convergence.

erogol commented 3 years ago

any update ?

oytunturk commented 3 years ago

Yes, indeed. It seems odd values of ddc_r are causing random OOM errors at least with the databases I'm using (VCTK+LibriTTS entire train set). I trained the model up to 1 million steps while reducing ddc_r progressively. The model trains fine when ddc_r=8, 4, 2 but when ddc_r=7, 5, 3, or 1, OOM keeps creeping up every couple of thousand steps. I was able to proceed by running the training script in a loop so that it continues training as soon as it fails. This seems to work fine, all losses go down nicely except the alignment loss which is kind of cyclical. It goes down for a while and then back up, then down again. The checkpoints with lower alignment loss seem to work a bit better but overall I'm still not satisfied with the model's robustness and quality.

It's probably obvious but LibriTTS is not an easy database for Tacotron training. My next experiments will focus on limiting the LibriTTS set to train-clean-100 portion and using ASR based alignments to detect/trim start/end silences or maybe eliminating some of the recordings that have long pauses in between words. Still, it might be necessary to use a more balanced, higher quality database for training a robust Tacotron model from scratch that covers more speakers. Or, maybe starting with VCTK and fine-tuning with more speakers after the model is stable.

Unless there are any ideas, I think we can close this issue. I'll re-open if I have interesting observations to share. Thanks for all the feedback!

rohin-dasari commented 3 years ago

I've been training Tacotron2 with dynamic convolution attention on the LibreTTS train-clean-100, and I've been running into the NaN issue with the decoder loss. I've been using a "r" value of 2. I haven't been using double decoder consistency, so I don't think I have to mess with the "ddc_r" parameter.

I've tried just allowing the error to pass silently, and trying to continue training, but the error seems to persist for the time being.

oytunturk commented 3 years ago

Looks like the following audio settings worked fine for me when training a LibriTTS model with r=2. ddc_r should not change anything if you are progressively reducing r. As far as I can remember, reducing trim_db and adjusting normalization parameters got rid of NaN issues. Maybe, adjusting min-max input sequence lengths, too.

"audio":{
    // stft parameters                                                                                                                                                                                  
    "fft_size": 1024,         // number of stft frequency levels. Size of the linear spectogram frame.                                                                                                  
    "win_length": 1024,      // stft window length in ms.                                                                                                                                               
    "hop_length": 256,       // stft window hop-lengh in ms.                                                                                                                                            
    "frame_length_ms": null, // stft window length in ms.If null, 'win_length' is used.                                                                                                                 
    "frame_shift_ms": null,  // stft window hop-lengh in ms. If null, 'hop_length' is used.                                                                                                             

    // Audio processing parameters                                                                                                                                                                      
    "sample_rate": 24000,   // DATASET-RELATED: wav sample-rate.                                                                                                                                        
    "preemphasis": 0.0,     // pre-emphasis to reduce spec noise and make it more structured. If 0.0, no -pre-emphasis.                                                                                 
    "ref_level_db": 0,     // reference level db, theoretically 20db is the sound of air.                                                                                                               

    // Silence trimming                                                                                                                                                                                 
    "do_trim_silence": true,// enable trimming of slience of audio as you load it. LJspeech (true), TWEB (false), Nancy (true)                                                                          
    //"trim_db": 60,          // threshold for timming silence. Set this according to your dataset.                                                                                                     
    "trim_db": 50, //reduce to trim more                                                                                                                                                                

    // Griffin-Lim                                                                                                                                                                                      
    "power": 1.5,           // value to sharpen wav signals after GL algorithm.                                                                                                                         
    "griffin_lim_iters": 60,// #griffin-lim iterations. 30-60 is a good range. Larger the value, slower the generation.                                                                                 

    // MelSpectrogram parameters                                                                                                                                                                        
    "num_mels": 80,         // size of the mel spec frame.                                                                                                                                              
    "mel_fmin": 50.0,        // minimum freq level for mel-spec. ~50 for male and ~95 for female voices. Tune for dataset!!                                                                             
    "mel_fmax": 7600.0,     // maximum freq level for mel-spec. Tune for dataset!!                                                                                                                      
    "spec_gain": 20,

    // Normalization parameters                                                                                                                                                                         
    "signal_norm": true,    // normalize spec values. Mean-Var normalization if 'stats_path' is defined otherwise range normalization defined by the other params.                                      
    "min_level_db": -100,   // lower bound for normalization                                                                                                                                            
    "symmetric_norm": true, // move normalization to range [-1, 1]                                                                                                                                      
    "max_norm": 4.0,        // scale normalization to range [-max_norm, max_norm] or [0, max_norm]                                                                                                      
    "clip_norm": true,      // clip normalized values into the range.                                                                                                                                                                                      
    "stats_path": null
},

mozilla / TTS

Multi-speaker Tacotron model training from scratch #612