neonbjb / tortoise-tts

A multi-voice TTS system trained with an emphasis on quality
Apache License 2.0
13.18k stars 1.82k forks

VQ-VAE training details #92

Closed — ioneuk closed this issue 2 years ago

ioneuk commented 2 years ago

Hello. Your work is absolutely great. One of the best TTS systems I've ever seen. I am trying to understand your work conceptually. My questions are related to VQ-VAE pretraining on the speech data:

  1. I am struggling to understand whether the diffusion model itself was trained as the VQ-VAE decoder, or whether the VQ-VAE decoder and the diffusion model are separate models trained independently.
  2. Was the VQ-VAE conditioned on the text and the conditioning latent?
  3. It would be great to know the architecture, losses, and other training details of the VQ-VAE model you used. Are they the same as described here?
neonbjb commented 2 years ago

Hey there.

  1. A VQVAE was trained separately using a conventional MSE objective. I then froze the encoder and quantizer and trained the diffusion model to decode the quantized outputs. Of note is that I have been experimenting with training the encoder+quantizer alongside the diffusion model on a recent project. This actually can be made to work pretty well - I'd recommend trying it if you are looking to reproduce this work.
  2. The VQVAE was not conditioned on either. The quantizer I used (exactly the one described in DeepMind's paper) was not tolerant of the more complex encoder that would have been required had I gone down that path. Whenever I deviated beyond a simple encoder architecture, the codebook would collapse during training. This is something I am actively looking into (as stated above).
  3. Architecture is here: https://github.com/neonbjb/DL-Art-School/blob/master/codes/models/audio/tts/lucidrains_dvae.py; Specific hyperparameters: {channels:80, codebook_dim:256, hidden_dim:512, kernel_size:3, num_layers:2, num_resnet_blocks:3, num_tokens:8192, positional_dims:1, use_transposed_convs:false}. Losses are the MSE reconstruction loss and a commitment loss as documented in the paper you linked (and returned by the specified model). The commitment loss gets a weight of 0.25; the MSE gets a weight of 1. I found a large batch size significantly improves the reconstruction loss of this model, so a batch size of 8192 was used. AdamW is used, LR=3e-4, WD=0.01, everything else is defaults. I did not do much tuning of these hyperparameters.
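To make the training recipe in point 3 concrete, here is a minimal PyTorch sketch of one VQ-VAE training step under those settings: MSE reconstruction plus a commitment loss weighted 0.25, optimized with AdamW (lr=3e-4, weight_decay=0.01). The `SimpleVQVAE` module below is a toy stand-in I wrote for illustration, not the actual `DiscreteVAE` from `lucidrains_dvae.py`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleVQVAE(nn.Module):
    """Toy 1D VQ-VAE over mel spectrograms (80 channels, 8192 codes)."""

    def __init__(self, channels=80, hidden_dim=512, codebook_dim=256,
                 num_tokens=8192):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(channels, hidden_dim, 3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden_dim, codebook_dim, 3, stride=2, padding=1),
        )
        self.codebook = nn.Embedding(num_tokens, codebook_dim)
        self.decoder = nn.Sequential(
            nn.ConvTranspose1d(codebook_dim, hidden_dim, 4, stride=2, padding=1),
            nn.ReLU(),
            nn.ConvTranspose1d(hidden_dim, channels, 4, stride=2, padding=1),
        )

    def forward(self, mel):
        z_e = self.encoder(mel)                      # (B, D, T/4)
        # Nearest-neighbor lookup against the codebook.
        flat = z_e.permute(0, 2, 1)                  # (B, T/4, D)
        dists = torch.cdist(flat, self.codebook.weight[None])
        codes = dists.argmin(-1)                     # (B, T/4) discrete tokens
        z_q = self.codebook(codes).permute(0, 2, 1)  # (B, D, T/4)
        # Straight-through estimator so gradients reach the encoder.
        z_q_st = z_e + (z_q - z_e).detach()
        recon = self.decoder(z_q_st)
        # Commitment term pulls encoder outputs toward their codes;
        # the symmetric term updates the codebook vectors themselves.
        commit = F.mse_loss(z_e, z_q.detach()) + F.mse_loss(z_q, z_e.detach())
        return recon, commit


model = SimpleVQVAE()
opt = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)

mel = torch.randn(4, 80, 64)  # stand-in batch; the real run used batch 8192
recon, commit = model(mel)
loss = F.mse_loss(recon, mel) + 0.25 * commit  # weights 1 and 0.25 as above
loss.backward()
opt.step()
```

Once trained, the encoder and `codes` lookup are frozen and the diffusion model is trained to decode `codes` back to mels, replacing the convolutional decoder shown here.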
QajikHakobyan commented 2 weeks ago

Hi @neonbjb,

First off, thank you for the incredible work on this project! I have a quick question regarding the Mel Spectrograms used in the discrete autoencoder and diffusion training. Could you please clarify what type of normalization is applied to the Mel Spectrograms during these stages?

Thanks in advance!