lmnt-com / diffwave

DiffWave is a fast, high-quality neural vocoder and waveform synthesizer.
Apache License 2.0

help request. trying to figure out how to match up params for TTS to Vocoder. #23

Closed michael-conrad closed 2 years ago

michael-conrad commented 2 years ago

I'm using a fork of https://github.com/Tomiinek/Multilingual_Text_to_Speech as the project https://github.com/CherokeeLanguage/Cherokee-TTS.

The TTS project I'm using shows the audio params below, but I don't know what to change in either the TTS params or the vocoder params to make them match up. I'm guessing hop_samples somehow corresponds to the stft_* settings, but I'm a bit unclear about what I'm looking at. I'm thinking a good start would be to adjust the vocoder settings and train on the domain-specific voices being used for the Tacotron training.

TTS Tacotron Settings

    sample_rate = 22050                  # sample rate of source .wavs, used while computing spectrograms, MFCCs, etc.
    num_fft = 1102                       # number of frequency bins used during computation of spectrograms
    num_mels = 80                        # number of mel bins used during computation of mel spectrograms
    num_mfcc = 13                        # number of MFCCs, used just for MCD computation (during training)
    stft_window_ms = 50                  # size in ms of the Hann window of short-time Fourier transform, used during spectrogram computation
    stft_shift_ms = 12.5                 # shift of the window (or better said gap between windows) in ms   

diffwave Vocoder Settings

# Data params
    sample_rate=22050,
    n_mels=80,
    n_fft=1024,
    hop_samples=256,

sharvil commented 2 years ago

I've listed the mapping between the supplied Tacotron parameters and Diffwave parameters.

Tacotron         Diffwave
sample_rate      sample_rate
num_fft          n_fft
num_mels         n_mels
num_mfcc         N/A
stft_window_ms   hop_samples/sample_rate*1000*4
stft_shift_ms    hop_samples/sample_rate*1000

If you want to change the Diffwave parameters to match Tacotron, you'd let stft_shift_ms=12.5 which means hop_samples/sample_rate*1000=12.5. Solving for hop_samples, you get hop_samples=275.625. Of course, a fractional hop size isn't allowed so you'd have to do whatever the Tacotron code is doing to either round up or down to match.

If you want to change the Tacotron parameters to match Diffwave, you'd let hop_samples=256 which means stft_shift_ms=256/sample_rate*1000 and so stft_shift_ms=11.609977324263038. I'm not sure if the Tacotron code will make rounding errors given this frame shift, but you could try it out and see for yourself.
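For reference, the same arithmetic written out as plain Python (just the unit conversion, nothing repo-specific):

    sample_rate = 22050

    # Tacotron -> Diffwave: a 12.5 ms frame shift implies a fractional hop size
    hop_samples = 12.5 / 1000 * sample_rate       # 275.625

    # Diffwave -> Tacotron: the default hop of 256 implies a non-round frame shift
    stft_shift_ms = 256 / sample_rate * 1000      # 11.609977324263038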

michael-conrad commented 2 years ago

I was thinking of trying the following for the vocoder. How do the numbers look?

# Data params
    sample_rate=22050,
    n_mels=80,
    n_fft=1102,  # 1024,
    hop_samples=275,
    crop_mel_frames=50,  # 62,  # Probably an error in paper.

sharvil commented 2 years ago

That looks reasonable and is worth a shot. I'd do a short training run on a small dataset to make sure the parameters are commensurate with each other.
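One cheap sanity check before a long run is to confirm that the audio length and the mel frame count are consistent with the chosen hop size. A rough sketch (the *.spec.npy naming assumes DiffWave's preprocessing output, and wavs/example.wav is a placeholder):

    import numpy as np
    from scipy.io import wavfile

    hop_samples = 275                                      # proposed hop size

    sr, audio = wavfile.read('wavs/example.wav')           # placeholder path
    spec = np.load('wavs/example.wav.spec.npy')            # assumes DiffWave's preprocess naming
    n_frames = spec.shape[-1]

    # The audio length and n_frames * hop_samples should agree to within about one hop.
    print(len(audio), n_frames * hop_samples, abs(len(audio) - n_frames * hop_samples))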

michael-conrad commented 2 years ago

OK, I switched to using the repo as a submodule in the project https://github.com/CherokeeLanguage/cherokee-diffwave and adjusted the params, but now I'm getting an error:

RuntimeError: The size of tensor a (13750) must match the size of tensor b (12800) at non-singleton dimension 2
python train.py --fp16 --max_steps 5000000 --batch_size 32 models/ wavs/
Epoch 0:   0%|                                                                                                                                                                            | 0/1009 [00:01<?, ?it/s]
Traceback (most recent call last):
  File "train.py", line 53, in <module>
    main(parser.parse_args())
  File "train.py", line 40, in main
    train(args, params)
  File "/home/muksihs/git/cherokee-diffwave/diffwave/src/diffwave/learner.py", line 173, in train
    _train_impl(0, model, dataset, args, params)
  File "/home/muksihs/git/cherokee-diffwave/diffwave/src/diffwave/learner.py", line 164, in _train_impl
    learner.train(max_steps=args.max_steps)
  File "/home/muksihs/git/cherokee-diffwave/diffwave/src/diffwave/learner.py", line 108, in train
    loss = self.train_step(features)
  File "/home/muksihs/git/cherokee-diffwave/diffwave/src/diffwave/learner.py", line 136, in train_step
    predicted = self.model(noisy_audio, t, spectrogram)
  File "/home/muksihs/.conda/envs/cherokee-diffwave/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/muksihs/git/cherokee-diffwave/diffwave/src/diffwave/model.py", line 158, in forward
    x, skip_connection = layer(x, diffusion_step, spectrogram)
  File "/home/muksihs/.conda/envs/cherokee-diffwave/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/muksihs/git/cherokee-diffwave/diffwave/src/diffwave/model.py", line 116, in forward
    y = self.dilated_conv(y) + conditioner
RuntimeError: The size of tensor a (13750) must match the size of tensor b (12800) at non-singleton dimension 2

sharvil commented 2 years ago

This happens because the spectrogram frames are upsampled by a factor of 256, and not 275 (your new hop size). Here's how you can change the upsampling module to go up to a factor of 275:

class SpectrogramUpsampler(nn.Module):
  def __init__(self, n_mels):
    super().__init__()
    self.conv1 = ConvTranspose2d(1, 1, [3, 22], stride=[1, 11], padding=[1, 6], output_padding=[0, 1])
    self.conv2 = ConvTranspose2d(1, 1,  [3, 50], stride=[1, 25], padding=[1, 13], output_padding=[0, 1])
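With (L - 1) * stride - 2 * padding + kernel + output_padding as the transposed-conv output length, those two layers upsample by 11x and then 25x, i.e. 275x overall. A standalone shape check (not part of model.py):

    import torch
    from torch.nn import ConvTranspose2d

    conv1 = ConvTranspose2d(1, 1, (3, 22), stride=(1, 11), padding=(1, 6), output_padding=(0, 1))
    conv2 = ConvTranspose2d(1, 1, (3, 50), stride=(1, 25), padding=(1, 13), output_padding=(0, 1))

    x = torch.zeros(1, 1, 80, 50)       # [batch, channel, n_mels, crop_mel_frames]
    y = conv2(conv1(x))
    print(y.shape)                      # torch.Size([1, 1, 80, 13750]) == 50 frames * 275 samples

That 13750 is exactly the tensor size from the earlier error: 50 cropped mel frames times the new 275-sample hop, versus 50 * 256 = 12800 from the old upsampler.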

michael-conrad commented 2 years ago

OK, I've changed it to the following (using tuples instead of lists to fix an IDE complaint):

diff --git a/src/diffwave/model.py b/src/diffwave/model.py
index 58485e4..4ad5430 100644
--- a/src/diffwave/model.py
+++ b/src/diffwave/model.py
@@ -72,8 +72,8 @@ class DiffusionEmbedding(nn.Module):
 class SpectrogramUpsampler(nn.Module):
   def __init__(self, n_mels):
     super().__init__()
-    self.conv1 = ConvTranspose2d(1, 1, [3, 32], stride=[1, 16], padding=[1, 8])
-    self.conv2 = ConvTranspose2d(1, 1,  [3, 32], stride=[1, 16], padding=[1, 8])
+    self.conv1 = ConvTranspose2d(1, 1, (3, 22), stride=(1, 11), padding=(1, 6), output_padding=(0, 1))
+    self.conv2 = ConvTranspose2d(1, 1, (3, 50), stride=(1, 25), padding=(1, 13), output_padding=(0, 1))

   def forward(self, x):
     x = torch.unsqueeze(x, 1)

So far so good:

batch size = 64, fp16 = True
sample count = 32298, 504 iterations per epoch, 1.26 s/it, GPU mem: 16517 MiB

Still on the first epoch at the moment.

michael-conrad commented 2 years ago

Is there a particular loss value I should be trying to reach? I'm currently on epoch 30, weights last saved at step 15,120, loss 0.1054.

[training progress screenshot]

sharvil commented 2 years ago

The loss value is data-dependent so I can't really say what to expect on your dataset. On the LJSpeech dataset with slightly different parameters than what's in this repo, the loss goes to about 0.015 after ~1M steps.

I recommend running inference every 25 epochs or so, especially early on, to make sure everything is ok.
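For those periodic checks, inference can be scripted along these lines, based on the usage shown in the repo's README (a sketch; the paths are placeholders, and it assumes your forked params.py with the new hop size is the one being imported):

    import numpy as np
    import torch
    import torchaudio
    from diffwave.inference import predict as diffwave_predict

    model_dir = 'models/'                                     # placeholder: where weights.pt is saved
    spec = np.load('wavs/example.wav.spec.npy')               # placeholder .spec.npy from preprocessing
    spectrogram = torch.from_numpy(spec).float().unsqueeze(0) # [N, C, W] per the README

    audio, sample_rate = diffwave_predict(spectrogram, model_dir, fast_sampling=True)
    torchaudio.save('check.wav', audio.cpu(), sample_rate)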

michael-conrad commented 2 years ago

Attached are my results using a random sampling of the training .npy files.

This is at iteration 233,856 (epoch 463), batch size 64, 504 iterations per epoch, after 3 days 12 hours of training.

Train/loss is currently fluctuating between 0.10 and 0.06.

diffwave-hp275-samples.tar.gz

sharvil commented 2 years ago

Sounds like training is progressing as expected. The training loss for this generation of diffusion models has pretty high variance because of the noise schedule sampling procedure so don't let the fluctuation deter you. The model typically improves even when it looks like the loss has flattened out.

Given that you're training a multi-speaker model, I recommend training on all speakers for a large number of iterations/epochs, and then fine-tuning on individual speakers if the multi-speaker model isn't good enough.