Closed michael-conrad closed 2 years ago
I've listed the mapping between the supplied Tacotron parameters and Diffwave parameters.
Tacotron | Diffwave |
---|---|
sample_rate | sample_rate |
num_fft | n_fft |
num_mels | n_mels |
num_mfcc | N/A |
stft_window_ms | hop_samples/sample_rate*1000*4 |
stft_shift_ms | hop_samples/sample_rate*1000 |
If you want to change the Diffwave parameters to match Tacotron, you'd let `stft_shift_ms=12.5`, which means `hop_samples/sample_rate*1000=12.5`. Solving for `hop_samples`, you get `hop_samples=275.625`. Of course, a fractional hop size isn't allowed, so you'd have to do whatever the Tacotron code is doing to either round up or down to match.
If you want to change the Tacotron parameters to match Diffwave, you'd let `hop_samples=256`, which means `stft_shift_ms=256/sample_rate*1000`, and so `stft_shift_ms=11.609977324263038`. I'm not sure if the Tacotron code will make rounding errors given this frame shift, but you could try it out and see for yourself.
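The two conversions above are easy to check with a few lines of Python. This is only a sketch of the arithmetic: the helper names are made up for illustration, and whether the Tacotron code rounds a fractional hop up or down is something you would need to confirm in its source.

```python
import math


def shift_ms_to_hop_samples(stft_shift_ms: float, sample_rate: int) -> float:
    """Frame shift in milliseconds -> hop size in samples (may be fractional)."""
    return stft_shift_ms * sample_rate / 1000.0


def hop_samples_to_shift_ms(hop_samples: int, sample_rate: int) -> float:
    """Hop size in samples -> frame shift in milliseconds."""
    return hop_samples / sample_rate * 1000.0


sample_rate = 22050

# Diffwave matched to Tacotron: 12.5 ms does not land on a whole sample.
exact_hop = shift_ms_to_hop_samples(12.5, sample_rate)
hop_down = math.floor(exact_hop)  # candidate if the Tacotron code rounds down
hop_up = math.ceil(exact_hop)     # candidate if it rounds up
print(exact_hop, hop_down, hop_up)

# Frame shift that each integer candidate actually corresponds to.
for hop in (hop_down, hop_up):
    print(hop, round(hop_samples_to_shift_ms(hop, sample_rate), 4))

# Tacotron matched to Diffwave: keep hop_samples=256 and derive the shift.
print(hop_samples_to_shift_ms(256, sample_rate))
```

Whichever integer hop you pick, both sides need to use the same value, otherwise the spectrogram frames and the audio samples drift apart.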
I was thinking of trying the following for the vocoder. How do the numbers look?
    # Data params
    sample_rate=22050,
    n_mels=80,
    n_fft=1102, # 1024,
    hop_samples=275,
    crop_mel_frames=50, # 62, # Probably an error in paper.
That looks reasonable and is worth a shot. I'd do a short training run on a small dataset to make sure the parameters are commensurate with each other.
OK, I switched to using the repo as a submodule in the project https://github.com/CherokeeLanguage/cherokee-diffwave and adjusted the params, but now I'm getting an error:
RuntimeError: The size of tensor a (13750) must match the size of tensor b (12800) at non-singleton dimension 2
python train.py --fp16 --max_steps 5000000 --batch_size 32 models/ wavs/
Epoch 0: 0%| | 0/1009 [00:01<?, ?it/s]
Traceback (most recent call last):
File "train.py", line 53, in <module>
main(parser.parse_args())
File "train.py", line 40, in main
train(args, params)
File "/home/muksihs/git/cherokee-diffwave/diffwave/src/diffwave/learner.py", line 173, in train
_train_impl(0, model, dataset, args, params)
File "/home/muksihs/git/cherokee-diffwave/diffwave/src/diffwave/learner.py", line 164, in _train_impl
learner.train(max_steps=args.max_steps)
File "/home/muksihs/git/cherokee-diffwave/diffwave/src/diffwave/learner.py", line 108, in train
loss = self.train_step(features)
File "/home/muksihs/git/cherokee-diffwave/diffwave/src/diffwave/learner.py", line 136, in train_step
predicted = self.model(noisy_audio, t, spectrogram)
File "/home/muksihs/.conda/envs/cherokee-diffwave/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/muksihs/git/cherokee-diffwave/diffwave/src/diffwave/model.py", line 158, in forward
x, skip_connection = layer(x, diffusion_step, spectrogram)
File "/home/muksihs/.conda/envs/cherokee-diffwave/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/muksihs/git/cherokee-diffwave/diffwave/src/diffwave/model.py", line 116, in forward
y = self.dilated_conv(y) + conditioner
RuntimeError: The size of tensor a (13750) must match the size of tensor b (12800) at non-singleton dimension 2
This happens because the spectrogram frames are upsampled by a factor of 256, and not 275 (your new hop size). Here's how you can change the upsampling module to go up to a factor of 275:
    class SpectrogramUpsampler(nn.Module):
        def __init__(self, n_mels):
            super().__init__()
            self.conv1 = ConvTranspose2d(1, 1, [3, 22], stride=[1, 11], padding=[1, 6], output_padding=[0, 1])
            self.conv2 = ConvTranspose2d(1, 1, [3, 50], stride=[1, 25], padding=[1, 13], output_padding=[0, 1])
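To see where the 13750 and 12800 in the error come from, the output length of each transposed convolution can be worked out by hand. The sketch below uses the output-size formula from the PyTorch `ConvTranspose2d` documentation, applied to the time axis only; the helper names are illustrative and not part of the Diffwave code.

```python
def conv_transpose_out_len(length, kernel, stride, padding,
                           output_padding=0, dilation=1):
    """Output length of a transposed convolution along one axis."""
    return ((length - 1) * stride - 2 * padding
            + dilation * (kernel - 1) + output_padding + 1)


def upsampled_len(frames, layers):
    """Apply each (kernel, stride, padding, output_padding) layer in turn."""
    for kernel, stride, padding, output_padding in layers:
        frames = conv_transpose_out_len(frames, kernel, stride,
                                        padding, output_padding)
    return frames


# Time-axis settings of the two layers, before and after the change.
original = [(32, 16, 8, 0), (32, 16, 8, 0)]    # 16 * 16 = 256
modified = [(22, 11, 6, 1), (50, 25, 13, 1)]   # 11 * 25 = 275

crop_mel_frames = 50
hop_samples = 275

print(crop_mel_frames * hop_samples)             # audio samples per crop: 13750
print(upsampled_len(crop_mel_frames, original))  # 12800, the mismatched size
print(upsampled_len(crop_mel_frames, modified))  # 13750, matches the audio
```

The audio crop is `crop_mel_frames * hop_samples` samples long, so the upsampler has to expand each frame by exactly `hop_samples` for the addition in the residual layer to line up.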
Ok,
I've changed it to the following (using tuples and not lists to fix IDE complaint):
diff --git a/src/diffwave/model.py b/src/diffwave/model.py
index 58485e4..4ad5430 100644
--- a/src/diffwave/model.py
+++ b/src/diffwave/model.py
@@ -72,8 +72,8 @@ class DiffusionEmbedding(nn.Module):
class SpectrogramUpsampler(nn.Module):
def __init__(self, n_mels):
super().__init__()
- self.conv1 = ConvTranspose2d(1, 1, [3, 32], stride=[1, 16], padding=[1, 8])
- self.conv2 = ConvTranspose2d(1, 1, [3, 32], stride=[1, 16], padding=[1, 8])
+ self.conv1 = ConvTranspose2d(1, 1, (3, 22), stride=(1, 11), padding=(1, 6), output_padding=(0, 1))
+ self.conv2 = ConvTranspose2d(1, 1, (3, 50), stride=(1, 25), padding=(1, 13), output_padding=(0, 1))
def forward(self, x):
x = torch.unsqueeze(x, 1)
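For other hop sizes, the layer settings in this diff appear to follow a simple pattern: kernel width of twice the stride, padding of half the stride rounded up, and an output padding of 1 when the stride is odd. The sketch below is an inference from the values in this thread, not a documented rule of the repo, and the function names are hypothetical. It also assumes you have already factored the hop size into two strides yourself (for example 275 = 11 * 25).

```python
def upsample_layer_params(stride: int):
    """Time-axis (kernel, stride, padding, output_padding) for an exact
    stride-fold upsampling with a transposed convolution."""
    if stride < 2:
        # PyTorch requires output_padding to be smaller than the stride
        # (or dilation), which this recipe cannot satisfy for stride 1.
        raise ValueError("stride must be at least 2")
    kernel = 2 * stride
    padding = (stride + 1) // 2            # ceil(stride / 2)
    output_padding = 2 * padding - stride  # 0 for even strides, 1 for odd
    return kernel, stride, padding, output_padding


def out_len(n, kernel, stride, padding, output_padding):
    """Transposed-convolution output length (dilation 1)."""
    return (n - 1) * stride - 2 * padding + (kernel - 1) + output_padding + 1


for stride in (16, 11, 25):
    params = upsample_layer_params(stride)
    # Each layer should map n frames to exactly n * stride samples.
    print(stride, params, out_len(50, *params))
```

With strides 11 and 25 this reproduces the kernel, padding, and output padding used in the diff, and with stride 16 it reproduces the original layers.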
So far so good:
batch size = 64, fp16 = True
sample count = 32298, 504 iterations per epoch, 1.26 s/it, GPU mem: 16517 MiB
Still on the first epoch at the moment.
Is there a particular loss value I should be trying to reach? I'm currently on epoch 30, weights last saved at step 15,120, loss 0.1054.
The loss value is data-dependent so I can't really say what to expect on your dataset. On the LJSpeech dataset with slightly different parameters than what's in this repo, the loss goes to about 0.015 after ~1M steps.
I recommend running inference every 25 epochs or so, especially early on, to make sure everything is ok.
Attached are my results using a random sampling of the training npy files.
This is at iteration 233,856. [epoch 463]. batch size 64. 504 iters per epoch. 3 days 12 hours.
Train/loss is currently fluctuating between 0.06 and 0.10.
Sounds like training is progressing as expected. The training loss for this generation of diffusion models has pretty high variance because of the noise schedule sampling procedure so don't let the fluctuation deter you. The model typically improves even when it looks like the loss has flattened out.
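One way to read a noisy loss curve like this is to look at a smoothed version rather than the raw per-step values. Below is a minimal sketch of an exponential moving average; the loss values are synthetic noise generated only for the demonstration and are not from any real run, and `alpha` is an arbitrary choice. The smoothing slider in TensorBoard's scalar view does a similar job if you are already logging there.

```python
import random


def ema(values, alpha=0.01):
    """Exponential moving average; smaller alpha means heavier smoothing."""
    smoothed, avg = [], None
    for v in values:
        avg = v if avg is None else (1 - alpha) * avg + alpha * v
        smoothed.append(avg)
    return smoothed


# Synthetic stand-in for a noisy training loss (NOT real data).
random.seed(0)
raw = [0.08 + 0.02 * random.uniform(-1, 1) for _ in range(2000)]
smooth = ema(raw)

# Compare the spread of raw and smoothed values over the last 500 steps.
raw_spread = max(raw[-500:]) - min(raw[-500:])
smooth_spread = max(smooth[-500:]) - min(smooth[-500:])
print(raw_spread, smooth_spread)
```

A slow downward trend that is invisible in the raw values tends to show up much more clearly in the smoothed series.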
Given that you're training a multi-speaker model, I recommend training on all speakers for a large number of iterations/epochs, and then fine-tuning on individual speakers if the multi-speaker model isn't good enough.
I'm using a fork of https://github.com/Tomiinek/Multilingual_Text_to_Speech as the project https://github.com/CherokeeLanguage/Cherokee-TTS.
The TTS project I'm using shows the below for audio params, but I don't know what to change in either the TTS params or the vocoder params to have them match up. I'm guessing `hop_samples` somehow matches up with the `stft_*` settings, but I'm a bit clueless as to what I'm looking at. I'm thinking it would be a good start to adjust the vocoder settings and train on the domain-specific voices being used in the Tacotron training.
TTS Tacotron Settings
diffwave Vocoder Settings