r9y9 / wavenet_vocoder

WaveNet vocoder
https://r9y9.github.io/wavenet_vocoder/

Planned TODOs #1

Closed r9y9 closed 6 years ago

r9y9 commented 6 years ago

This is an umbrella issue to track progress for my planned TODOs. Comments and requests are welcome.

Goal

Model

Training script

Experiments

Misc

Sampling frequency

Advanced (lower priority)

r9y9 commented 6 years ago

At the moment, I think I have finished implementing the basic features (batch/incremental inference, local/global conditioning) and confirmed that an unconditioned WaveNet trained on CMU ARCTIC (~1200 utterances, 16 kHz) can generate speech-like sounds. Audio samples are attached.

step80000.zip

Top: real speech, bottom: generated speech. Only the first sample of the real speech was fed to the WaveNet decoder as the initial input.

step000080000_waveplots

step90000.zip

step000090000_waveplots

geneing commented 6 years ago

For reference, these are other WaveNet projects I know of:

https://github.com/ibab/tensorflow-wavenet
https://github.com/tomlepaine/fast-wavenet - a faster version of the original WaveNet

r9y9 commented 6 years ago

Other projects I know of:

r9y9 commented 6 years ago

Still not quite high quality, but the vocoder conditioned on mel-spectrograms has started to work. Audio samples from a model trained for 10 hours are attached.

step90000.zip step000090000_waveplots

step95000.zip step000095000_waveplots

r9y9 commented 6 years ago

Finished transposed convolution support at https://github.com/r9y9/wavenet_vocoder/commit/8c0b5a9b65150227a5989865257eb2651fd5751f. Started training again.

jamestang0219 commented 6 years ago

Hi, I've already tried using linguistic features as local conditioning features, but I ran into a problem: linguistic features are at the phoneme level, mel-spectrograms are at the frame level, but the local conditioning inputs to WaveNet are at the sample level.

Here is a case: if a phoneme's duration is 0.25 s and the sample rate is 16 kHz, then to create the WaveNet inputs I have to repeat that phoneme's linguistic feature vector int(0.25 * 16000) = 4000 times to get per-sample local features. Do you think this approach is right? How do you handle mel-spectrogram features, given that they are at the frame level?

Thanks for answering me.

jamestang0219 commented 6 years ago

Can WaveNet capture the differences even if many samples share the same local features, as long as its receptive field is wide enough?

r9y9 commented 6 years ago

@jamestang0219 I think you are right. In the paper http://www.isca-speech.org/archive/Interspeech_2017/pdfs/0314.PDF, they use log-f0 and mel-cepstrum as conditioning features and duplicate them to match the time resolution. I also tried this idea and got reasonable results.

r9y9 commented 6 years ago

Latest audio sample attached. Mel-spectrograms are repeated to match the time resolution. See https://github.com/r9y9/wavenet_vocoder/blob/b8ee2ce0ed246344581adf36dd7cb1a9dd1be6a2/audio.py#L39-L40. In this case upsample_factor was always 256.
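For concreteness, the repetition amounts to something like the following (a minimal numpy sketch; the variable names are illustrative, not the exact ones in audio.py):

import numpy as np

hop_size = 256                         # samples per frame = upsample factor
mel = np.random.rand(100, 80)          # (frames, n_mels) conditioning features
c = np.repeat(mel, hop_size, axis=0)   # (frames * 256, n_mels): one row per audio sample
print(c.shape)                         # (25600, 80)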

step70000.zip step000070000_waveplots

jamestang0219 commented 6 years ago

@r9y9 In your source code, do you use transposed convolution to implement upsampling? Have you checked which method is better for upsampling?

r9y9 commented 6 years ago

@jamestang0219 I implemented transposed convolution but haven't had success with it yet. I suspect 256x upsampling is hard to train, especially on the small dataset I'm experimenting with now. The WaveNet authors reported that transposed convolution is better, though.

r9y9 commented 6 years ago

https://github.com/r9y9/wavenet_vocoder/blob/3c9deb1dfd582b87488359c366731b7bce7120d4/hparams.py#L43-L47

For now I am not using transposed convolution.

jamestang0219 commented 6 years ago

@r9y9 May I know your hyperparameters for extracting mel-spectrograms? Is the frame shift 0.0125 s and the frame width 0.05 s? If so, why do you use 256 as the upsample factor instead of sr (16000) * frame_shift (0.0125) = 200? Any tricks here? Forgive me for the many questions :( I also want to reproduce the Tacotron 2 results.

r9y9 commented 6 years ago

@jamestang0219 Hyperparameters for audio feature extraction: https://github.com/r9y9/wavenet_vocoder/blob/3c9deb1dfd582b87488359c366731b7bce7120d4/hparams.py#L19-L28

I use a frame shift of 256 samples, i.e. 256 / 16000 = 16 ms (not 200 samples / 12.5 ms), so the upsample factor matches the hop size in samples.

jamestang0219 commented 6 years ago

@r9y9 Thanks:)

npuichigo commented 6 years ago

@r9y9 I notice that in Tacotron 2, two upsampling layers with transposed convolution are used. But in my WaveNet implementation, it still doesn't work.

r9y9 commented 6 years ago

@npuichigo Could you share what parameters (padding, kernel_size, etc.) you are using? I tried applying 1D transposed convolution with stride=16, kernel_size=16, padding=0 twice to upsample the inputs 256x.

https://github.com/r9y9/wavenet_vocoder/blob/8c0b5a9b65150227a5989865257eb2651fd5751f/wavenet_vocoder/wavenet.py#L105-L112
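For reference, a minimal PyTorch sketch of that configuration (two ConvTranspose1d layers with stride = kernel_size = 16, giving 16 * 16 = 256x upsampling in time). The channel count and shapes here are illustrative, not the exact code in wavenet.py:

import torch
from torch import nn

cin = 80  # number of conditioning channels (e.g. mel bins); illustrative only
upsample = nn.Sequential(
    nn.ConvTranspose1d(cin, cin, kernel_size=16, stride=16, padding=0),
    nn.ConvTranspose1d(cin, cin, kernel_size=16, stride=16, padding=0),
)

c = torch.randn(1, cin, 100)   # (batch, channels, frames)
print(upsample(c).shape)       # torch.Size([1, 80, 25600]) -> 256x longer in time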

npuichigo commented 6 years ago

@r9y9 My parameters are listed below. Because I use a frame shift of 12.5 ms, the upsampling factor is 200.

# Audio
num_mels=80,
num_freq=1025,
sample_rate=16000,
frame_length_ms=50,
frame_shift_ms=12.5,
min_level_db=-100,
ref_level_db=20

# Transposed convolution, 10 * 20 = 200x upsampling along time (TensorFlow)
# lc_batch: presumably (batch, frames, channels) local conditioning features
up_lc_batch = tf.expand_dims(lc_batch, 1)             # (batch, 1, frames, channels)
up_lc_batch = tf.layers.conv2d_transpose(
       up_lc_batch, self.out_channels, (1, 10),
       strides=(1, 10), padding='SAME',
       kernel_initializer=tf.constant_initializer(1.0 / self.out_channels))
up_lc_batch = tf.layers.conv2d_transpose(
       up_lc_batch, self.out_channels, (1, 20),
       strides=(1, 20), padding='SAME',
       kernel_initializer=tf.constant_initializer(1.0 / self.out_channels))
up_lc_batch = tf.squeeze(up_lc_batch, 1)              # (batch, frames * 200, out_channels)

r9y9 commented 6 years ago

https://r9y9.github.io/wavenet_vocoder/

Created a simple project page and uploaded audio samples for speaker-dependent WaveNet vocoder. I'm working on global conditioning (speaker embedding) now.

npuichigo commented 6 years ago

@r9y9 Regarding the upsampling network, I found that 2D transposed convolution works well, while the 1D version generates speech with unnatural prosody, maybe because the 2D transposed convolution considers only local information along the frequency axis.

height_width = 3  # kernel width along the frequency axis
# lc_batch: presumably (batch, frames, n_mels) local conditioning features
up_lc_batch = tf.expand_dims(lc_batch, 3)             # (batch, frames, n_mels, 1)
up_lc_batch = tf.layers.conv2d_transpose(
       up_lc_batch, 1, (10, height_width),
       strides=(10, 1), padding='SAME',
       kernel_initializer=tf.constant_initializer(1.0 / height_width))
up_lc_batch = tf.layers.conv2d_transpose(
       up_lc_batch, 1, (20, height_width),
       strides=(20, 1), padding='SAME',
       kernel_initializer=tf.constant_initializer(1.0 / height_width))
up_lc_batch = tf.squeeze(up_lc_batch, 3)              # (batch, frames * 200, n_mels)

r9y9 commented 6 years ago

@npuichigo Thank you for sharing that! Did you check the output of the upsampling network? Does the upsampling network actually learn to upsample? I mean, did you get a high-resolution mel-spectrogram? I was wondering if I need to add a loss term for upsampling (e.g., MSE between the coarse mel-spectrogram and the 1-shift high-resolution mel-spectrogram), and I'm curious whether it can be learned without an upsampling-specific loss.

npuichigo commented 6 years ago

@r9y9 I think a transposed convolution whose stride equals its kernel size is similar to duplication. As in the picture below, if the kernel is one everywhere, it is exactly duplication. So maybe I need to check the kernel values after training.

padding_no_strides_transposed_test_28
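A quick numeric check of that claim (a standalone PyTorch sketch, not code from this repo): with stride equal to kernel size and an all-ones kernel, the transposed convolution reproduces each input value kernel_size times.

import torch
from torch import nn

conv = nn.ConvTranspose1d(1, 1, kernel_size=4, stride=4, padding=0, bias=False)
conv.weight.data.fill_(1.0)            # all-ones kernel

x = torch.tensor([[[1.0, 2.0, 3.0]]])  # (batch, channels, time)
with torch.no_grad():
    print(conv(x).squeeze())           # tensor([1., 1., 1., 1., 2., 2., 2., 2., 3., 3., 3., 3.])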

r9y9 commented 6 years ago

https://r9y9.github.io/wavenet_vocoder/

Added audio samples for multi-speaker version of WaveNet vocoder.

rishabh135 commented 6 years ago

Hello @r9y9, great work and awesome samples! Would you mind sharing the weights of the wavenet_vocoder network trained on mel-spectrograms with the CMU ARCTIC dataset, without speaker embedding? I would like to compare them with Griffin-Lim reconstruction to see which works better.

r9y9 commented 6 years ago

@rishabh135 Not at all. Here it is: https://www.dropbox.com/sh/b1p32sxywo6xdnb/AAB2TU2DGhPDJgUzNc38Cz75a?dl=0

Note that you have to use exactly the same mel-spectrogram extraction https://github.com/r9y9/wavenet_vocoder/blob/f05e520e644a152f9f35a0e6b4508b3aa16b8a21/audio.py#L66-L69 and the same hyperparameters https://github.com/r9y9/wavenet_vocoder/blob/f05e520e644a152f9f35a0e6b4508b3aa16b8a21/hparams.py#L20-L28

r9y9 commented 6 years ago

Using the transposed convolution below, I can get a good initialization for the upsampling network. Very nice, thanks @npuichigo!

import torch
from torch import nn

kernel_size = 3
padding = (kernel_size - 1) // 2
upsample_factor = 16

# The kernel spans 3 mel bins (frequency) x 16 steps (time); the stride
# upsamples only the time axis. Filling the weights with 1/kernel_size makes
# the layer start out as (frequency-smoothed) duplication.
conv = nn.ConvTranspose2d(1, 1, kernel_size=(kernel_size, upsample_factor),
                          stride=(1, upsample_factor), padding=(padding, 0))
conv.bias.data.zero_()
conv.weight.data.fill_(1.0 / kernel_size)
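Applying it to a batch of mel-spectrograms (the (batch, 1, n_mels, frames) input shape is my assumption), the frequency axis is preserved and time is upsampled 16x:

mel = torch.randn(2, 1, 80, 100)   # (batch, 1, n_mels, frames)
print(conv(mel).shape)             # torch.Size([2, 1, 80, 1600])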

Mel-spectrogram (hop_size = 256)

16x upsampled mel-spectrogram

r9y9 commented 6 years ago

I have added a brief README.

jamestang0219 commented 6 years ago

I tried using mel-spectrograms as the input conditioning features (by duplicating the mel-spectrogram of each frame) to train the WaveNet model on the LJSpeech dataset, but I cannot get reasonable results after 90k steps, even though the loss keeps decreasing.

I found that in your training procedure there is no padding before the first time step, but in the generation procedure there IS padding before the initial value to satisfy the conv layers' receptive field.

Does it receive the right start information during generation?

Here are some logs and waveplots:

Receptive field (samples / ms): 1021 / 46.3038548753
Epoch: 1, Avg_loss: 2.78487607414
Epoch: 2, Avg_loss: 2.33630954099
Epoch: 3, Avg_loss: 2.27544190529
Epoch: 4, Avg_loss: 2.21869832258
Epoch: 5, Avg_loss: 2.10031874228
Epoch: 6, Avg_loss: 1.78190998421
Epoch: 7, Avg_loss: 1.2610225059

(waveplot images attached)

r9y9 commented 6 years ago

I believe the padding for the first time step is handled by nn.Conv1d for both batch and incremental forward computation. It works at least for CMU ARCTIC. https://github.com/r9y9/wavenet_vocoder/blob/39961f5d62ae0d0d338d64719ddf6af0760f3053/wavenet_vocoder/modules.py#L50-L55
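For context, a generic sketch of causal padding with nn.Conv1d (this is the usual trick of left-padding by (kernel_size - 1) * dilation; it is not necessarily the exact code in modules.py):

import torch
from torch import nn

class CausalConv1d(nn.Module):
    def __init__(self, channels, kernel_size, dilation=1):
        super().__init__()
        self.trim = (kernel_size - 1) * dilation
        # nn.Conv1d pads both sides by `trim`...
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=self.trim, dilation=dilation)

    def forward(self, x):
        # ...so dropping the last `trim` frames keeps the convolution causal
        # and the output the same length as the input.
        out = self.conv(x)
        return out[:, :, :-self.trim] if self.trim > 0 else out

x = torch.zeros(1, 64, 1000)
print(CausalConv1d(64, kernel_size=3, dilation=2)(x).shape)  # torch.Size([1, 64, 1000])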

One possible reason I can think of for why you cannot get good results is that the speech samples in LJSpeech have reverberation. This might make it hard to learn long-term dependencies. Maybe you need more channels, layers, etc. I will also try LJSpeech soon.

By the way, I haven't gotten loss values < 1.9, but you seem to be getting loss values < 1.5. How did you get that?

jamestang0219 commented 6 years ago

@r9y9 I don't know whether pre-emphasis and fft_size influence the results. I use pre-emphasis=0.97, fft_size=2048.

And I didn't downsample the LJSpeech waveforms; they are 22050 samples per second.

I'll keep trying on the LJSpeech dataset with different model hyperparameters and architectures, and post good results here.

The average loss is computed by the code below:

while global_epoch < nepochs:
    running_loss = 0.0
    for step, batch in enumerate(data_loader):
        '''
        training procedure here.
        '''
        loss = criterion(y_hat[:, :, :-1, :], y[:, 1:, :], mask=mask)
        print('Step: ' + str(global_step) + ', Loss: ' + str(loss.data[0]))
        running_loss += loss.data[0]
    averaged_loss = running_loss / (len(data_loader))
    global_epoch += 1
    print('Epoch: ' + str(global_epoch) + ', Avg_loss: ' + str(averaged_loss))

And the trends:

Receptive field (samples / ms): 1021 / 46.3038548753
Step: 1, Loss: 5.54383468628
Step: 2, Loss: 5.53908395767
Step: 3, Loss: 5.54337501526
...
Epoch: 1, Avg_loss: 2.78487607414
Step: 13078, Loss: 2.57374691963
Step: 13079, Loss: 2.64829707146
Step: 13080, Loss: 2.41457295418
...
Epoch: 2, Avg_loss: 2.33630954099
Step: 26155, Loss: 2.42479777336
Step: 26156, Loss: 2.46555685997
Step: 26157, Loss: 2.46027398109
...
Epoch: 3, Avg_loss: 2.27544190529
Step: 39232, Loss: 2.3482234478
Step: 39233, Loss: 2.19915151596
Step: 39234, Loss: 2.19278645515
...
Epoch: 4, Avg_loss: 2.21869832258
Step: 52309, Loss: 2.21737265587
Step: 52310, Loss: 2.24765992165
Step: 52311, Loss: 2.32877469063
...
Epoch: 5, Avg_loss: 2.10031874228
Step: 65386, Loss: 1.66409289837
Step: 65387, Loss: 1.55183362961
Step: 65388, Loss: 1.5333313942
...
Epoch: 6, Avg_loss: 1.78190998421
Step: 78463, Loss: 1.16430306435
Step: 78464, Loss: 0.585283100605
Step: 78465, Loss: 1.08736658096
...
Epoch: 7, Avg_loss: 1.2610225059
Step: 91540, Loss: 0.178501829505
Step: 91541, Loss: 0.168980106711
Step: 91542, Loss: 0.86912381649
...
Epoch: 8, Avg_loss: 0.81647025647
Step: 104617, Loss: 0.0704396516085
Step: 104618, Loss: 0.401492923498
Step: 104619, Loss: 0.103701047599
r9y9 commented 6 years ago

Thank you for the information! I will share when I get good results. I'm trying the following hyperparameter changes:

diff --git a/hparams.py b/hparams.py
index 3a0be85..0189c30 100644
--- a/hparams.py
+++ b/hparams.py
@@ -17,7 +17,7 @@ hparams = tf.contrib.training.HParams(
     },

     # Audio:
-    sample_rate=16000,
+    sample_rate=22050,
     silence_threshold=2,
     num_mels=80,
     fft_size=1024,
@@ -28,9 +28,9 @@ hparams = tf.contrib.training.HParams(
     ref_level_db=20,

     # Model:
-    layers=16,
+    layers=20,
     stacks=2,
-    residual_channels=256,
+    residual_channels=512,
     gate_channels=512,  # split into 2 gropus internally for gated activation
     skip_out_channels=256,
     dropout=1 - 0.95,
@@ -67,7 +67,7 @@ hparams = tf.contrib.training.HParams(
     # Loss

     # Training:
-    batch_size=1,
+    batch_size=2,
     adam_beta1=0.9,
     adam_beta2=0.999,
     adam_eps=1e-8,
@@ -81,7 +81,7 @@ hparams = tf.contrib.training.HParams(
     # This is needed for those who don't have huge GPU memory...
     # if both are None, then full audio samples are used
     max_time_sec=None,
-    max_time_steps=20000,
+    max_time_steps=8000,
jamestang0219 commented 6 years ago

In the training procedure, if the length of the waveform is more than max_time_steps, it is cropped randomly by the following code:

if max_time_steps is not None and len(x) > max_time_steps:
    s = np.random.randint(0, len(x) - max_time_steps)
    x, c = x[s:s + max_time_steps], c[s:s + max_time_steps, :]

So the x value at the first time step is not always mulaw_quantize(0). But in the generation procedure, the initial value is always mulaw_quantize(0):

initial_value = mulaw_quantize(0)
print("Intial value:", initial_value)
initial_input = to_categorical(initial_value, num_classes=256).astype(np.float32)

I think that's why the training loss is low but the generation results are poor.

r9y9 commented 6 years ago

I was hoping the edge case doesn't matter. Assuming the receptive field size is 1021, we actually have a zero-padded input of length max_time_steps + 1020: 0, 0, ..., 0, x[0], x[1], ..., x[max_time_steps-1].

From my limited experience, though, the initial value is not very important when we condition the model on external features.

jamestang0219 commented 6 years ago

Thanks, I'll start another experiment using your code:)

r9y9 commented 6 years ago

step37566.zip

At step 37566 I get:

step000037566_waveplots

It seems to be working reasonably.

jamestang0219 commented 6 years ago

@r9y9 Congratulations! But my previous experiments still haven't produced good results. Did you use duplication or transposed convolution to upsample the mel-spectrograms?

I reviewed my code; my preprocessing module is different from yours. Could that cause the bad results? I didn't use lws. Here is my code:

import librosa
import numpy as np

# mulaw_quantize, start_and_end_indices, pre_emphasis, _amp_to_db and
# _normalize are helper functions defined elsewhere (not shown here).
def load_wav_info(sound_file, params):
    pre_emphasis_coeff = params['pre_emphasis']
    wav, sr = librosa.load(sound_file, sr=params['sample_rate'])
    hop_length = int(params['frame_shift'] * sr)

    quantized = mulaw_quantize(wav)
    start, end = start_and_end_indices(quantized, params['silence_threshold'])
    quantized = quantized[start:end]
    wav = wav[start:end]

    y = pre_emphasis(wav, pre_emphasis_coeff)
    D = librosa.stft(y=y, n_fft=params['n_fft'],
                     hop_length=hop_length, win_length=int(sr * params['frame_length']))
    magnitude = np.abs(D)
    filters = librosa.filters.mel(sr, params['n_fft'], n_mels=params['frame_dim'])
    mel = np.dot(filters, magnitude)
    mel = _amp_to_db(mel)
    mel = _normalize(mel, params['min_level_db'])
    mel = np.transpose(mel.astype(np.float32))

    # Trim the quantized audio so it matches the number of mel frames.
    N = mel.shape[0]
    quantized = quantized[:N * hop_length]
    return quantized, mel
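A usage sketch under assumed parameter values (the dict keys follow the function above; the file path is only an example):

params = {'pre_emphasis': 0.97, 'sample_rate': 22050, 'frame_shift': 0.0125,
          'frame_length': 0.05, 'n_fft': 2048, 'frame_dim': 80,
          'silence_threshold': 2, 'min_level_db': -100}
quantized, mel = load_wav_info('LJ001-0001.wav', params)
print(mel.shape, len(quantized))  # (n_frames, 80) and roughly n_frames * hop_length samples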
r9y9 commented 6 years ago

@jamestang0219 I'm using transposed convolutions for upsampling.

Regarding your code, did you implement pad_lr for librosa? My implementation is carefully designed for lws, so you may need to adjust it for librosa. https://github.com/r9y9/wavenet_vocoder/blob/39961f5d62ae0d0d338d64719ddf6af0760f3053/audio.py#L95-L102

eval_step80000.zip

A full-length (~7 sec) eval output is attached. Still not very good, but it works.

jamestang0219 commented 6 years ago

@r9y9 I already removed the padding for librosa. By the way, is there any difference between librosa and lws for extracting mel-spectrograms? I found no frame_width parameter in the lws processor.

r9y9 commented 6 years ago

https://librosa.github.io/librosa/generated/librosa.core.stft.html has a center parameter. If center=True, I believe the input signal is zero-padded. If you don't account for that padding carefully, you may end up with misaligned audio and mel-spectrograms.

As far as I know, because lws is designed for phase reconstruction, it uses careful window normalization for the STFT, while librosa doesn't. However, I don't think that matters for the WaveNet vocoder.
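To illustrate (a standalone librosa sketch, not code from this repo): the same signal yields a different number of STFT frames depending on center, which is exactly where the misalignment can creep in.

import librosa
import numpy as np

y = np.zeros(16000, dtype=np.float32)
D_centered = librosa.stft(y, n_fft=1024, hop_length=256)                # center=True (default)
D_plain = librosa.stft(y, n_fft=1024, hop_length=256, center=False)
print(D_centered.shape[1], D_plain.shape[1])  # 63 vs. 59 frames for this input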

jamestang0219 commented 6 years ago

@r9y9 Hello, using transposed convolution for upsampling the local conditions gives reasonable results, but duplication does not, at least when using librosa.

Some experiment results, all on LJSpeech (waveplot images attached):

(1) transposedConv2d, 2 stacks, 16 layers, receptive field (samples / ms): 1021 / 46.3038548753, after 40k steps:
Step: 45771, Loss: 2.19952297211
Step: 45772, Loss: 1.83744764328
Step: 45773, Loss: 2.57274341583
Epoch: 7, Avg_loss: 0.869552073245

(2) transposedConv2d, 2 stacks, 20 layers, receptive field (samples / ms): 4093 / 185.623582766, after 60k steps:
Step: 65388, Loss: 0.111330501735
Step: 65389, Loss: 0.907182753086
Step: 65390, Loss: 0.0600699409842
Epoch: 10, Avg_loss: 0.263670276677

(3) duplication, 2 stacks, 16 layers, receptive field (samples / ms): 1021 / 46.3038548753, after 140k steps:
Step: 143845, Loss: 0.0284290295094
Step: 143846, Loss: 0.0222870074213
Step: 143847, Loss: 2.19834375381
Epoch: 11, Avg_loss: 0.172115104762

r9y9 commented 6 years ago

@jamestang0219 Nice! It seems transposed convolution is better than duplication as reported in the WaveNet paper.

jamestang0219 commented 6 years ago

@r9y9 I've already tried several combinations of layers and stacks: 12 layers / 2 stacks, 20 layers / 2 stacks, and 24 layers / 4 stacks (the best MOS result in the Tacotron 2 paper). None of them gets results as good as Google's Tacotron 2 demo samples or DeepMind's original WaveNet demo samples, even after 150k+ steps.

The original WaveNet model uses a 256-way classification output, but Tacotron 2 uses a 10-component mixture of logistic distributions. We implement WaveNet with the original method; do you think the model has converged after 150k steps?

r9y9 commented 6 years ago

The new Parallel WaveNet paper reports that they trained the teacher WaveNet for 1,000k steps with batch size 32. We may need to be more patient. Their paper makes no mention of dropout or weight normalization, which we are currently using. There are many design choices I want to try to see how they work.

As for the mixture of logistic distributions, I'm currently working on it. See #5.

jamestang0219 commented 6 years ago

@r9y9 Great! I can't wait to test the mixture of logistic distributions loss!

r9y9 commented 6 years ago

mixture_test_step180000.zip

WIP: samples from #5

mfkfge commented 6 years ago

@r9y9 great !

mfkfge commented 6 years ago

@jamestang0219 I have tried linguistic features with the WaveNet vocoder, the same as Deep Voice 1. We got acceptable results with 20 layers and 64-bit. I think the model converges at about 300k iterations. My learning rate is 1e-3, decayed every 1000 iterations with a factor of 0.998.
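For what it's worth, that schedule can be expressed as a step decay (a minimal PyTorch sketch; the model and optimizer here are placeholders, not the actual training code):

import torch
from torch import nn, optim

model = nn.Linear(80, 256)  # placeholder model
optimizer = optim.Adam(model.parameters(), lr=1e-3)
# Multiply the learning rate by 0.998 every 1000 iterations.
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=1000, gamma=0.998)

for step in range(5000):
    # ... forward / backward / optimizer.step() would go here ...
    scheduler.step()

print(optimizer.param_groups[0]['lr'])  # 1e-3 * 0.998**5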

jamestang0219 commented 6 years ago

@mfkfge Could you please tell me how you extract the linguistic features? Thank you!

jamestang0219 commented 6 years ago

@r9y9 Nice, much better than the 256-way classification.

r9y9 commented 6 years ago

@jamestang0219 Hi, can I ask which model is the best in your experiments https://github.com/r9y9/wavenet_vocoder/issues/1#issuecomment-357841092? I'm currently trying 24 layers / 4 stacks at #5 and want to know which is the best in your case.