nii-yamagishilab / project-CURRENNT-scripts

This repository contains the scripts to use CURRENNT
BSD 3-Clause "New" or "Revised" License

How long does it take to train? #3

Open naba89 opened 4 years ago

naba89 commented 4 years ago

Hi,

I am trying to reproduce your results and am wondering how long it takes to train the nsf-v3 model on a single speaker (CMU US SLT).

I would like to know approximately how many iterations it takes to get intelligible speech. I am training my own implementation, and even after 300k iterations the results are not satisfactory: the generated speech (copy synthesis) sounds robotic and is not entirely intelligible. I am trying to debug and see whether I have done something wrong, but I am not sure how long I need to wait before judging whether the results will become intelligible.

Any help/advice would be really helpful.

Note: I am using a WORLD-based F0 extractor. Does that make much of a difference?

TonyWangX commented 4 years ago

Hi,

  1. What is your definition of "iteration"? Sorry, I am not familiar with the newer terminology. In my case, one training pass over the whole corpus is one epoch. 10 epochs can produce reasonably good speech, and the samples in my git repo were generated after 50 epochs.
  2. What spectral features do you use? In my case, both the Mel-spectrogram (which can be extracted using the scripts in my git repo) and Mel-cepstral coefficients work. If the Mel-spectrogram, what is your configuration? Sampling rate? Frame shift? Frame length? There is no golden configuration, but the configuration should be confirmed.
  3. WORLD is good enough for F0 extraction on CMU SLT.
  4. Could you send me a few synthetic samples?

naba89 commented 4 years ago

HI,

  1. By "iteration" I meant each training batch. In your case, 50 epochs would be 50*1132 iterations for CMU SLT with a batch size of 1.
  2. I am using the Mel-spectrogram as the spectral feature. The configuration is the same as the L1-loss configuration mentioned in the paper (512 bins, 320-sample frame length, and 80-sample frame shift); the sampling rate is 16 kHz. (A minimal extraction sketch follows after this list.)
  3. OK.
  4. I am using 1000 utterances for training and the remainder for inference. I train on 1-second segments with 32 samples per batch. This is a generated sample after 200 epochs: recon_arctic_b0445.zip
  5. After this, I tried to train on just a single utterance to check whether I could get intelligible outputs. This output was a little better: overfit.zip
  6. I am re-implementing the algorithm in PyTorch; I am not using this repository because I am not familiar with CURRENNT.
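
For readers following the thread: below is a minimal sketch of Mel-spectrogram extraction matching the configuration in point 2 (16 kHz, 512-point FFT, 320-sample frames, 80-sample shift). The use of librosa and the 80 Mel bins are my assumptions for illustration, not values confirmed in this thread.

```python
# Minimal sketch (assumptions: librosa; n_mels=80 is illustrative).
import librosa
import numpy as np

# "arctic_b0445.wav" is a hypothetical file name.
y, sr = librosa.load("arctic_b0445.wav", sr=16000)

mel = librosa.feature.melspectrogram(
    y=y,
    sr=sr,
    n_fft=512,       # 512 FFT bins, as in the L1-loss configuration
    win_length=320,  # 20 ms frame length at 16 kHz
    hop_length=80,   # 5 ms frame shift at 16 kHz
    n_mels=80,       # assumed dimensionality; check against the paper
)
log_mel = np.log(np.maximum(mel, 1e-10))  # floored log compression
```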
naba89 commented 4 years ago

Hi,

I have a few questions regarding the architecture.

  1. Do the shaded and non-shaded regions in Fig. 2 of the paper share parameters, or are they completely different layers? That is, do you simply increase the output channels of the CONV layer by 1 and use that extra channel as the cut-off-frequency prediction, or do you have separate BLSTM and CONV layers for the cut-off-frequency prediction?

  2. Do you apply the time domain average smoothing only to the cutoff frequency prediction or to the entire conditioning signal?

  3. Do you use batch normalization anywhere in the network?

  4. Do you do any kind of scaling at any point to ensure that the output waveform is in the range (-1, 1)?

I identified a few problems with my implementation (mainly in the source module and the sinc filtering module) and fixed them, then trained with 3-second segments, a batch size of 10, and a learning rate of 3e-4. I am getting reasonable outputs now, but not as good as yours (recon_arctic_b0443.zip). However, training is still slower than you mentioned; these outputs are after about 90 epochs. I am trying to make sure I have implemented the architecture correctly.
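
For context, here is a minimal sketch of a sine-based source module in the spirit of the NSF paper; the constants (alpha, sigma) follow my reading of the paper, and the assumption that F0 has already been upsampled to the waveform rate is illustrative, not code from this repository.

```python
# Minimal sketch of an NSF-style sine + noise excitation (my own
# simplification; alpha/sigma values follow my reading of the paper).
import numpy as np

def sine_source(f0, sr=16000, alpha=0.1, sigma=0.003):
    """f0: sample-level F0 contour in Hz, with 0 marking unvoiced samples."""
    voiced = f0 > 0
    # Instantaneous phase = cumulative sum of per-sample angular frequency.
    phase = 2.0 * np.pi * np.cumsum(f0 / sr)
    noise = sigma * np.random.randn(len(f0))
    # Voiced: sine plus small noise; unvoiced: scaled noise only.
    return np.where(voiced,
                    alpha * np.sin(phase) + noise,
                    (alpha / (3.0 * sigma)) * noise)
```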

TonyWangX commented 4 years ago

Hi, sorry to hear that.

I can only answer based on what I have tried; depending on your implementation tools, my answers may not apply in your case.

  1. The Bi-LSTM and CONV in the shaded regions can be independent of those in the non-shaded regions; they can also share parameters with the non-shaded BLSTM and CONV. Both work in my own experiments. For the CMU models in this git repo, I simply used one output dimension of the CONV.

  2. Time smoothing is definitely necessary for the cut-off frequency; otherwise, the hard switching of filter coefficients causes artifacts. Smoothing is NOT necessary for the condition signals. I have compared the two in experiments. (A small smoothing sketch follows after this list.)

  3. In your sample recon_arctic_b0445.wav, the harmonic and noise components have quite different DC levels. Therefore, batch normalization may help by forcing the harmonic and noise components to have zero mean. I only tried batch normalization at the output layer of each filter block (i.e., the FF on the right side of Figure 2). Note that batch normalization accelerates optimization, but when generating test sentences with long starting/ending silences, the produced waveform may have a DC offset, since the mean/std of the hidden signal is affected by the silence.

  4. There is no guarantee that the produced waveform lies within (-1, 1).

  5. Regarding the training speed, do you mean the time to run one epoch or the time to train and get good samples?
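
To illustrate point 2, here is a minimal moving-average smoothing sketch; the window length is an assumption for illustration, not a value taken from this repository.

```python
# Minimal sketch: time-domain average smoothing of the predicted
# cut-off-frequency track (window length is illustrative).
import numpy as np

def smooth_cutoff(fc, win=3):
    """Moving average over a frame-level cut-off-frequency sequence."""
    kernel = np.ones(win) / win
    # mode="same" keeps the length; the ends are implicitly zero-padded,
    # which slightly attenuates the first and last frames.
    return np.convolve(fc, kernel, mode="same")
```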

There are many unknown factors that could go wrong. How about first reproducing the results with the simplified NSF in https://arxiv.org/abs/1904.12088? This can be done simply by removing the noise component from your implementation. It may be a good starting point for identifying what is missing and what should be revised.

Or simply visit NII next week, if you live in Tokyo.

Finally, sorry for the trouble. I am on holiday and cannot do more to help.

naba89 commented 4 years ago

Hi,

I am sorry I couldn't reply earlier; I got caught up in coursework. Anyway, I was able to reproduce the paper's results with relatively good-quality outputs, and I want to thank you for your help. Our discussion really helped me figure things out and iron out some of my mistakes.

Hoping to visit NII soon and meet you in person.

Best regards, Nabarun

hyysam commented 4 years ago

Hi @naba89, can you share your PyTorch code? I am also not familiar with CURRENNT and cannot reproduce the results. Thank you!

TonyWangX commented 4 years ago

@naba89 @hyysam I have re-implemented the PyTorch version here, if you would like to try it: https://github.com/nii-yamagishilab/project-NN-Pytorch-scripts