how to decode on CPU?
Hi! Thanks for the repo. I was curious about the performance on CPU. AFAIK, it is 8x real-time on GPU, but could you also share some values about CPU performance?
On my laptop setup (i7), the model as-is has a real-time factor (RTF) of ~1.6 running with the device set to "cpu". One can reduce the generator size to get it below real time; for example, reducing the residual layers from 30 to 10 gives around ~0.46 RTF. However, model convergence is much slower; I'll post an update when more training results come in.
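For reference, a minimal sketch of how an RTF number like the one above can be measured: wall-clock generation time divided by the duration of the generated audio. The single-argument generator call and the 22.05 kHz sampling rate are assumptions, not the repo's exact API.

```python
import time

import torch


def measure_rtf(generator, mel, sample_rate=22050):
    """mel: (1, n_mels, frames) conditioning tensor; returns the real-time factor."""
    start = time.time()
    with torch.no_grad():
        audio = generator(mel)  # assumed to return a waveform tensor
    elapsed = time.time() - start
    duration = audio.numel() / sample_rate  # seconds of generated audio
    return elapsed / duration  # RTF < 1.0 means faster than real time
```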
Great, thanks for the prompt answer.
Hi @erogol and @G-Wang. I measured with an Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz using 16 threads. Here is the result.
```
# I set OMP_NUM_THREADS=16 in path.sh
$ CUDA_VISIBLE_DEVICES="" ./run.sh --stage 3
Stage 3: Network decoding
2019-11-06 08:42:39,818 (decode:91) INFO: the number of features to be decoded = 250.
2019-11-06 08:42:40,028 (decode:105) INFO: loaded model parameters from exp/train_nodev_ljspeech_parallel_wavegan.v1/checkpoint-400000steps.pkl.
[decode]: 100%|██████████| 250/250 [22:16<00:00, 5.35s/it, RTF=0.841]
2019-11-06 09:04:56,697 (decode:129) INFO: finished generation of 250 utterances (RTF = 0.734).
```
If we can use a powerful CPU, it can generate in less than real time.
Compared to the original Parallel WaveNet, Parallel WaveGAN is already small, so I'm curious about the results with layers=10. I'm looking forward to seeing @G-Wang's results!
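For anyone who wants to reproduce this outside run.sh, a hedged sketch of the equivalent in-process knobs for CPU decoding; the thread count mirrors the OMP_NUM_THREADS / CUDA_VISIBLE_DEVICES settings above, and the generator variable is a placeholder for however you load the model.

```python
import torch

# Limit intra-op parallelism to the desired number of CPU threads
# (same intent as exporting OMP_NUM_THREADS before calling run.sh).
torch.set_num_threads(16)

device = torch.device("cpu")  # decode on CPU regardless of GPU availability
# generator = generator.to(device).eval()  # move your loaded model before decoding
```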
@kan-bayashi I am curious to try the new quantization feature in PyTorch 1.3.
@rishikksh20 I've never tried that feature, but it sounds interesting. Is it like fp16 inference with Apex?
Not exactly. It reduces precision from fp32 to INT8, but it only supports CPUs. There is also a new feature called quantization-aware training. It speeds up inference by up to 4x and also reduces the model size, so models can easily be deployed anywhere from a mobile CPU to a Raspberry Pi.
More here: https://pytorch.org/docs/stable/quantization.html
Check the quantize model performance here: https://github.com/opencv/openvino_training_extensions/tree/develop/pytorch_toolkit/nncf
Note: although the inference speed of the quantized model is excellent, the sound quality definitely degrades.
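For anyone who wants to try it, a minimal, hedged sketch of the PyTorch 1.3+ post-training dynamic quantization API. Note that quantize_dynamic only swaps Linear/LSTM-style modules for INT8 kernels, so a conv-heavy generator like Parallel WaveGAN would likely need static or quantization-aware quantization instead; the stand-in model below is only there to show the shape of the call.

```python
import torch
import torch.nn as nn

# Stand-in model (not the real generator) used only to demonstrate the API.
model = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 80)).eval()

# Post-training dynamic quantization: weights stored as INT8, activations
# quantized on the fly at inference time. CPU only.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

with torch.no_grad():
    out = quantized(torch.randn(4, 80))  # same forward API as the fp32 model
print(quantized)  # the Linear layers are now DynamicQuantizedLinear modules
```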
Can we also try batching the signal for faster inference? I think it is even more applicable here since this model has no time dependency.
@rishikksh20 Thank you for the info. That is interesting. I will consider implementing the option.
@erogol Yes, that is more straightforward. Actually, I already tried it, but it consumed too much memory, so a large batch size could not be used. I will re-check the implementation.
Oh, I remembered why I gave up on batch inference. When we perform batch inference, we need padding, but the generator is a non-causal network, so the padded part is taken into account. I could not come up with a way to mask out the padded part, so I simplified the inference.
Do you have any idea about masking?
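In case it helps the discussion, a hedged sketch of the naive approach: pad every mel to the batch maximum and simply trim each generated waveform back to its true length afterwards. This does not mask the padded frames inside the non-causal generator (the receptive field near each item's end still sees padding); it only removes the padded samples from the output. The single-argument generator call and hop_size=256 are assumptions.

```python
import torch
import torch.nn.functional as F


def batch_infer(generator, mels, hop_size=256):
    """mels: list of (n_mels, T_i) tensors -> list of 1-D waveforms."""
    lengths = [m.size(1) for m in mels]
    max_len = max(lengths)
    # Zero-pad every mel along time so the whole list stacks into one batch.
    batch = torch.stack([F.pad(m, (0, max_len - m.size(1))) for m in mels])
    with torch.no_grad():
        wavs = generator(batch)  # assumed shape: (B, 1, max_len * hop_size)
    # Cut each waveform back to its own (unpadded) duration.
    return [w.squeeze(0)[: length * hop_size] for w, length in zip(wavs, lengths)]
```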
@erogol Do you mean batched synthesis for WaveRNN like here: https://github.com/fatchord/WaveRNN/blob/12922a780f0d65a4572f6de27f33ec8f3189cfe8/models/fatchord_version.py#L295.
Batching helped with WaveRNN inference because it was autoregressive, and by splitting a mel spectrogram into pieces we cut the autoregressive synthesis time down by the batch-size factor.
However, I'm not sure how much it would help for this model, as everything is synthesised in one pass regardless of whether it's batched or not. I guess the one advantage here is that we would not have issues stitching the batches together due to boundary artifacts like we had in WaveRNN.
@G-Wang How was your result with layers=10?
I haven't tried this model yet (although I'm looking forward to when I get time), but with MelGAN I got an inference speed of ~6500kHz on GPU and ~40kHz on CPU when I concatenated a couple of sentences together and ran it through those batching/folding functions. It was trivial to implement as well - just a copy/paste job with minor edits. I would imagine it would be just as easy to try with WaveGAN.
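For concreteness, here is a hedged sketch (not the WaveRNN code itself) of the folding/cross-fading idea described above: chop a long conditioning sequence into overlapping chunks, run the chunks through the vocoder as one batch, then cross-fade the chunk boundaries back together. The chunk/overlap sizes, hop size, and generator output shape are all assumptions.

```python
import torch
import torch.nn.functional as F


def fold(mel, chunk=200, overlap=20):
    """mel: (n_mels, T) -> (num_chunks, n_mels, chunk + 2*overlap); neighbours share `overlap` frames."""
    _, total = mel.size()
    step = chunk + overlap
    num_chunks = max(1, -(-(total - overlap) // step))  # ceil division
    mel = F.pad(mel, (0, num_chunks * step + overlap - total))  # pad the time axis
    return torch.stack([mel[:, i * step: i * step + chunk + 2 * overlap]
                        for i in range(num_chunks)])


def xfade_unfold(wavs, chunk=200, overlap=20, hop=256):
    """wavs: (num_chunks, samples) -> one 1-D waveform, linearly cross-faded at the seams."""
    c, o = chunk * hop, overlap * hop
    fade_in = torch.linspace(0.0, 1.0, o)
    out = torch.zeros((len(wavs) - 1) * (c + o) + c + 2 * o)
    for i, w in enumerate(wavs):
        w = w.clone()
        if i > 0:
            w[:o] *= fade_in  # fade in the leading overlap
        if i < len(wavs) - 1:
            w[-o:] *= (1.0 - fade_in)  # fade out the trailing overlap
        out[i * (c + o): i * (c + o) + c + 2 * o] += w
    return out


# Usage sketch: chunks = fold(mel); wavs = generator(chunks).squeeze(1); audio = xfade_unfold(wavs)
# (the tail includes padding, so trim to the true length, e.g. audio[: T * hop], if needed)
```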
@kan-bayashi The model convergence was very slow, so I've moved my compute to other things.
I've modified the models so that the adversarial loss doesn't kick in until after these steps:
Default model (30 residual layers, 3 stacks): 100k steps, mag loss ~0.75, spec loss ~0.38
Smallest model (12 residual layers, 3 stacks): 300k steps, mag loss ~0.78, spec loss ~0.70
Medium model (15 residual layers, 3 stacks): 180k steps, mag loss ~0.79, spec loss ~0.62
Final audio quality for the smallest and medium models is much worse than the baseline model at 100k steps.
The speed-up for the small and medium models is roughly in proportion to their size relative to the full model: full model at ~1.6 RTF, smallest at ~0.5 RTF, medium at ~0.8 RTF.
Let me know if you want the checkpoints, logs, etc.
I stopped training both because the loss curves had plateaued. I adjusted the learning rate a bit, but it did not seem to help much.
@fatchord Thank you for your suggestions! It looks interesting. Is just concatenating the sequences OK? Did you put silence between the two sequences?
@G-Wang Thank you for your valuable report! This will help us. From your results, it seems difficult to shrink the network to accelerate inference. Again, the default setting is already much smaller than Parallel WaveNet, so maybe the authors of Parallel WaveGAN already investigated the network size.
I didn't explicitly put in any silence (although you could if you wanted I guess).
I tried splitting the input and batching the chunks for inference, but the gain was negligible on my system.
Basically, I split the input, run the chunks in parallel as a single batch, and concatenate the outputs.
@erogol, how small was the gain you got? I'm guessing you ran on GPU.
@rishikksh20 have you attempted quantization of the model with PyTorch, and can you share results?
Perhaps someone here has also tried pruning techniques or acceleration frameworks (e.g. TVM) and can share their results?
Thank you
@fatchord Hi, for the batch inference, do you mean concatenating several sentences into one long sentence and applying 'fold with overlap' to the long mel spectrogram? So it is effectively the same as combining mel spectrograms from different sentences to form a batch array (if they all have the same length)?