kan-bayashi / ParallelWaveGAN

Unofficial Parallel WaveGAN (+ MelGAN & Multi-band MelGAN & HiFi-GAN & StyleMelGAN) with Pytorch
https://kan-bayashi.github.io/ParallelWaveGAN/
MIT License

How is the runtime on CPU? #28

Closed · erogol closed this issue 4 years ago

erogol commented 5 years ago

Hi! Thanks for the repo. I was curious about the performance on CPU. AFAIK it is 8x real-time on GPU, but could you also share some numbers for CPU performance?

G-Wang commented 5 years ago

On my laptop setup (i7), the model as-is has a real-time factor of ~1.6 when running with the device set to "cpu". One can reduce the generator size to get it below real time; for example, reducing the residual layers from 30 to 10 gives around ~0.46 RTF. However, model convergence is much slower. I'll post an update when more training results come in.
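For context, RTF (real-time factor) here is wall-clock synthesis time divided by the duration of the generated audio, so values below 1.0 are faster than real time. A minimal sketch of how such a number can be measured, assuming a `vocoder` callable that maps a mel tensor to a waveform (placeholder names, not this repo's exact API):

```python
# Minimal RTF measurement sketch. `vocoder`, `mel`, and the sampling rate are
# placeholders/assumptions, not this repository's actual interface.
import time
import torch

def measure_rtf(vocoder, mel, sampling_rate=22050):
    vocoder.eval()
    with torch.no_grad():
        start = time.time()
        audio = vocoder(mel)            # assumed to return a waveform tensor
        elapsed = time.time() - start
    duration = audio.numel() / sampling_rate
    return elapsed / duration           # RTF < 1.0 means faster than real time
```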

erogol commented 5 years ago

Great, thanks for the prompt answer.

kan-bayashi commented 5 years ago

Hi @erogol and @G-Wang. I measured it with an Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz using 16 threads. Here is the result.

# I set OMP_NUM_THREADS=16 in path.sh
$ CUDA_VISIBLE_DEVICES="" ./run.sh --stage 3
Stage 3: Network decoding
2019-11-06 08:42:39,818 (decode:91) INFO: the number of features to be decoded = 250.
2019-11-06 08:42:40,028 (decode:105) INFO: loaded model parameters from exp/train_nodev_ljspeech_parallel_wavegan.v1/checkpoint-400000steps.pkl.
[decode]: 100%|██████████| 250/250 [22:16<00:00,  5.35s/it, RTF=0.841]
2019-11-06 09:04:56,697 (decode:129) INFO: finished generation of 250 utterances (RTF = 0.734).

If we can use a powerful CPU, generation can run faster than real time (RTF < 1).

Compared to the original Parallel WaveNet, Parallel WaveGAN is already small, so I'm curious about the results with layers=10. I'm looking forward to @G-Wang's results!

rishikksh20 commented 5 years ago

@kan-bayashi I am curious to try the new quantization feature in PyTorch 1.3.

kan-bayashi commented 5 years ago

@rishikksh20 I've never tried that feature, but it sounds interesting. Is it like fp16 inference with Apex?

rishikksh20 commented 5 years ago

Not exactly. It does reduce precision from fp32 to INT8, but it only supports CPUs. There is also a new feature called quantization-aware training. It speeds up inference by up to 4x and also reduces the model size, so the model can easily be deployed anywhere from mobile CPUs to a Raspberry Pi. More here: https://pytorch.org/docs/stable/quantization.html. Check quantized model performance here: https://github.com/opencv/openvino_training_extensions/tree/develop/pytorch_toolkit/nncf. Note: although the inference speed of the quantized model will be excellent, the sound quality definitely degrades.
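For reference, a minimal sketch of the post-training dynamic quantization API mentioned above, assuming PyTorch >= 1.3. The model below is a stand-in, not this repo's generator; note that dynamic quantization only covers layers such as `nn.Linear`/`nn.LSTM`, so a convolution-heavy vocoder would instead need static or quantization-aware quantization with calibration data.

```python
# Post-training dynamic quantization sketch (assumes PyTorch >= 1.3).
# Converts the weights of supported layers (e.g. nn.Linear, nn.LSTM) to INT8 on CPU.
import torch
import torch.nn as nn

# Stand-in model for illustration only; not the repo's generator.
model = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 1)).eval()

quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

with torch.no_grad():
    mel = torch.randn(1, 100, 80)   # dummy mel-spectrogram frames
    audio = quantized(mel)          # INT8 weights, fp32 activations
print(audio.shape)                  # torch.Size([1, 100, 1])
```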

erogol commented 5 years ago

Can we also try batching the signal for faster inference? I think it is even more applicable here, since this model has no time dependency.

kan-bayashi commented 5 years ago

@rishikksh20 Thank you for the info. That is interesting. I will consider implementing the option.

@erogol Yes, that is more straightforward. Actually, I already tried it, but it consumes too much memory, so a large batch size cannot be used. I will re-check the implementation.

kan-bayashi commented 5 years ago

Oh, I remembered why I gave up on batch inference. When we perform batch inference, we need padding, but the generator is a non-causal network, so the padded part is taken into account. I could not come up with a way to mask out the padded part, so I simplified the inference.

Do you have any idea about masking?
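To make the problem concrete, here is a naive sketch of batched inference over variable-length utterances; `vocoder`, `hop_size`, and the input/output shapes are assumptions. Padding to a common length and trimming the outputs afterwards is easy, but as noted above the non-causal receptive field near the end of each utterance still sees the padded frames, so trimming alone does not mask their effect.

```python
# Naive batched decoding sketch: pad mels to a common length, decode once, then trim
# each waveform to its true length. Placeholder names and assumed tensor shapes.
import torch
from torch.nn.utils.rnn import pad_sequence

def naive_batch_decode(vocoder, mels, hop_size=256):
    """mels: list of (frames_i, n_mels) tensors -> list of waveforms."""
    lengths = [m.size(0) for m in mels]
    batch = pad_sequence(mels, batch_first=True)           # (B, T_max, n_mels)
    with torch.no_grad():
        audio = vocoder(batch.transpose(1, 2)).squeeze(1)  # assume (B, T_max * hop_size)
    # Trimming removes padded samples, but samples near the end of each utterance may
    # still be affected by padding inside the non-causal receptive field.
    return [audio[i, : lengths[i] * hop_size] for i in range(len(mels))]
```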

G-Wang commented 5 years ago

@erogol Do you mean batched synthesis for WaveRNN like here: https://github.com/fatchord/WaveRNN/blob/12922a780f0d65a4572f6de27f33ec8f3189cfe8/models/fatchord_version.py#L295?

Batching helped with WaveRNN inference because it was autoregressive, and by splitting a mel-spectrogram into pieces we cut the autoregressive synthesis time by the batch-size factor.

However, I'm not sure how much it would help for this model, as everything is synthesised in one pass regardless of whether it is batched or not. I guess the one advantage here is that we would not have issues stitching the batches together due to boundary artifacts, like we had in WaveRNN.

kan-bayashi commented 5 years ago

@G-Wang How was your result with layers=10?

fatchord commented 5 years ago

> @erogol Do you mean batched synthesis for WaveRNN like here: https://github.com/fatchord/WaveRNN/blob/12922a780f0d65a4572f6de27f33ec8f3189cfe8/models/fatchord_version.py#L295?
>
> Batching helped with WaveRNN inference because it was autoregressive, and by splitting a mel-spectrogram into pieces we cut the autoregressive synthesis time by the batch-size factor.
>
> However, I'm not sure how much it would help for this model, as everything is synthesised in one pass regardless of whether it is batched or not. I guess the one advantage here is that we would not have issues stitching the batches together due to boundary artifacts, like we had in WaveRNN.

I haven't tried this model yet (although I'm looking forward to it when I get time), but with MelGAN I got an inference speed of ~6500 kHz on GPU and ~40 kHz on CPU when I concatenated a couple of sentences together and ran them through those batching/folding functions. It was trivial to implement as well, just a copy/paste job with minor edits. I would imagine it would be just as easy to try with WaveGAN.
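For illustration, a rough sketch of that fold/crossfade idea applied to a parallel vocoder. This is not the WaveRNN implementation; `vocoder`, `hop_size`, and the assumed shapes are placeholders. The idea: split one long mel-spectrogram into overlapping chunks, decode them as a single batch, then crossfade the overlaps back into one waveform.

```python
# Fold/crossfade batching sketch (assumptions: vocoder takes (B, n_mels, T) and returns
# (B, 1, T * hop_size); names are placeholders, not this repo's API).
import torch

def folded_decode(vocoder, mel, chunk=200, overlap=20, hop_size=256):
    """mel: (frames, n_mels) -> waveform (samples,)"""
    step = chunk - overlap
    chunks = [mel[i:i + chunk] for i in range(0, max(1, mel.size(0) - overlap), step)]
    pad = chunk - chunks[-1].size(0)                       # pad the last chunk to full size
    if pad > 0:
        chunks[-1] = torch.nn.functional.pad(chunks[-1], (0, 0, 0, pad))
    batch = torch.stack(chunks)                            # (B, chunk, n_mels)
    with torch.no_grad():
        audio = vocoder(batch.transpose(1, 2)).squeeze(1)  # assume (B, chunk * hop_size)
    # Crossfade overlapping regions back into a single waveform.
    xfade = overlap * hop_size
    fade_in = torch.linspace(0.0, 1.0, xfade)
    out = audio[0].clone()
    for seg in audio[1:]:
        out[-xfade:] = out[-xfade:] * (1 - fade_in) + seg[:xfade] * fade_in
        out = torch.cat([out, seg[xfade:]])
    return out[: mel.size(0) * hop_size]                   # drop samples from the padding
```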

G-Wang commented 5 years ago

@kan-bayashi The model convergence was very slow, so I've moved my compute to other things.

I modified the models so that the adversarial loss doesn't kick in until after these steps:

- Default model (30 residual layers, 3 stacks): 100k steps, mag loss ~0.75, spec loss ~0.38
- Smallest model (12 residual layers, 3 stacks): 300k steps, mag loss ~0.78, spec loss ~0.70
- Medium model (15 residual layers, 3 stacks): 180k steps, mag loss ~0.79, spec loss ~0.62

Final audio quality for the smallest and medium models is much worse than the baseline model at 100k steps.

The speedups for the small and medium models are roughly proportional to their size relative to the full model: full model at ~1.6 RTF, smallest at ~0.5 RTF, medium at ~0.8 RTF.

Let me know if you want the checkpoints, logs, etc.

I stopped training both because the loss curves had plateaued. I adjusted the learning rate a bit, but it didn't seem to help much.

kan-bayashi commented 5 years ago

@fatchord Thank you for your suggestions! It looks interesting. Is just concatenating the sequences OK? Did you put silence between the two sequences?

@G-Wang Thank you for your valuable report! This will help us. From your results, it seems difficult to make the network smaller to accelerate inference. Again, the default setting is already much smaller than Parallel WaveNet, so maybe the authors of Parallel WaveGAN already investigated the size of the network.

fatchord commented 5 years ago

> @fatchord Thank you for your suggestions! It looks interesting. Is just concatenating the sequences OK? Did you put silence between the two sequences?

I didn't explicitly put in any silence (although you could if you wanted, I guess).

erogol commented 5 years ago

I tried splitting the input and batching the pieces for inference, but the gain is negligible on my system.

Basically, I split the input, run the pieces in parallel as a single batch, and concatenate the outputs.

kwanUm commented 4 years ago

@erogol, how small was the gain you got? I'm guessing you ran on GPU.

kwanUm commented 4 years ago

@rishikksh20 Have you attempted quantization of the model with PyTorch, and can you share results?

Perhaps someone here has also tried pruning techniques or acceleration frameworks (e.g. TVM) and can share their results?

Thank you

LuoDQ commented 4 years ago

> I haven't tried this model yet (although I'm looking forward to it when I get time), but with MelGAN I got an inference speed of ~6500 kHz on GPU and ~40 kHz on CPU when I concatenated a couple of sentences together and ran them through those batching/folding functions. It was trivial to implement as well, just a copy/paste job with minor edits. I would imagine it would be just as easy to try with WaveGAN.

@fatchord Hi, for the batch inference, do you mean concatenating several sentences into one long sentence and applying 'fold with overlap' to the long mel-spectrogram? So actually, it is the same as combining mel-spectrograms from different sentences to form a batch array (if they all have the same length)?

ben-8878 commented 3 years ago

How can I decode on CPU?