Hi, yes, that's right. I'm trying to get real-time vocoding. I tried the sparse RNN implementation (as in the original paper): I opened a pull request on DyNet, implemented a runtime version that uses sparse MKL multiplications, and so on. It cut the sampling time by a factor of three, but it was nowhere near real time. So I started working on a different vocoder.
With these lines:
# map each of the 256 output classes to an amplitude in [-1, 1)
amax_vect = dy.reshape(dy.inputVector([(ii / 128) - 1.0 for ii in range(256)]), (1, 256))
# the hard argmax (a one-hot vector, no gradient) times the amplitude vector selects that class's amplitude
networks_output.append(amax_vect * dy.argmax(softmax_outputs[-1], gradient_mode="zero_gradient"))
I actually got sampling to work on the DyNet side (CPU or GPU), and I managed to synthesize speech faster than real time (it takes 7 seconds to synthesize 8 seconds of audio).
The results are not good enough yet, but I have the feeling that this might actually work. I'm training on the LJ Speech corpus.
I see. Are you looking to do the synthesis on CPU only then?
Also, for vocoder generation, how are you doing it? Are you generating one sentence at a time? If so, have you tried splitting your mel spectrogram into smaller parts, batching them up, generating the batch, and then stitching the batch of wavs back into a single wav?
I'm working on a WaveRNN vocoder in PyTorch, and I can cut my generation time to close to real time (https://github.com/G-Wang/WaveRNN-Pytorch/blob/a9860b8185ce89359bca9363953f27ab65e02700/model.py#L223) using the batch generation described above.
The stitched wav sounds very close to a wav generated in one go. I think more speedup can be achieved by using a larger batch, having overlapping mels in the batch, etc.
For context, on my machine (1060 Ti, 16 GB RAM) I get about 2,000 samples/second without batching. With batching I can get up to roughly 8,000-9,000 samples/second without noticeable loss of audio quality after stitching.
Of course, I'm still testing this out at the moment.
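For readers following along, here is a minimal sketch of the split/batch/stitch idea described above. The batch_generate method, the chunking strategy, and the hop_length default are assumptions for illustration only; the actual WaveRNN-Pytorch API may differ.

import numpy as np

def batched_vocode(model, mel, n_chunks=4, hop_length=275):
    # mel: numpy mel spectrogram of shape [80 x T]; split it into n_chunks along the time axis.
    chunks = np.array_split(mel, n_chunks, axis=1)
    # Generate all chunks in parallel as a single batch (batch_generate is hypothetical).
    wav_chunks = model.batch_generate(chunks)
    # A chunk of T_i frames yields roughly T_i * hop_length samples; stitching here is a plain
    # concatenation (overlapping mels plus a short crossfade could hide any seams).
    return np.concatenate(wav_chunks)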
I think I'm going to give your vocoder a try. I have a 1080 Ti, so maybe it will reach real-time synthesis. Do you have any pretrained models?
Yes, they're on the README page (https://github.com/G-Wang/WaveRNN-Pytorch).
I haven't updated the repo with an explicit synthesis function yet, but you can test it out by loading the model (https://github.com/G-Wang/WaveRNN-Pytorch/blob/a9860b8185ce89359bca9363953f27ab65e02700/train.py#L220) and calling the generate function with a numpy mel spectrogram of shape [80 x mel_length], i.e. similar to the evaluate_model function here: https://github.com/G-Wang/WaveRNN-Pytorch/blob/a9860b8185ce89359bca9363953f27ab65e02700/train.py#L95
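A rough usage sketch based on the description above; the build_model import, the checkpoint key name, and the file paths are assumptions, so check the repo's train.py for the exact names.

import numpy as np
import torch
from model import build_model  # assumed import, mirroring the repo's train.py

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = build_model().to(device)
# Load a pretrained checkpoint from the README (the "state_dict" key is an assumption).
checkpoint = torch.load("path/to/checkpoint.pth", map_location=device)
model.load_state_dict(checkpoint["state_dict"])
model.eval()

# mel: numpy mel spectrogram of shape [80 x mel_length], e.g. from the repo's preprocessing.
mel = np.load("path/to/utterance_mel.npy")
with torch.no_grad():
    wav = model.generate(mel)  # same call pattern as evaluate_model in train.py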
Hi, I'm trying to understand your Beecoder vocoder: is it just an MLP with a fixed lookback window?