tiberiu44 / TTS-Cube

End-2-end speech synthesis with recurrent neural networks
https://tiberiu44.github.io/TTS-Cube/
Apache License 2.0

What is BeeCoder? #13

Closed. G-Wang closed this issue 5 years ago.

G-Wang commented 6 years ago

Hi, I'm trying to understand your BeeCoder vocoder. Is it just an MLP with a fixed lookback window?

tiberiu44 commented 6 years ago

Hi, yes, that's right. I'm trying to get real-time vocoding. I tried the sparse RNN implementation (as in the original paper): I opened a pull request on DyNet, implemented a runtime version that uses sparse MKL multiplications, and so on. That cut sampling time by a factor of three, but it was nowhere near real time. So, I started working on a different vocoder.

With these lines:

import dynet as dy  # DyNet Python bindings
# Map the one-hot argmax over the 256 quantization levels to a sample value in [-1, 1).
amax_vect = dy.reshape(dy.inputVector([(ii / 128.0) - 1.0 for ii in range(256)]), (1, 256))
networks_output.append(amax_vect * dy.argmax(softmax_outputs[-1], gradient_mode="zero_gradient"))

I actually got sampling to work on the DyNet side (CPU or GPU), and I managed to synthesize speech faster than real time (it takes 7 seconds to synthesize 8 seconds of audio).

The results are not good enough yet, but I have the feeling that this might actually work. I'm training on the LJ Speech corpus.
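(For illustration only: a minimal sketch of the "MLP with a fixed lookback window" design confirmed at the top of the thread. The layer sizes, lookback length, and conditioning input are invented, not the actual BeeCoder code.)

import torch
import torch.nn as nn

class LookbackMLP(nn.Module):
    def __init__(self, lookback=64, cond_dim=80, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(lookback + cond_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 256))  # logits over 256 quantized sample values
    def forward(self, prev_samples, cond):
        # prev_samples: [batch, lookback] past samples; cond: [batch, cond_dim] conditioning frame
        return self.net(torch.cat([prev_samples, cond], dim=-1))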

G-Wang commented 6 years ago

I see. Are you looking to do the synthesis on CPU only then?

Also, for vocoder generation, how are you doing it? Are you generating one sentence at a time? If so, have you tried splitting your mel spectrogram into smaller parts, batching those up, generating the batch, and then stitching the batch of wavs into a single wav?

I'm working on a WaveRNN vocoder in PyTorch, and I can cut my generation time to close to real time (https://github.com/G-Wang/WaveRNN-Pytorch/blob/a9860b8185ce89359bca9363953f27ab65e02700/model.py#L223) with the batch generation described above.

The stitched wav sounds very close to a wav generated in one go. I think further speedups are possible by using a larger batch, having overlapping mels in the batch, etc.
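(A minimal sketch of this chunk/batch/stitch idea, assuming a generic vocoder: generate_batch stands in for the model's batched generation call, and the chunk, overlap, and hop values are illustrative.)

import numpy as np

def batched_synthesis(mel, generate_batch, chunk=100, overlap=10, hop=275):
    # Split the [80 x T] mel into overlapping chunks along the time axis.
    starts = list(range(0, max(1, mel.shape[1] - overlap), chunk - overlap))
    chunks = [mel[:, s:s + chunk] for s in starts]
    # Pad the last chunk so every batch entry has the same length.
    pad = chunk - chunks[-1].shape[1]
    if pad > 0:
        chunks[-1] = np.pad(chunks[-1], ((0, 0), (0, pad)), mode="edge")
    # One batched vocoder call instead of one long sequential pass.
    wavs = generate_batch(np.stack(chunks))  # expected shape: [batch, chunk * hop]
    # Stitch: crossfade the overlapping samples so the seams are inaudible.
    n = overlap * hop
    fade = np.linspace(0.0, 1.0, n)
    out = wavs[0].copy()
    for w in wavs[1:]:
        out[-n:] = out[-n:] * (1.0 - fade) + w[:n] * fade
        out = np.concatenate([out, w[n:]])
    return out  # note: the padded tail of the last chunk is not trimmed in this sketch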

G-Wang commented 6 years ago

For context, on my machine (1060 Ti, 16 GB RAM) I get about 2,000 samples/second without batching. With batching I can get up to roughly 8,000-9,000 samples/second without audible loss of quality after stitching.

Of course, I'm still testing this out at the moment.

tiberiu44 commented 6 years ago

I think I'm going to give your vocoder a try. I have a 1080 Ti, so maybe it will reach real-time synthesis. Do you have any pretrained models?

G-Wang commented 6 years ago

Yes, they're on the README page (https://github.com/G-Wang/WaveRNN-Pytorch).

I haven't updated the repo with an explicit synthesis function yet, but you can test it by loading the model (https://github.com/G-Wang/WaveRNN-Pytorch/blob/a9860b8185ce89359bca9363953f27ab65e02700/train.py#L220) and then calling the generate function with a NumPy mel spectrogram of shape [80 x mel_length], i.e., similar to the evaluate_model function here: https://github.com/G-Wang/WaveRNN-Pytorch/blob/a9860b8185ce89359bca9363953f27ab65e02700/train.py#L95
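(A rough usage sketch based on that description; build_model and the checkpoint layout are placeholders, so check the linked train.py for the exact loading code.)

import numpy as np
import torch

model = build_model()                         # placeholder: construct the WaveRNN as train.py does
state = torch.load("checkpoint.pth", map_location="cpu")
model.load_state_dict(state["state_dict"])    # assumed checkpoint layout
model.eval()
mel = np.load("mel.npy")                      # [80 x mel_length] mel spectrogram
with torch.no_grad():
    wav = model.generate(mel)                 # the generate function mentioned above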