israelg99 / deepvoice

Deep Voice: Real-time Neural Text-to-Speech
Apache License 2.0

Deep Voice

Join the chat at https://gitter.im/deep-voice/Lobby
Based on the Deep Voice paper.

This repository depends on my Keras fork until the fork is merged into the official Keras repository.
To install: pip3 install git+https://github.com/israelg99/keras.git
This will overwrite any previously installed Keras version.

Deep Voice is a text-to-speech system based entirely on deep neural networks.

Deep Voice comprises five models:

Grapheme-to-phoneme

Abstract

The grapheme-to-phoneme converter converts written text (e.g., English characters) to phonemes (encoded using a phonemic alphabet such as ARPABET).
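
To make the input and output formats concrete, here is a minimal sketch that uses a toy lookup table in place of the trained model. The names `TOY_LEXICON` and `toy_g2p` are hypothetical and are not part of this repository; the point is only to show what "text in, ARPABET phonemes out" looks like.

```python
# Toy lookup table standing in for a trained grapheme-to-phoneme model.
# Entries are illustrative ARPABET transcriptions; digits mark vowel stress.
TOY_LEXICON = {
    "hello": ["HH", "AH0", "L", "OW1"],
    "world": ["W", "ER1", "L", "D"],
}


def toy_g2p(text):
    """Map each known word to its ARPABET phonemes.

    Unknown words raise KeyError; a neural model would instead generalize
    to unseen spellings, which is the reason for using one.
    """
    phonemes = []
    for word in text.lower().split():
        phonemes.extend(TOY_LEXICON[word])
    return phonemes


print(toy_g2p("Hello world"))
```

A dictionary cannot handle words it has never seen, which is the gap the encoder-decoder model is meant to fill.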

Architecture

Based on this architecture but with some changes.

The grapheme-to-phoneme converter is an encoder-decoder:

It takes written text as input.

Setup

Hyperparameters

Phoneme Segmentation

Abstract

Architecture

The segmentation model uses a convolutional recurrent neural network based on Deep Speech 2.

The architecture graph

  1. Audio vector.
  2. 20 MFCCs with a 10 ms stride.
  3. Double 2D convolutions (frequency bins * time).
  4. Triple bidirectional recurrent GRUs.
  5. Softmax.
  6. Output: a sequence of phoneme pairs.
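
As a rough guide to the tensor shapes implied by steps 1 and 2, the sketch below computes how many MFCC frames an audio vector yields at a 10 ms stride. The 16 kHz sample rate and the ceiling-division padding rule are assumptions made for illustration; neither is stated above, and `mfcc_feature_shape` is a hypothetical helper.

```python
def mfcc_feature_shape(num_samples, sample_rate=16000, stride_ms=10, num_mfcc=20):
    """Return (frames, coefficients) for an audio vector.

    Assumes one frame starts every `stride_ms` milliseconds and that a final
    partial stride still produces a frame. Real feature extractors may pad
    or truncate differently.
    """
    hop = sample_rate * stride_ms // 1000   # samples per stride
    frames = -(-num_samples // hop)         # ceiling division
    return frames, num_mfcc


# One second of audio at the assumed 16 kHz sample rate.
print(mfcc_feature_shape(16000))
```

The first dimension is the time axis that the convolutions and the recurrent layers operate over; the second is the frequency-bin axis.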

Hyperparameters

Convolutions

Recurrent layers

Training

The segmentation model uses the connectionist temporal classification (CTC) loss.
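
For intuition about what the CTC loss measures, here is a brute-force sketch: it sums the probability of every frame-level path that collapses to the target sequence and returns the negative log of that sum. This is not the implementation used for training (practical implementations use dynamic programming), and the names `BLANK`, `collapse` and `ctc_loss_brute_force` are hypothetical.

```python
import itertools
import math

BLANK = 0  # index reserved for the CTC blank symbol (an assumption of this sketch)


def collapse(path):
    """Apply the CTC collapse rule: merge repeated symbols, then drop blanks."""
    merged = [s for i, s in enumerate(path) if i == 0 or s != path[i - 1]]
    return [s for s in merged if s != BLANK]


def ctc_loss_brute_force(probs, target):
    """Negative log-probability of `target` given per-frame distributions.

    probs[t][k] is the probability of symbol k at frame t. Every possible
    path is enumerated, so this only works for very small inputs.
    """
    num_symbols = len(probs[0])
    total = 0.0
    for path in itertools.product(range(num_symbols), repeat=len(probs)):
        if collapse(path) == list(target):
            total += math.prod(probs[t][s] for t, s in enumerate(path))
    return math.inf if total == 0.0 else -math.log(total)


# Three frames, uniform distribution over blank plus two labels.
uniform = [[1 / 3] * 3 for _ in range(3)]
print(ctc_loss_brute_force(uniform, [1]))
```

Because the loss marginalizes over all alignments, training does not need frame-level labels, only the target sequence for each utterance.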

Phoneme Duration + Frequency Predictor

Abstract

A single architecture is used to jointly predict phoneme duration and time-dependent fundamental frequency.

Phoneme Duration Abstract

The phoneme duration predictor predicts the temporal duration of every phoneme in a phoneme sequence (an utterance).

Frequency Predictor Abstract

The frequency predictor predicts whether a phoneme is voiced. If it is, the model predicts the fundamental frequency (F0) throughout the phoneme’s duration.
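
The sketch below shows one way the two kinds of prediction could be combined into a frame-level F0 contour. Everything here is an assumption made for illustration: the `PhonemePrediction` container, the 10 ms frame length, the use of 0.0 to mark unvoiced frames, and the choice to represent F0 as a list of samples spread uniformly over the phoneme.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PhonemePrediction:
    phoneme: str
    duration_s: float               # predicted duration in seconds
    f0_hz: Optional[List[float]]    # F0 samples across the phoneme; None if unvoiced


def f0_contour(predictions, frame_s=0.01):
    """Expand per-phoneme predictions into one F0 value per frame."""
    contour = []
    for p in predictions:
        n_frames = round(p.duration_s / frame_s)
        for i in range(n_frames):
            if p.f0_hz is None:
                contour.append(0.0)  # 0.0 marks an unvoiced frame in this sketch
            else:
                # Pick the F0 sample that covers this frame's position.
                idx = min(i * len(p.f0_hz) // n_frames, len(p.f0_hz) - 1)
                contour.append(p.f0_hz[idx])
    return contour


utterance = [
    PhonemePrediction("S", 0.02, None),
    PhonemePrediction("AH0", 0.04, [100.0, 120.0]),
]
print(f0_contour(utterance))
```

The duration decides how many frames each phoneme occupies, and the voiced flag decides whether those frames carry a pitch at all.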

Architecture

  1. A sequence of phonemes with stresses, encoded as one-hot vectors.
  2. Double fully-connected layers.
  3. Double unidirectional recurrent layers.
  4. Fully-connected layer.

Hyperparameters

Double fully-connected layers

Double unidirectional recurrent layers

Audio Synthesis

Abstract

Architecture

The architecture is based on WaveNet but with some changes.

Will be updated soon.