rhdunn / cainteoir-engine

The Cainteoir Text-to-Speech core engine
http://reecedunn.co.uk/cainteoir/
GNU General Public License v3.0

support generating phoneme (and prosody) output #27

Closed rhdunn closed 10 years ago

rhdunn commented 11 years ago

The aim here is to provide an API to the underlying phoneme and prosody information for a given document being spoken. This API should produce information that can be used to synchronize the phonemes with the document (e.g. word and sentence highlighting). It should consume a document reader instance (so it can handle prosody cleanly) and expose a reader-style API (allowing flexibility in how it is used).

Therefore, synthesis is split into:

DOCUMENT -> document reader -> phoneme/prosody reader -> synthesizer -> AUDIO

This means that the phoneme/prosody reader (language) and synthesizer (voice) are handled separately and should be independently selectable through an appropriate API. That is, the phoneme/prosody reader should be selected based on the document's language (or overridden by the user) and the synthesizer should be selected based on the available voices for that language. If the two use different phoneme transcription schemes, the phonemes will be converted using an appropriate mapping.
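As a sketch, the reader-style layering described above might look like the following. All of the type and member names here (`document_reader`, `phoneme_reader`, the event structs) are illustrative, not the actual cainteoir-engine API:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Illustrative sketch only: these names are hypothetical, not the
// cainteoir-engine API.

// A text event from the document reader, with character offsets back
// into the source document.
struct text_event
{
    std::string text;
    std::size_t begin;
    std::size_t end;
};

// Reader-style API: repeatedly call read() until it returns false.
struct document_reader
{
    std::vector<text_event> events;
    std::size_t pos = 0;

    bool read(text_event &out)
    {
        if (pos >= events.size()) return false;
        out = events[pos++];
        return true;
    }
};

// A phoneme event keeps the source offsets so a client can highlight
// the word being spoken while the synthesizer consumes the phonemes.
struct phoneme_event
{
    std::string phonemes;
    std::size_t begin;
    std::size_t end;
};

// The phoneme/prosody reader consumes a document reader and exposes
// the same reader-style API, converting text to phonemes as it goes.
struct phoneme_reader
{
    document_reader &doc;

    bool read(phoneme_event &out)
    {
        text_event e;
        if (!doc.read(e)) return false;
        // Placeholder letter-to-sound step; a real reader would apply
        // language rules or a pronunciation dictionary here.
        out = { "/" + e.text + "/", e.begin, e.end };
        return true;
    }
};
```

Because the phoneme reader preserves the document offsets, a client can drive word or sentence highlighting from the same stream the synthesizer consumes.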

The phoneme/prosody reader should therefore expose the phonemes in a transcription-independent form (e.g. using the underlying phonological features as described in Kirshenbaum's ascii-ipa).
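For example, a transcription-independent phoneme can be modelled as a bundle of phonological features, with per-scheme tables rendering the same bundle in different transcriptions. This is a hypothetical sketch (the feature abbreviations follow Kirshenbaum's ascii-ipa, but the types are not the real API):

```cpp
#include <map>
#include <set>
#include <string>

// Hypothetical sketch: a phoneme as a set of phonological features,
// independent of any transcription scheme. Feature abbreviations follow
// Kirshenbaum's ascii-ipa (vls = voiceless, pla = palato-alveolar,
// frc = fricative).
using phoneme = std::set<std::string>;

// A transcription scheme is a table from feature bundles to strings;
// the same phoneme renders differently per scheme.
std::string transcribe(const phoneme &p,
                       const std::map<phoneme, std::string> &scheme)
{
    auto match = scheme.find(p);
    return match != scheme.end() ? match->second : "?";
}
```

Mapping between two transcription schemes then reduces to looking the phoneme up in the source scheme's table and writing it out with the target scheme's table, which is the conversion step mentioned above.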

The phoneme data can be written out by skipping the synthesizer step.

Phonemes can be synthesized by hooking up the phoneme/prosody reader to a phoneme file/data buffer (generating a document reader event containing a phoneme group).
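A minimal sketch of those two directions, serializing the phoneme stream out as text and parsing a phoneme buffer back in, might look like this (the function names and the space-separated format are assumptions for illustration, not the real API):

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical sketch, not the cainteoir-engine API.

// Serialize a phoneme sequence as text, i.e. the "skip the synthesizer
// and write the phoneme data out" path.
std::string write_phonemes(const std::vector<std::string> &phonemes)
{
    std::string out;
    for (const auto &p : phonemes)
    {
        if (!out.empty()) out += ' ';
        out += p;
    }
    return out;
}

// Parse a whitespace-separated phoneme buffer back into a sequence,
// standing in for a document reader event containing a phoneme group.
std::vector<std::string> read_phonemes(const std::string &buffer)
{
    std::istringstream in(buffer);
    std::vector<std::string> phonemes;
    std::string p;
    while (in >> p) phonemes.push_back(p);
    return phonemes;
}
```

Because the two functions are inverses, a phoneme stream written out by one run can be fed back in later and synthesized without re-reading the original document.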

NOTE: espeak cannot be used as both the phoneme/prosody reader and the voice synthesizer unless the phoneme/prosody step is done in one pass, since espeak_Synth would need to be called for each role (a performance issue).

It would be ideal to expose mbrola voices independently of the espeak engine (acknowledging them as mbrola voices, not espeak voices) and to support mbrola directly in cainteoir-engine.

Ultimately, the phoneme/prosody reader and synthesizer should be implemented directly in cainteoir-engine with the ability to read the espeak language data (for the phoneme/prosody reader) and espeak phoneme data (for the synthesizer). This should allow different language rules/dictionaries to be used (cmudict, olead, cainteoir, ...) and different sounding voices.

rhdunn commented 10 years ago

There are now phoneme and prosody APIs available, with readers and writers for both. For engines that support it (espeak), phoneme output can be generated. Anything else will be built on top of this architecture.