any information on the data format?

ghost commented 9 years ago

i'm trying to write a dynamic synthesis and have enough linguistic knowledge about phonemes, transcription, lpc, etc. i would either use a little neural net to translate the text directly into the target coefficients or do it by rules. before doing so, i need to understand what i would output (in terms of coefficients, energy, repetitions and possibly more). i tried to understand the file format from the code, but it seems to be compressed somehow, since voiced (ten) and unvoiced (four) sounds get different numbers of coefficients etc. i also don't quite get how to encode repetitions and energy.

would be great to have more info. thanks!

going-digital commented 9 years ago

The format is identical to the Texas Instruments format, as used on their speech chips.

Compression basically works like this:

1. Record speech and edit down.
2. Extract pitch and amplitude (aka energy).
3. Determine silence / pitch / noise frames.
4. For pitch and noise frames, determine optimal lattice filter coefficients (K values) using a standard LPC Levinson-Durbin algorithm.
5. Find nearest matching coefficients in the coefficient mapping table.
6. Replace similar frames with frame repeat codes.
7. Encode bitstream.
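The coefficient steps in the list above (Levinson-Durbin, then nearest match in a table) can be sketched roughly as follows. This is a minimal Python illustration, not the encoder used for Talkie's data: the function names are hypothetical, `TOY_K_TABLE` is a made-up placeholder rather than the chip's real coefficient table, and the sign convention for K values varies between texts and decoders, so it should be checked against the synthesis code before use.

```python
def autocorrelation(samples, order):
    """Return autocorrelation lags r[0..order] for one frame of samples."""
    n = len(samples)
    return [sum(samples[i] * samples[i + lag] for i in range(n - lag))
            for lag in range(order + 1)]


def levinson_durbin(r, order):
    """Levinson-Durbin recursion: autocorrelation -> reflection coefficients.

    Returns (k, error): k holds `order` lattice K values, error is the
    remaining prediction error energy. Convention: A(z) = 1 + sum a[j] z^-j.
    """
    a = [1.0] + [0.0] * order
    error = r[0]
    k = []
    for m in range(1, order + 1):
        if error <= 0.0:
            # Degenerate frame (e.g. all zeros): no further prediction gain.
            k.append(0.0)
            continue
        acc = r[m] + sum(a[j] * r[m - j] for j in range(1, m))
        km = -acc / error
        k.append(km)
        new_a = a[:]
        for j in range(1, m):
            new_a[j] = a[j] + km * a[m - j]
        new_a[m] = km
        a = new_a
        error *= 1.0 - km * km
    return k, error


# Placeholder values only (assumption): the real chip has one table per K.
TOY_K_TABLE = [-0.9, -0.5, 0.0, 0.5, 0.9]


def nearest_index(value, table):
    """Return the index of the table entry closest to value."""
    return min(range(len(table)), key=lambda i: abs(table[i] - value))


if __name__ == "__main__":
    r = [1.0, 0.5, 0.25]              # autocorrelation of a first-order process
    k, err = levinson_durbin(r, 2)
    print(k, err)
    print(nearest_index(k[0], TOY_K_TABLE))
```

What ends up in the bitstream is the table index, not the K value itself, so the quality of the result depends on how well each frame's coefficients survive that quantization.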

Keep an eye out for the bit order within a byte - the format is bit-oriented, and byte-based formats use inconsistent ordering.
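To make the bit-order point concrete, here is a hedged Python sketch that builds one frame as a list of bit fields and packs it into bytes in either order. The field widths (4-bit energy, 1-bit repeat, 6-bit pitch, then 5/5/4/4 bits for K1-K4 and 4/4/4/3/3/3 bits for K5-K10) and the special energy values are my reading of the TI frame layout and are an assumption to verify against the decoder; `frame_fields` and `pack_bits` are hypothetical names, and all values are table indices rather than physical quantities.

```python
K_WIDTHS_VOICED = (5, 5, 4, 4, 4, 4, 4, 3, 3, 3)   # ten coefficients
K_WIDTHS_UNVOICED = (5, 5, 4, 4)                   # four coefficients
STOP_FRAME = [(15, 4)]                             # assumed end-of-utterance code


def frame_fields(energy, pitch=0, k=(), repeat=False):
    """Return one frame as a list of (value, bit_width) fields."""
    if energy == 0:
        return [(0, 4)]                            # silence: energy field only
    fields = [(energy, 4), (int(repeat), 1), (pitch, 6)]
    if repeat:
        return fields                              # reuse previous K values
    widths = K_WIDTHS_VOICED if pitch != 0 else K_WIDTHS_UNVOICED
    if len(k) != len(widths):
        raise ValueError("wrong number of K indices for this frame type")
    return fields + list(zip(k, widths))


def pack_bits(fields, lsb_first=True):
    """Pack fields into bytes, most significant bit of each field first.

    lsb_first=True puts the first stream bit into bit 0 of each byte;
    lsb_first=False puts it into bit 7. The last byte is zero-padded.
    """
    bits = []
    for value, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError("value does not fit in its field")
        bits.extend((value >> s) & 1 for s in range(width - 1, -1, -1))
    bits.extend([0] * (-len(bits) % 8))
    out = bytearray()
    for i in range(0, len(bits), 8):
        byte = 0
        for pos, bit in enumerate(bits[i:i + 8]):
            byte |= bit << (pos if lsb_first else 7 - pos)
        out.append(byte)
    return bytes(out)


if __name__ == "__main__":
    frame = frame_fields(10, pitch=3, repeat=True)
    print(pack_bits(frame, lsb_first=True).hex())
    print(pack_bits(frame, lsb_first=False).hex())
```

The same eleven-bit repeat frame comes out as different bytes depending on `lsb_first`, which is why data from one tool can sound like noise in another until the order is matched to what the decoder expects.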

The only easily available software that already encodes recorded speech is QBox Pro - it's old, and I've never got it running properly. It's floating around the web in various places. A modern open source way of generating this format would be awesome, and would be welcomed by many communities.

I identified two shortcuts you might want to investigate if you're interested in rule-based text to speech:

"Speech" by Computer Concepts is a ROM for the BBC Micro that generates TMS5220 speech data on the fly from text using phonetic rule banks. I can see an English to phoneme mapping and a phoneme to K value table inside the ROM, but it's going to take a lot of reverse engineering time to map it to readable code.
"Terminal Emulator II" by Texas Instruments is an expansion cartridge for the TI99/4A. It also does English to phoneme to speech data mapping, although this time on the TMS5200 chip. Reverse engineering this one is even worse - it's written partly in native assembly language and partly in interpreted P-code. But the English to phoneme and K value tables are readable in the ROM. The only difference between the 5200 and 5220 is the coefficient table.

Bear in mind that English to phoneme to coefficient mapping is likely to take at least 8K of code and data - quite a chunk of Arduino code space.

jscuster commented 2 years ago

Just curious if anything has changed on on-the-fly tts?

I'm blind, I grew up with an Apple II E computer. The computer had an Echo II from Street Electronics installed. This is an expansion card based on the same chip.

A program, Textalker, read changes on the screen. There is an emulator of this in action on several disk images at https://bluegrasspals.com/blindapple/.

The point is that Textalker generated rule-based speech on the fly. I'm still learning how things work, but it may also be helpful.

As I said, I'm blind, so I'm very interested in finding a tts library like this.

Thanks for reading.

radiohound commented 1 year ago

@jscuster You might be interested in the work that has been done to port espeak-ng to Arduino. It is located here: https://github.com/pschatzmann/arduino-espeak-ng The author has it working on an ESP32.

Walter

going-digital / Talkie

any information on the data format? #11