Open ghost opened 9 years ago
The format is identical to the Texas Instruments format, as used on their speech chips.
Compression basically works like this:
Keep an eye out for the bit order within a byte: the format is bit-oriented, and byte-based formats are inconsistent about which bit comes first.
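To make the bit-order point concrete, here is a minimal Python sketch of a frame reader. It rests on two assumptions that are not spelled out in this thread and should be checked against your own data: the field widths follow the TMS5220 layout (4-bit energy, 1-bit repeat, 6-bit pitch, then K1..K10 at 5, 5, 4, 4, 4, 4, 4, 3, 3, 3 bits), and each stored byte has to be bit-reversed before the fields are read MSB-first. The names `BitReader` and `read_frames` are made up for this example.

```python
# Sketch only: field widths assume the TMS5220 frame layout, and the
# per-byte bit reversal is an assumption about how the data is stored.
K_BITS = [5, 5, 4, 4, 4, 4, 4, 3, 3, 3]  # assumed widths of K1..K10


class BitReader:
    def __init__(self, data):
        # Reverse the bits of every byte once, then read fields MSB-first.
        self.bits = "".join(f"{b:08b}"[::-1] for b in data)
        self.pos = 0

    def read(self, n):
        chunk = self.bits[self.pos:self.pos + n]
        if len(chunk) < n:
            raise EOFError("ran out of bits before a stop frame")
        self.pos += n
        return int(chunk, 2)


def read_frames(data):
    """Return a list of frames as dicts, ending with the stop frame."""
    reader = BitReader(data)
    frames = []
    while True:
        energy = reader.read(4)
        if energy == 0:
            # Silence: the frame is just the 4 energy bits.
            frames.append({"type": "silence"})
            continue
        if energy == 15:
            # Energy 15 marks the end of the utterance.
            frames.append({"type": "stop"})
            return frames
        repeat = reader.read(1)
        pitch = reader.read(6)
        frame = {
            "type": "voiced" if pitch else "unvoiced",
            "energy": energy,
            "repeat": repeat,
            "pitch": pitch,
        }
        if not repeat:
            # Voiced frames carry ten coefficients, unvoiced frames four.
            count = 10 if pitch else 4
            frame["k"] = [reader.read(w) for w in K_BITS[:count]]
        frames.append(frame)
```

If the decoded frames look like noise (wild energy jumps, no stop frame), try dropping the `[::-1]` reversal, since that is the part most likely to differ between data sources.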
The only easily available software that already encodes recorded speech is QBox Pro. It's old, and I've never got it running properly. It's floating around the web in various places. A modern open-source way of generating this format would be awesome, and would be welcomed by many communities.
I identified two shortcuts you might want to investigate if you're interested in rule based text to speech:
Bear in mind that English-to-phoneme-to-coefficient mapping is likely to take at least 8K of code and data, which is quite a chunk of Arduino code space.
Just curious if anything has changed on on-the-fly TTS?
I'm blind, and I grew up with an Apple IIe computer. The computer had an Echo II from Street Electronics installed. This is an expansion card based on the same chip.
A program, Textalker, read changes on the screen. There is an emulator of this in action on several disk images at https://bluegrasspals.com/blindapple/.
The point is that Textalker generated rule-based speech on the fly. I'm still learning how things work, but it may also be helpful.
As I said, I'm blind, so I'm very interested in finding a TTS library like this.
Thanks for reading.
@jscuster You might be interested in the work that has been done to port espeak-ng to Arduino. It is located here: https://github.com/pschatzmann/arduino-espeak-ng. He has it working on an ESP32.
Walter
I'm trying to implement dynamic synthesis and have enough linguistic knowledge about phonemes, transcription, LPC, etc. I would either use a small neural net to translate the text directly into the target coefficients or do it by rules. Before doing so, I need to understand what I would output (in terms of coefficients, energy, repetitions, and possibly more). I tried to understand the file format from the code. It seems to be compressed somehow, since voiced (ten) and unvoiced (four) sounds get different numbers of coefficients, etc. I also don't quite get how to encode repetitions and energy.
It would be great to have more info. Thanks!
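For anyone with the same question, here is a rough Python sketch of what a rule-based or neural front end would have to emit. It assumes the TMS5220 frame layout (4-bit energy, 1-bit repeat flag, 6-bit pitch, then K1..K10 at 5, 5, 4, 4, 4, 4, 4, 3, 3, 3 bits) and assumes each byte is stored bit-reversed; neither is confirmed in this thread. On the TI chips every field is an index into a quantization table rather than a raw value, so the model or rules would target those indices. The function name `pack_frames` and the dict keys are invented for this example.

```python
# Sketch only: widths and bit order are assumptions (TMS5220-style layout).
K_BITS = [5, 5, 4, 4, 4, 4, 4, 3, 3, 3]  # assumed widths of K1..K10


def field(value, width):
    """Format one table index as a fixed-width, MSB-first bit string."""
    if not 0 <= value < (1 << width):
        raise ValueError(f"{value} does not fit in {width} bits")
    return f"{value:0{width}b}"


def pack_frames(frames):
    """Pack frames into bytes and append a stop frame.

    Each frame is a dict:
      {"energy": 0}                      silence (only 4 bits are stored)
      {"energy": e, "pitch": p}          repeat: reuse previous coefficients
      {"energy": e, "pitch": 0, "k": [4 indices]}    unvoiced
      {"energy": e, "pitch": p, "k": [10 indices]}   voiced (p > 0)
    """
    bits = []
    for frame in frames:
        energy = frame["energy"]
        if energy == 15:
            raise ValueError("energy 15 is reserved for the stop frame")
        bits.append(field(energy, 4))
        if energy == 0:
            continue  # silence carries no further fields
        pitch = frame["pitch"]
        k = frame.get("k")  # None means "repeat the previous coefficients"
        bits.append("1" if k is None else "0")
        bits.append(field(pitch, 6))
        if k is not None:
            expected = 10 if pitch else 4
            if len(k) != expected:
                raise ValueError(f"expected {expected} coefficients, got {len(k)}")
            bits.extend(field(v, w) for v, w in zip(k, K_BITS))
    bits.append("1111")  # stop frame
    stream = "".join(bits)
    stream += "0" * (-len(stream) % 8)  # pad the last byte with zeros
    # Reverse each group of 8 bits to get the assumed stored byte order.
    return bytes(int(stream[i:i + 8][::-1], 2) for i in range(0, len(stream), 8))
```

Under these assumptions a voiced frame costs 50 bits, an unvoiced frame 29, a repeat frame 11, and silence 4, which is where the compression comes from: energy and pitch can change every frame while the coefficients are only resent when they change.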