allow control over the way acronyms and special data are processed

Most text-to-speech engines are inflexible in how they handle special content (ordinals, cardinals, percentages, dates, etc.) and acronyms.

For a given language, the following things need to be provided:

character classification rules -- used for identifying words, ordinals, cardinals, punctuation, abbreviations, etc.
part of speech tagging rules -- used to identify different word forms (e.g. "read" as /r'i:d/ vs /r'Ed/) and associate a variant with each (e.g. "read" -> read/1 (verb) vs read/2 (verb, past))
classified type to word rules -- used to normalize the text stream to a word list (e.g. "St. Noun" -> "saint noun", "Noun St." -> "noun street" and "St. Noun St." -> "saint noun street"; similarly "Dr." -> doctor/drive)
pronunciation dictionary -- used to map word/variant to a pronunciation transcription (e.g. read/1 -> /r'i:d/)
acronym dictionary -- used to map acronyms and abbreviations to words
letter to phoneme rules -- used to handle words not in the pronunciation dictionary [*]
phoneme to phoneme rules -- used to handle prosodic morphology (e.g. vowel weakening on unstressed vowels)
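
As a rough illustration of the "classified type to word" step, here is a minimal sketch of context-dependent abbreviation expansion. The rule table, the `normalize` function and the capitalization heuristic are all assumptions made for this example, not existing engine code; a real implementation would work from the classified token stream rather than from whitespace splitting.

```python
# Hypothetical rule table: abbreviation -> (reading before a name, reading after a name).
CONTEXT_ABBREVIATIONS = {
    "st.": ("saint", "street"),
    "dr.": ("doctor", "drive"),
}


def normalize(text):
    """Expand context-dependent abbreviations and return a lowercase word list."""
    tokens = text.split()
    words = []
    for i, token in enumerate(tokens):
        key = token.lower()
        if key not in CONTEXT_ABBREVIATIONS:
            words.append(key)
            continue
        before_name, after_name = CONTEXT_ABBREVIATIONS[key]
        # Assumed heuristic: a capitalized word directly after the abbreviation
        # means it introduces a name ("St. Noun"); otherwise it closes one
        # ("Noun St.").
        next_is_name = i + 1 < len(tokens) and tokens[i + 1][:1].isupper()
        words.append(before_name if next_is_name else after_name)
    return words
```

With these rules, `normalize("St. Noun St.")` gives `["saint", "noun", "street"]`. Ambiguous input such as "Noun St. Noun" is not handled by this heuristic and would need the part of speech tagging rules.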

[*] Strictly speaking, an exception dictionary should be created containing every word from the pronunciation dictionary whose pronunciation cannot be derived using the letter to phoneme and phoneme to phoneme rules. This allows the exception dictionary to be small and the letter to phoneme rules to be tested and verified against a reference set of words.
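
One way to derive such an exception dictionary is to run the rules over every dictionary entry and keep only the mismatches. The sketch below assumes a rule function that maps a word to a transcription string; `TOY_RULES`, `toy_letter_to_phoneme` and the phoneme symbols are invented placeholders for illustration, not real letter to phoneme rules.

```python
# Invented one-letter-per-phoneme rules, standing in for real letter to phoneme rules.
TOY_RULES = {"a": "a", "c": "k", "d": "d", "e": "E", "g": "g", "n": "n", "o": "0", "t": "t"}


def toy_letter_to_phoneme(word):
    """Transcribe a word letter by letter using the toy rules ('?' if unknown)."""
    return "".join(TOY_RULES.get(letter, "?") for letter in word)


def build_exception_dictionary(pronunciations, letter_to_phoneme):
    """Keep only the words whose reference pronunciation the rules fail to reproduce."""
    return {
        word: phonemes
        for word, phonemes in pronunciations.items()
        if letter_to_phoneme(word) != phonemes
    }


# Toy reference dictionary: "cat" and "dog" follow the rules, "one" does not.
reference = {"cat": "kat", "dog": "d0g", "one": "wVn"}
exceptions = build_exception_dictionary(reference, toy_letter_to_phoneme)
```

Here `exceptions` contains only the entry for "one". The same comparison doubles as a regression test for the rules: the size of the exception dictionary shows how many reference words the rules currently miss.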

It should be possible to choose the classification scheme and abbreviation rules for the document being read. For example, email/SMS abbreviations could be applied when reading email documents.

Where possible, the text-to-speech engine should select appropriate defaults, but this behaviour should be overridable (e.g. suppressing US state abbreviation expansion in addresses).

For the UI, this could be handled as a drop-down with a list of profiles ("email/sms", "novel", "technical", "chess", etc.).
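
A possible shape for such profiles, with defaults that can be overridden per document, is sketched below. The `Profile` class, its fields, the `PROFILES` table and the sample abbreviations are hypothetical names chosen for this example only.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class Profile:
    """Hypothetical bundle of abbreviation rules and default behaviours."""
    name: str
    abbreviations: dict = field(default_factory=dict)
    expand_us_states: bool = True  # default behaviour, overridable per document


# Illustrative profiles; the abbreviation entries are sample data.
PROFILES = {
    "email/sms": Profile("email/sms", {"btw": "by the way", "imo": "in my opinion"}),
    "novel": Profile("novel", {"mr.": "mister"}),
}


def select_profile(name, **overrides):
    """Return the named profile with any default settings overridden."""
    return replace(PROFILES[name], **overrides)


def expand(words, profile):
    """Replace each word that the profile lists as an abbreviation."""
    return [profile.abbreviations.get(word.lower(), word) for word in words]
```

For example, `select_profile("email/sms", expand_us_states=False)` keeps the email/SMS abbreviations but turns off state expansion, and `expand(["btw", "hello"], select_profile("email/sms"))` gives `["by the way", "hello"]`.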

rhdunn / cainteoir-engine #29