Open pipfrosch opened 4 years ago
Where you say:
Unfortunately neither solution allows for regional pronunciation variations.
This isn't strictly true for PLS; you just can't do it in a single file. Regional variations can be provided by designating the language of the lexicon on the link declaration:
<link rel="pronunciation" type="application/pls+xml" hreflang="en-us" href="en-us.pls"/>
<link rel="pronunciation" type="application/pls+xml" hreflang="en-gb" href="en-gb.pls"/>
But the obvious fact remains that there hasn't been any appreciable uptake of these technologies, and multiple files is arguably cumbersome, so it's kind of a moot point. :)
Have you had a look at the WAI Pronunciation work, though? They're working on a solution for web-based content, so that may be a more appropriate place to take your proposal.
I'll take it to that group, but using hreflang doesn't allow users to get the pronunciation for the region they prefer in cases where a specific pronunciation doesn't matter to rhythm or rhyme, which can be important to comprehension.
EDIT: I am going to try to write the proposal a little more clearly and repost it at their GitHub; that definitely looks like the right place.
That's an interesting proposal, @pipfrosch, thanks for that. Because Readium reading toolkits are using the TTS features offered by the OS via a browser API (Chromium on PC/Mac/Linux, Chrome on Android, WebKit on iOS), there is nothing Readium can do if it isn't available on the underlying OS & browser.
I didn't develop the (quite simple) TTS feature available on the Readium Mobile Android toolkit, but I just had a quick look at the [Chrome TTS API](https://developer.chrome.com/apps/tts) and the [Android TTS engine](https://developer.android.com/reference/android/speech/tts/TextToSpeech) to get an overview of what is available on Android. That is pretty limited so far; SSML is supported by the API, but I didn't see anything related to lexicons. I encourage you to look at such APIs when you make a proposal at the WAI level, so that the discussion with TTS API & engine developers can be fruitful.
SSML as specified in EPUB 3 is used in Japan. Lentrance Reader supports it. Tokyo Shoseki (the biggest textbook publisher in Japan) uses it. There was a government project for the promotion of SSML. Here is one of its reports (in Japanese). I am sure that I can find more.
The Problem
Accessibility is really important to me, but I will probably never have the funds to provide audio versions of what Pipfrosch Press is publishing. Some of my planned publications will have frequent content updates rather than being static publications. For example, my planned field guide to Contra Costa County will likely never be finished, with new species accounts added every year and existing species accounts modified with some frequency.
For the print-disabled user, Text To Speech (TTS) Synthesis will be how they access the content.
ePub currently has two different mechanisms for providing pronunciation hints to TTS Synthesizers, PLS and SSML.
When there is only one way to pronounce a grapheme, PLS is the better option, as it allows a single document that can be updated as needed, either by the ePub publisher or by a school or library. PLS also supports multiple phonetic alphabets at the same time.
Where there are multiple ways to pronounce a grapheme, SSML is better because it allows the specific pronunciation to be specified for each use of the grapheme. However, SSML only allows a single phonetic alphabet to be specified.
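For reference, the SSML mechanism in ePub is attribute-based; a minimal sketch (the phrase is illustrative, and it assumes the `ssml` namespace, `http://www.w3.org/2001/10/synthesis`, is declared on the root element):

```html
<p>The
  <span ssml:alphabet="ipa" ssml:ph="wɪnd">wind</span>
  turbine began to
  <span ssml:alphabet="ipa" ssml:ph="waɪnd">wind</span>
  down.
</p>
```

Because `ssml:alphabet` and `ssml:ph` are single attributes, only one phonetic alphabet can be offered per occurrence.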
Unfortunately neither solution allows for regional pronunciation variations.
Even though both PLS and SSML have been in the ePub standard for some time, they are not implemented by the vast majority of ePub viewers. I have heard of one custom viewer used by a Japanese school district that implements them, but I was not able to confirm it.
I recommend a new solution: a single solution that covers both use cases, allows for region-specific pronunciations, and supports as many different phonetic alphabets as the ePub publisher knows about.
This solution does not have to be restricted to ePub but could work with any digital publishing format, including websites and PDF (though perhaps not as an embedded solution within PDF, I do not know).
This probably should only become part of the ePub standard if Apple, Google, and EDRLab are on board and committed to implementing it in their software. How to get them on board, I have no clue. I have social anxiety and as a result do not often project confidence when proposing solutions, even were I to find a way to get their ear, and unfortunately, when something is proposed without an appearance of confidence, those with the power to implement cannot see past the presentation to the value of what is being presented.
This solution probably needs to be adjusted by those with far more experience in the issues related to TTS Synthesis than I have, but this solution should be fairly easy to extend as is.
It probably needs to be yet another W3C project for experts in the field to refine. It is my hope that someone who knows how to work the system sees the value in this and runs with it. I do not need any credit if that happens; I just want a solution that works well as I publish my ePubs, a solution that brings print-disabled users enjoyment rather than frustration.
JSON Pronunciation Library
Example JSON file attached.
The format for the JSON Pronunciation Library shall be JSON. JSON was chosen for the ease with which valid JSON files may be generated from database queries in a number of programming languages, including Python and PHP. I am personally a big fan of XML, but this, I think, should be JSON.
The character set for the JSON pronunciation library will be UTF-8.
The first definition in the JSON Pronunciation Library shall be `lang`, and it shall be assigned either a string value of a BCP-47 language code or a list of BCP-47 language codes. Examples:
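Either of the two permitted forms, as sketched here:

```json
"lang": "en"
```

```json
"lang": ["en", "en-US", "en-GB", "en-AU"]
```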
In most cases, the generic language is to be preferred over a localized language.
The text to speech synthesizer will only use a JSON Pronunciation Library that matches the currently specified language within the (X)HTML document. For example, if the current document is specified as `en-US`, then a JSON Pronunciation Library with `"lang": "es"` would not be used for pronunciations, except for a string within a node labeled with the XML attribute `lang="es"`. This is to avoid collisions where languages that share the same alphabet have words with an identical grapheme that are pronounced quite differently; it allows the text to speech synthesizer to use its own pronunciation algorithms in the event that an entry exists for one language but does not exist for the language specified for the string being read.

Pronunciation Context Dictionary
The JSON Pronunciation Library will have at least one context dictionary named `default` but may have additional context dictionaries. In the example JSON Pronunciation Library, additional context dictionaries named `taxonomy` (for taxonomy names) and `proper` (for proper names) are provided. The `default` context dictionary is to be used by TTS synthesizers either when a context is not specified or when the grapheme is not found in the specified context dictionary. Each context dictionary will have a list named `entries`.
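Since the attached example file is not reproduced inline, here is a minimal sketch of the overall shape (only `lang`, `default`, `entries`, and the two extra context dictionary names come from the text above; the rest is placeholder):

```json
{
  "lang": ["en", "en-US", "en-GB"],
  "default":  { "entries": [] },
  "taxonomy": { "entries": [] },
  "proper":   { "entries": [] }
}
```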
grapheme entry
Each context dictionary entry list item must have a `grapheme` definition that specifies either a string or a list of strings. Examples:
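As sketches, either a single spelling, or several spellings sharing one entry:

```json
"grapheme": "vase"
```

```json
"grapheme": ["Dr.", "Dr"]
```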
The specified `grapheme` should not be interpreted as case sensitive.

In cases where only one pronunciation for that grapheme is provided, one or more phonetic alphabets with the corresponding phoneme can be specified. An example that provides a phoneme for both `ipa` and `x-sampa`:
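A sketch of such an entry (the flat one-key-per-alphabet layout is illustrative):

```json
{
  "grapheme": "tomato",
  "ipa": "təˈmeɪtoʊ",
  "x-sampa": "t@\"meItoU"
}
```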
The text to speech synthesizer can then pick the alphabet it has the best support for and use that phoneme to pronounce the grapheme.
speechpart
In some languages, the same grapheme may have a different pronunciation depending upon the part of speech it is used as. For example, the grapheme `wind` in English is pronounced differently (and has a different meaning) depending upon whether it is a noun (or adjective) or a verb.

In those cases, a `speechpart` can be defined, and the (X)HTML author should specify the speech part with a span element. The `speechpart` will then hold either the phoneme or the regional variation phonemes. An example:
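A sketch (the nesting of alphabets under each part of speech is illustrative; order matters, since the first `speechpart` is the fallback):

```json
{
  "grapheme": "wind",
  "speechpart": {
    "noun": { "ipa": "wɪnd" },
    "verb": { "ipa": "waɪnd" }
  }
}
```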
When the `speechpart` is not specified by the (X)HTML, the text to speech synthesizer may attempt to detect the speech part based upon a grammatical parsing of the sentence, as some seem to do already, but best practice should be for the (X)HTML author to specify the `speechpart` as an XML attribute on a span element around the grapheme.

When the `speechpart` is not determined or does not match a specified `speechpart`, then the first `speechpart` should be used. In the above example, with the sentence "That is a beautiful wind turbine", wind is an adjective, but since a pronunciation for the grapheme `wind` as an adjective is not specified, the noun phoneme for `wind` would be used, since it is the first defined `speechpart`.

Regional Pronunciation
Within the same language, sometimes a grapheme has a different pronunciation depending upon political borders or cultural grouping.
An example of this is the grapheme `vase`. It seems to be pronounced differently in America than in Great Britain than in Australia, though I am not positive about the latter.

In those cases, a list of phonemes for the grapheme may be provided. For example:
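A sketch (the `phonemes` key name is illustrative):

```json
{
  "grapheme": "vase",
  "phonemes": [
    { "lang": "en-US", "ipa": "veɪs" },
    { "lang": "en-GB", "ipa": "vɑːz" }
  ]
}
```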
In these cases, the `lang` specifies the pronunciation language rather than the document language. A British user reading an ePub that specifies `en-US` will probably prefer that words be pronounced the British way, and in fact may have lower comprehension if they are pronounced the American way.

However, there are cases, such as poetry where rhymes and near-rhymes are important, in which the (X)HTML author should be able to specify that a particular regional variation of the language be used.
Context Dictionary Use Cases
In some cases, such as taxonomy names and proper names, the correct way to pronounce a word may differ from the way the same grapheme is ordinarily pronounced.
The (X)HTML author should be able to define context dictionaries for these special cases and use an attribute in a span or other element around the string that alerts the text to speech synthesizer to look in the specified context dictionary for the pronunciation before looking in the default context dictionary. What these dictionaries are named should be up to the author.
(X)HTML Attributes
The written language should be detected from the language specified in the ePub OPF file's `<dc:language></dc:language>` element, but that language should be allowed to be overridden within an XHTML document with the `lang` attribute, such as one might have for a bibliography entry for a work written in a different language.

At least in English and most languages I am familiar with, words are delimited by white-space. How to specify that a sub-string including a space is a grapheme the TTS synthesizer should look up in the library, I have not yet considered, but it would be an attribute on a parent span (or whatever) node, probably a binary attribute (the kind represented without a value in HTML but with any value in XML to indicate True). I understand that in Australia, they call a root beer float a spider. Things like that could, at the discretion of the (X)HTML author, be accommodated by specifying `root beer float` as a grapheme. That is probably a very poor example, but there are other examples where those of us who are not print-disabled see a string but read it in our minds as words other than what is printed, especially strings that involve an abbreviation.

For the other attributes...
If it were up to me, I would create `speech-` attributes the TTS synthesizer could trigger off of.

For specifying the `speechpart`, something like:
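A sketch (the `speech-part` attribute name is illustrative):

```html
<p>The <span speech-part="noun">wind</span> was too strong for her to
<span speech-part="verb">wind</span> the sail back in.</p>
```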
For specifying the spoken language to be used when it is critical that a particular regional pronunciation be used:
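A sketch (the `speech-region` attribute name and the verse are illustrative; here `vase` needs the British pronunciation to near-rhyme with `stars`):

```html
<p speech-region="en-GB">We gathered flowers beneath the stars<br/>
and set them gently in the vase.</p>
```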
Note that the `speech-region` attribute should trigger the text to speech synthesizer to use an algorithm for the specified region even for a grapheme that is not specified in the JSON Pronunciation Library. In the example above, even if the speech synthesizer only had an algorithm for American English, it would still read the grapheme `vase` correctly in the rhyme.

For specifying the context dictionary, something like:
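A sketch (the `speech-context` attribute name is illustrative; `taxonomy` and `proper` are the context dictionaries from the example library):

```html
<p>The coyote (<i speech-context="taxonomy">Canis latrans</i>) is common
throughout <span speech-context="proper">Contra Costa</span> County.</p>
```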
Please, let's make this happen. For the present, even though no one is implementing them, I will use PLS and SSML, but those systems have limitations that could easily be solved by this kind of pronunciation library.
Thank you for your time.

pronunciation.json.txt