w3c / publishingcg

Repository of the Publishing Community Group
https://www.w3.org/community/publishingcg/

Phonetic Markup Proposal #5

Open pipfrosch opened 4 years ago

pipfrosch commented 4 years ago

The Problem

Accessibility is really important to me, but I will probably never have the funds to provide audio versions of what Pipfrosch Press publishes. Some of my planned publications will receive frequent content updates rather than being static. For example, my planned field guide to Contra Costa County will likely never be finished: new species accounts will be added every year, and existing species accounts will be modified with some frequency.

For the print-disabled user, Text To Speech (TTS) Synthesis will be how they access the content.

ePub currently has two different mechanisms for providing pronunciation hints to TTS synthesizers: PLS (the Pronunciation Lexicon Specification) and SSML (the Speech Synthesis Markup Language).

When there is only one way to pronounce a grapheme, PLS is the better option: it allows a single document that can be updated as needed, either by the ePub publisher or by a school or library. PLS also supports multiple phonetic alphabets at the same time.
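
For illustration, a minimal PLS lexicon looks something like the following sketch (my own example entry, not taken from any published lexicon). If I read the PLS specification correctly, the alphabet attribute on the lexicon element sets the default phonetic alphabet, and an individual phoneme element may override it, which is how a single PLS file can carry multiple alphabets:

<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0" xml:lang="en"
         xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
         alphabet="ipa">
  <lexeme>
    <grapheme>aestivate</grapheme>
    <phoneme>ˈɛstɪˌveɪt</phoneme>
    <!-- per-phoneme override of the default alphabet -->
    <phoneme alphabet="x-sampa">EstI%veIt</phoneme>
  </lexeme>
</lexicon>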

Where there are multiple ways to pronounce a grapheme, SSML is better because it allows the pronunciation to be specified for each individual use of the grapheme. However, SSML only allows a single phonetic alphabet to be specified.
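
In EPUB 3 this takes the form of the ssml:ph and ssml:alphabet attributes. A sketch of how the two pronunciations of wind might be marked up (my own example sentence; note that only one alphabet can be given for each attribute):

<p xmlns:ssml="http://www.w3.org/2001/10/synthesis">
  The <span ssml:alphabet="ipa" ssml:ph="wɪnd">wind</span> died down, so he
  could <span ssml:alphabet="ipa" ssml:ph="waɪnd">wind</span> the kite string.
</p>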

Unfortunately neither solution allows for regional pronunciation variations.

Even though both PLS and SSML have been in the ePub standard for some time, the vast majority of ePub viewers do not implement them. I have heard of one custom viewer used by a Japanese school district that implements them, but I was not able to confirm it.

I recommend a new, single solution that covers both use cases, allows for region-specific pronunciations, and supports as many different phonetic alphabets as the ePub publisher knows about.

This solution does not have to be restricted to ePub; it could work with any digital publishing format, including websites and PDF (though perhaps not as an embedded solution within PDF, I do not know).

This probably should only become part of the ePub standard if Apple, Google, and EDRLab are on board and committed to implementing it in their software. How to get them on board, I have no clue. I have social anxiety and as a result do not often portray confidence when proposing solutions, even were I to find a way to get their ear; unfortunately, when something is proposed without an appearance of confidence, those with the power to implement it often cannot see past the presentation to the value of what is being presented.

This solution probably needs to be adjusted by those with far more experience in the issues related to TTS Synthesis than I have, but this solution should be fairly easy to extend as is.

It probably needs to be yet another W3C project for experts in the field to refine. It is my hope that someone who knows how to work the system to make things happen sees the value in this and runs with it. I do not need any credit if that happens; I just want a solution that works well as I publish my ePubs, a solution that brings print-disabled users enjoyment rather than frustration.

JSON Pronunciation Library

Example JSON file attached.

The format for the JSON Pronunciation Library shall be JSON. JSON was chosen for the ease with which valid JSON files may be generated from database queries in a number of programming languages, including Python and PHP. I am personally a big fan of XML, but this, I think, should be JSON.

The character encoding for the JSON pronunciation library will be UTF-8.

The first definition in the JSON pronunciation library shall be lang, assigned either a string value containing a single BCP 47 language code or a list of BCP 47 language codes.

Examples:

"lang": "en"
"lang": "en-US", "en-GB"

In most cases, the generic language is to be preferred over a localized language.

The text to speech synthesizer will only use a JSON Pronunciation Library that matches the language specified for the current (X)HTML document. For example, if the current document is specified as "en-US", then a JSON Pronunciation Library with lang="es" would not be used for pronunciations, except for a string within a node labeled with the XML attribute lang="es". This is to avoid collisions between languages that share the same alphabet and have words with an identical grapheme but quite different pronunciations. It also allows the text to speech synthesizer to fall back on its own pronunciation algorithms when an entry exists for one language but not for the language specified for the string being read.
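
To illustrate the matching rule with a fragment of my own:

<!-- The document is en-US, so an es library is consulted only for the span. -->
<p lang="en-US">He asked for the <span lang="es">horchata</span> recipe.</p>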

Pronunciation Context Dictionary

The JSON Pronunciation Library will have at least one context dictionary named default but may have additional context dictionaries. In the example JSON Pronunciation Library, additional context dictionaries named taxonomy (for taxonomy names) and proper (for proper names) are provided.

The default context dictionary is to be used by TTS synthesizers either when a context is not specified or when the grapheme is not found in the specified context dictionary.

Each context dictionary will have a list named entries.

grapheme entry

Each context dictionary entry list item must have a grapheme definition that specifies either a string or a list of strings. Examples:

"grapheme": "job"

"grapheme": ["estivate", "aestivate", "æstivate"]

The specified grapheme should be matched case-insensitively.

In cases where only one pronunciation for the grapheme is provided, one or more phonetic alphabets with the corresponding phoneme can be specified. An example that provides a phoneme in both ipa and x-sampa:

{
  "grapheme": ["estivate", "aestivate", "æstivate"],
  "ipa": "ˈɛstɪˌveɪt",
  "x-sampa": "EstI%veIt"
}

The text to speech synthesizer can then pick the alphabet it has the best support for and use that phoneme to pronounce the grapheme.
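
Putting the pieces so far together, the overall shape of a library file would be something like this sketch (assuming, as in the attached example file, that each context dictionary is a top-level key alongside lang; the phonemes are merely illustrative):

{
  "lang": "en",
  "default": {
    "entries": [
      {
        "grapheme": "job",
        "ipa": "dʒɑb",
        "x-sampa": "dZAb"
      }
    ]
  },
  "proper": {
    "entries": [
      {
        "grapheme": "job",
        "ipa": "dʒoʊb",
        "x-sampa": "dZoUb"
      }
    ]
  }
}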

speechpart

In some languages, the same grapheme may have a different pronunciation depending upon the part of speech it is used as. For example, the grapheme wind in English is pronounced differently, and has a different meaning, depending upon whether it is a noun (or adjective) or a verb.

In those cases, a speechpart can be defined, and the (X)HTML author should specify the part of speech with a span element. Each speechpart then holds either the phoneme or a regional variation phoneme. An example:

{
  "grapheme": "wind",
  "speechpart": {
    "noun" : {
      "ipa": "wɪnd",
      "x-sampa": "wInd"  
    },
    "verb" : {
      "ipa": "waɪnd",
      "x-sampa": "waInd"
    }
  }
}

When the speechpart is not specified in the (X)HTML, the text to speech synthesizer may attempt to detect the part of speech from a grammatical parsing of the sentence, as some seem to do already, but best practice should be for the (X)HTML author to specify the speechpart as an XML attribute on a span element around the grapheme.

When the speechpart is not determined, or does not match a specified speechpart, the first speechpart should be used. In the above example, in the sentence "That is a beautiful wind turbine", wind is an adjective; since a pronunciation for the grapheme wind as an adjective is not specified, the noun phoneme would be used, as noun is the first defined speechpart.

Regional Pronunciation

Within the same language, sometimes a grapheme has a different pronunciation depending upon political borders or cultural grouping.

An example of this is the grapheme vase, which seems to be pronounced differently in America, Great Britain, and Australia, though I am not positive about the latter.

In those cases, a list of phonemes for the grapheme may be provided. For example:

{
  "grapheme": "vase",
  "languages" : [
    {
      "lang": "en-US",
      "ipa": "veɪs",
      "x-sampa": "veIs"
    },
    {
      "lang": ["en-GB", "en-IE"]
      "ipa": "vɑz",
      "x-sampa": "vAz"
    },
    {
      "lang": "en-AU",
      "ipa" : "vɐːz",
      "x-sampa": "v6:z"
    }
  ]
}

In these cases, the lang specifies the pronunciation language rather than the document language. A British user reading an ePub that specifies en-US will probably prefer that words be pronounced the British way and in fact may have lower comprehension if they are pronounced the American way.

However, there are cases, such as poetry where rhymes and near-rhymes are important, in which the (X)HTML author should be able to specify that a particular regional variation of the language be used.
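
Since a speechpart holds either a plain phoneme or a regional variation (as noted in the speechpart section), the two mechanisms should compose. A sketch of what that nesting might look like, with illustrative ipa-only pronunciations for the grapheme record:

{
  "grapheme": "record",
  "speechpart": {
    "noun": {
      "ipa": "ˈɹɛkɚd"
    },
    "verb": {
      "languages": [
        { "lang": "en-US", "ipa": "ɹɪˈkɔɹd" },
        { "lang": "en-GB", "ipa": "ɹɪˈkɔːd" }
      ]
    }
  }
}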

Context Dictionary Use Cases

In some cases, such as taxonomy names and proper names, the correct way to pronounce a word may differ from the way the same grapheme is ordinarily pronounced.

The (X)HTML author should be able to define context dictionaries for these special cases and use an attribute on a span or other element around the string that alerts the text to speech synthesizer to look in the specified context dictionary for the pronunciation before looking in the default context dictionary. The naming of these dictionaries should be up to the author.
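
A sketch of the intended lookup order, using the proper context dictionary and the speech-context attribute proposed below (the sentence is my own):

<!-- "Job" is found in the proper dictionary; "Walters" is not in proper, so
     the synthesizer falls back to default, then to its own algorithms. -->
<p><span speech-context="proper">Job Walters</span> gave the keynote.</p>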

(X)HTML Attributes

The written language should be detected from the language specified in the <dc:language></dc:language> element of the ePub OPF file, with the ability to override that language within an XHTML document via the lang attribute, as one might do for a bibliography entry for a work written in a different language.
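
That is, the starting point is the language declaration already required in the package document, something like:

<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:language>en-US</dc:language>
</metadata>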

At least in English and most languages I am familiar with, words are delimited by white-space. I have not yet worked out how to specify that a sub-string including a space is a single grapheme the TTS synthesizer should look up in the library, but it would be an attribute on a parent span (or similar) node, probably a boolean attribute (the kind represented without a value in HTML, but given an arbitrary value in XML to indicate true); a sketch follows this paragraph.

I understand that in Australia, a root beer float is called a spider. Things like that could, at the discretion of the (X)HTML author, be accommodated by specifying root beer float as a grapheme. That is probably a very poor example, but there are other cases where those of us who are not print-disabled see a string yet read it in our minds as words other than what is printed, especially strings that involve an abbreviation.
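
A sketch of what that might look like, using a hypothetical boolean attribute I will call speech-grapheme for now:

<!-- speech-grapheme is a placeholder name; it marks the whole span as a
     single grapheme to be looked up in the library. -->
<p>She ordered a <span speech-grapheme="speech-grapheme">root beer float</span>.</p>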

For the other attributes...

If it were up to me, I would create speech- attributes that the TTS synthesizer could trigger off of.

For specifying the speechpart, something like:

<p>Remember to <span speech-part="verb">wind</span> your watch once a week.</p>

For specifying the spoken language to be used when it is critical that a particular regional pronunciation be used:

<p>Tim reverted to his British roots when he started rhyming about the cool vase
he found in the woods, the perfect Mother’s Day gift he otherwise could not afford:</p>
<p speech-region="en-GB">“The vase, so boss, was buried in the moss.”</p>

Note that the speech-region attribute should trigger the text to speech synthesizer to use a pronunciation algorithm for the specified region, even for graphemes that are not in the JSON Pronunciation Library. In that example, even if the speech synthesizer only had an algorithm for American English, it would still read the grapheme vase correctly in the rhyme, because the library provides an en-GB phoneme for it.

For specifying the context dictionary, something like:

<p>According to <abbr>Dr.</abbr> <span speech-context="proper">Job</span> Walters...</p>

Please, let's make this happen. For the present, even though no one is implementing them, I will use PLS and SSML, but those systems have limitations that could easily be solved by this kind of pronunciation library.

Thank you for your time.

Attachment: pronunciation.json.txt

mattgarrish commented 4 years ago

Where you say:

Unfortunately neither solution allows for regional pronunciation variations.

This isn't strictly true for PLS; you just can't do it in a single file. Regional variations can be provided by designating the language of the lexicon on the link declaration:

<link rel="pronunciation" type="application/pls+xml" hreflang="en-us" href="en-us.pls"/>
<link rel="pronunciation" type="application/pls+xml" hreflang="en-gb" href="en-gb.pls"/>

But the obvious fact remains that there hasn't been any appreciable uptake of these technologies, and multiple files is arguably cumbersome, so it's kind of a moot point. :)

Have you had a look at the WAI Pronunciation work, though? They're working on a solution for web-based content, so it may be a more appropriate place to take your proposal.

pipfrosch commented 4 years ago

I'll take it to that group, but using hreflang doesn't let users get the pronunciation for the region they prefer in cases where the specific pronunciation doesn't matter to rhythm or rhyme, which can be important to comprehension.

EDIT I am going to try to write the proposal a little more clearly and repost it at their GitHub; that definitely looks like the right place.

llemeurfr commented 4 years ago

That's an interesting proposal, @pipfrosch, thanks for that. Because Readium reading toolkits use the TTS features offered by the OS via a browser API (Chromium on PC/Mac/Linux, Chrome on Android, WebKit on iOS), there is nothing Readium can do if a feature isn't available in the underlying OS and browser.

I didn't develop the (quite simple) TTS feature available in the Readium Mobile Android toolkit, but I had a quick look at the [Chrome TTS API](https://developer.chrome.com/apps/tts) and the [Android TTS engine](https://developer.android.com/reference/android/speech/tts/TextToSpeech) to get an overview of what is available on Android. It is pretty limited so far: SSML is supported by the API, but I didn't see anything related to lexicons. I encourage you to look at such APIs when you make your proposal at the WAI level, so that the discussion with TTS API and engine developers can be fruitful.

murata2makoto commented 3 years ago

SSML as specified in EPUB 3 is used in Japan: Lentrance Reader supports it, and Tokyo Shoseki (the biggest textbook publisher in Japan) uses it. There was a government project for the promotion of SSML; here is one of its reports (in Japanese). I am sure that I can find more.