w3c / epub-specs

Shared workspace for EPUB 3 specifications.

Language tag clarification needed #1439

Closed ThomasR128 closed 3 years ago

ThomasR128 commented 3 years ago

Suppose I have an ePub in German with some Ancient Greek in it, both in the Greek alphabet and transcribed. As I understand the accessibility guidelines, Αἰσχύλος should be declared as xml:lang="grc". Is this also the case for transcriptions (Aischylos)? I'm asking because I'm afraid if I don't, a reading system would pronounce the "sch" the German way [ʃ] when it should be [sç] – but on the other hand wouldn't a reading system expect xml:lang="grc" to always be written in the Greek alphabet?

iherman commented 3 years ago

I wonder whether grc-Latn would be the right answer. The tool of @r12a doesn't complain, but I don't know whether that makes it valid. @r12a would certainly know...

GeorgeKerscher commented 3 years ago

Hi,

Transcribed? Translated? Or Transliterated? I would assume the translation would need the lang for the language it was translated into. Same goes for transliteration, I believe.

Best

George

mattgarrish commented 3 years ago

This is also where SSML and PLS lexicons are supposed to save the day but don't. Adding the correct pronunciation would spare a TTS engine from guessing how to pronounce the name using local dialect rules. They're notorious for mangling names and all other manner of proper nouns.

Unfortunately, neither technology is supported in reading systems.

ThomasR128 commented 3 years ago

@GeorgeKerscher – Great comment. Let me elaborate.

Transliteration (unambiguous character mapping) often uses all kinds of “funny” diacritics, usually leaves “the uninitiated” without a clue and can be a pain in the eyes. TTS however should be fairly easy to implement via a simple look-up table. Tagging would certainly be necessary, but how? @iherman ’s idea might be worth a try.

Translation is a non-issue since we’re already in the target language, thus no tagging required.

Well-done orthographic transcription tries to achieve some sort of “good enough” pronunciation in the target language. Пу́шкин would become Pushkin in English, and Pouchkine in French. TTS would cope with that, so no tagging needed, as I see it.

Bad orthographic transcription, i.e. a mix of transcription and transliteration (e.g. Greeklish) should get ironed out between an author and their copy editor. The above example is from a 19th century German book I’m re-editing, so I’d simply correct these, mention it in an edition notice and be done with it. Again, no tagging needed.

I was referring to different pronunciations of a given combination of letters. The “well, you just have to know it” cases. They occur in the publication’s natural language (e.g. record as noun / verb) – in English probably more often than in many other languages. They also occur in transcriptions, as in the above Greek-to-German example: sch is pronounced as [sç] and not as [ʃ].
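
To make the sch case concrete, here is a minimal Python sketch of how a pronunciation hint could depend on the language tag. It is purely illustrative: the table `SCH_BY_LANG` and the function `sch_pronunciation` are hypothetical names, only the two sch values come from this thread, and nothing here claims that any real TTS engine or reading system works this way.

```python
# Hypothetical pronunciation hints for the letter group "sch", keyed by
# lower-cased language tag. Only these two values come from the discussion.
SCH_BY_LANG = {
    "de": "ʃ",         # ordinary German reading of "sch"
    "grc-latn": "sç",  # Ancient Greek transcribed into Latin script
}


def sch_pronunciation(lang_tag, default_lang="de"):
    """Return the IPA hint for "sch" for the given language tag.

    Tries the full tag first, then progressively shorter prefixes
    (e.g. "de-AT" falls back to "de"), then the publication default.
    """
    parts = lang_tag.lower().split("-")
    while parts:
        key = "-".join(parts)
        if key in SCH_BY_LANG:
            return SCH_BY_LANG[key]
        parts.pop()  # drop the last subtag and retry
    return SCH_BY_LANG[default_lang]
```

With this table, an untagged Aischylos inherits the German reading from the publication language, while the same string tagged grc-Latn gets the other one. That is the whole reason the tagging question matters.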

As @mattgarrish says, SSML and PLS are but a nice concept. A hack for the above example might be a hyphen in a display:none span… (yuk). A phonetic tag that uses the IPA would be nice…

So again, given current limitations, what would be the tagging best-practice for these cases, both for natural language and transcriptions?

mattgarrish commented 3 years ago

Any solution we might come up with here is just going to be a hack, I suspect. If the TTS engine doesn't include the name in its internal pronunciation library, I don't believe there's anything much you can do at this time. Changing the language of the translated name probably only makes it come out slightly less weird at best, as there's no guarantee TTS engines will support more than a few basic languages anyway.

This problem isn't unique to German, either, as a translation to "Aeschylus" for English will produce the same issue. This was why we pre-rendered text-to-speech where I used to work, as we could hack the rendering going into the TTS engine without affecting the actual text of the publication that was distributed to users (this was for daisy text/audio books, though). That was the only reliable way we found to work around this issue.

The problem with altering the text of the publication that goes out to users is that it could affect the rendering in braille displays, for example, so in solving one problem you may create another. The solutions may also fall apart if the text ends up in a reading system with little or no CSS support.

This is why the work of the WAI pronunciation task force is so important, as we need to solve this problem.

ThomasR128 commented 3 years ago

@mattgarrish Thanks for the insights. Maybe I was aiming too high. Proper pronunciation is WCAG level AAA, after all.

r12a commented 3 years ago

Here are some suggestions:

Αἰσχύλος would be tagged xml:lang="...". Depending on the content you are actually working with, the ... could be grc rather than grk, or even el-monoton (looks to me as if not polytonic). See the various alternatives at https://r12a.github.io/app-subtags/?find=greek

The Aeschylus of @mattgarrish looks to me like English, so i'd mark that as xml:lang="en", however if the example @ThomasR128 used in the original question (Aischylos) is actually a transcription rather than a German word, then the language tag is probably the one you prefer just above plus -Latn (for argument's sake, let's choose grc: then it would be grc-Latn).
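
As a rough illustration of the two tagging suggestions above, the following Python sketch builds the corresponding inline markup with the standard library. The helper name `tagged_span` is hypothetical, and emitting both `xml:lang` and `lang` on the same element is an assumption about XHTML content documents, not something stated in this thread.

```python
import xml.etree.ElementTree as ET

# The reserved XML namespace; ElementTree serializes it with the "xml" prefix.
XML_NS = "http://www.w3.org/XML/1998/namespace"


def tagged_span(text, lang):
    """Return a <span> carrying the given language tag (illustrative helper)."""
    span = ET.Element("span", {f"{{{XML_NS}}}lang": lang, "lang": lang})
    span.text = text
    return ET.tostring(span, encoding="unicode")


# Greek-script original and Latin-script transcription, tagged as suggested.
greek = tagged_span("Αἰσχύλος", "grc")
latin = tagged_span("Aischylos", "grc-Latn")
```

Whether a reading system does anything useful with grc-Latn is a separate question; the markup only records the author's intent.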

If you really wanted to get more clever about this, without the hacks, i think there are two possible approaches:

  1. use the -t extension to BCP47, use a transcription method that the reading system recognises, and label it that way. For example, you might end up with xml:lang="grc-Latn-t-grc-m0-LOC", which means 'ancient Greek written in the Latin script using the Library of Congress standard transcription'. Of course, this requires the reader to know how to handle the particular transcription method you cite, which may be a tall ask, given the number of transcription methods times the number of languages out there. A lot of stars need to line up for that to work.

  2. use IPA, label the content as xml:lang="grc-Latn-fonipa". This also requires the reader to recognise that this is IPA and to know what to do with it, but at least it has the advantage of being a single transcription that works for all languages.
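
To show what a reading system would have to pick apart in the two tags above, here is a simplified Python sketch that splits a language tag into its script, variants, and -t extension fields. The function `parse_tag` is a hypothetical helper, not a full BCP 47 parser: it ignores extlang, grandfathered and private-use tags, and it does not validate subtags against the registry.

```python
import re


def parse_tag(tag):
    """Split a language tag into a few fields (simplified, unvalidated)."""
    subtags = tag.split("-")
    result = {"language": subtags[0].lower(), "script": None, "region": None,
              "variants": [], "t_source": None, "t_fields": {}}
    i = 1
    # Script: exactly four letters, e.g. Latn.
    if i < len(subtags) and re.fullmatch(r"[A-Za-z]{4}", subtags[i]):
        result["script"] = subtags[i].title()
        i += 1
    # Region: two letters or three digits.
    if i < len(subtags) and re.fullmatch(r"[A-Za-z]{2}|\d{3}", subtags[i]):
        result["region"] = subtags[i].upper()
        i += 1
    # Variants: 5-8 alphanumerics, or 4 starting with a digit (e.g. fonipa).
    while i < len(subtags) and re.fullmatch(
            r"[A-Za-z0-9]{5,8}|\d[A-Za-z0-9]{3}", subtags[i]):
        result["variants"].append(subtags[i].lower())
        i += 1
    # -t extension: optional source language, then key/value fields (e.g. m0).
    if i < len(subtags) and subtags[i].lower() == "t":
        i += 1
        source = []
        while (i < len(subtags) and len(subtags[i]) > 1
               and not re.fullmatch(r"[A-Za-z]\d", subtags[i])):
            source.append(subtags[i].lower())
            i += 1
        result["t_source"] = "-".join(source) or None
        key = None
        while i < len(subtags) and len(subtags[i]) > 1:
            subtag = subtags[i].lower()
            if re.fullmatch(r"[a-z]\d", subtag):
                key = subtag  # a new field key such as m0
                result["t_fields"][key] = []
            elif key is not None:
                result["t_fields"][key].append(subtag)
            i += 1
    return result
```

For grc-Latn-t-grc-m0-LOC this yields script Latn, source language grc and a single m0 field; for grc-Latn-fonipa it yields the fonipa variant. Extracting the fields is the easy part. Knowing what to do with a given transcription mechanism is where, as noted above, a lot of stars need to line up.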

<rant> Btw, with regards to using IPA for ordinary people, let me say that it's not so hard to get used to, especially when you're only dealing with a small number of languages, and once you are it can be very effective. I work with transcriptions pretty much every day, and for a large number of languages, and i can tell you that trying to replicate the sounds of Uighur or Arabic using English letters for English readers is not helpful at all. Not to mention the fact that many letters or groups of letters can be quite misleading, depending on the accent or origin of the reader – i constantly have to ask myself whether the person who wrote the transcription is American or British or Indian, etc., to guess what they mean when they try to provide 'English' equivalents, especially for vowels. Then there are the ASCII letters that keep popping up in many different guises: in Old English 'c' sounds like 'k', in Slavic languages it typically represents 't͡s', in other languages it stands for 't͡ʃ', etc. and in transcriptions it is often used for palatal stops, or for affricates, so when you see it in a transcription you're never quite sure how to treat it. Then there's the fact that different sources use very different transcriptions, which often overlap adding to the confusion. This from RFC6497: "Gaddafi" is commonly transliterated from Arabic to English as any of (G/Q/K/Kh)a(d/dh/dd/dhdh/th/zz)af(i/y)."

Basically, it's a mess. And ironically it's caused by people thinking that they are making phonetic transcriptions more user friendly. Just this weekend i was trying to figure out pronunciations for the letters of the Bamum syllabary (used for a language in Cameroon) from sources that all used slightly different transcriptions and that weren't fully IPA based. I wrote my conclusions up using IPA to try to cast some light for others, but it took me hours to figure it out.

So i'm actually a big fan of IPA. It may look scary at first to the layman, but it's actually really simple, easy to learn, and extremely effective. And having learnt it, you don't need to learn something new when you come to the next language or example. </rant>

ThomasR128 commented 3 years ago

@r12a Wow, that was the answer I was hoping for, thanks. So I'll be going for grc-latn where necessary, since the grc pronunciation taught in Germany is very much molded after ordinary German, and it's mainly the odd sch and eu that differ depending on where in the word they occur.

For the nerds among us, Αἰσχύλος I'd say is grc (as per the original post), Aeschylus is (at least in German) the latinised version, thus la, and Aischylos is a rather garbled grc-to-de transcription (for the [ɛː] I'd prefer Ä over Ai, but that's just me)…

<rant>At least in Mr. G's case the a and af are agreed upon… 3 out of 7 is not too bad, is it?</rant>

I very much like suggestion 2 since this is exactly what IPA was created for, and every decent online dictionary (and often Wikipedia) has it ready for copy-and-paste into anyone's preferred editor. Like, ['aɪ̯sçʏlɔs]. An ipa attribute might be even more appropriate than using xml:lang since IPA is not really a language.

And I'll leave my doubts on how to coerce a TTS into correctly pronouncing regnal numbers in various languages for another time…

iherman commented 3 years ago

Thanks @r12a