i18n of Unicode names - Githubissues

pkra commented 8 years ago

As discussed recently in a dev meeting, for i18n of the speechruleengine, we need translations of many Unicode names.

We've reached out to the Unicode, MathML, and TeX communities, and there seem to be no existing efforts towards localizing Unicode names.

For some locales, there are resources we might be able to tap into. E.g., for German, de.Wiki has pages for math operators (mathOps) and math symbols, and Rainer Seitel's site includes many references to German standards like DIN that are very useful. There are also some CJK efforts that @zorkow knows of from back when he was working on ChromeVox.

This thread should help discuss various approaches. To be clear: SRE currently has no localization, so this is very early from a practical / implementation point of view. But I think it's worthwhile to start now.

@siebrand @Nikerabbit has this ever come up on TranslateWiki.net? Is this kind of effort possibly of interest to its community? It could improve different Wikipedias and via the speechruleengine feed back into ChromeVox (once they start working on i18n), possibly even NVDA.

Nikerabbit commented 8 years ago

What do you mean by Unicode names? Do you mean the Unicode blocks and/or Unicode glyph names?

pkra commented 8 years ago

> Do you mean the Unicode blocks and/or Unicode glyph names?

We are thinking of the glyph names, e.g., "for all" in U+2200 (8704) | ∀ | FOR ALL | Allquantor (from the mathOps/deWiki link above).

Maybe some background is useful. Screenreaders don't voice most Unicode characters, so tools that generate speech-text (like SRE) "spell out" the Unicode name (i.e., replace ∀ with "for all"). Generating that data is easy for English, obviously, but for lack of translations it is not possible for other languages.
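To make the "spell out" step concrete, here is a minimal sketch in Python. It is not how SRE is implemented; it only illustrates the idea using the standard `unicodedata` module. The `LOCALIZED_NAMES` table and the functions `spoken_name` and `spell_out` are hypothetical names made up for this example, and the only translation in it is the German one quoted above.

```python
import unicodedata

# Hypothetical per-locale translation table (an assumption for illustration).
# The single German entry is the one mentioned in this thread.
LOCALIZED_NAMES = {
    "de": {0x2200: "Allquantor"},
}


def spoken_name(char: str, locale: str = "en") -> str:
    """Return a speakable name for one character in the given locale."""
    override = LOCALIZED_NAMES.get(locale, {}).get(ord(char))
    if override is not None:
        return override
    # No translation available: fall back to the English Unicode name,
    # or to the character itself if it has no name.
    return unicodedata.name(char, char).lower()


def spell_out(text: str, locale: str = "en") -> str:
    """Replace every non-ASCII character in text with its spoken name."""
    parts = []
    for ch in text:
        if ch.isascii():
            parts.append(ch)
        else:
            parts.append(f" {spoken_name(ch, locale)} ")
    # Collapse the extra whitespace introduced around replaced characters.
    return " ".join("".join(parts).split())


print(spell_out("∀x"))        # English Unicode name
print(spell_out("∀x", "de"))  # German entry from the table
print(spell_out("∀x", "fr"))  # no French data, falls back to English
```

The fallback in the last call is the gap this issue is about: without translated names, every locale other than English ends up with the English Unicode name.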

See also https://phabricator.wikimedia.org/T120184 about the new math extension for MediaWiki/Wikipedia which would be interested in using SRE's output to enhance image rendering.

Nikerabbit commented 8 years ago

Is the aim for natural (for some definition of it) synthesized speech or just translating literally from the English Unicode glyph names? I would imagine that the translations would differ for the two different aims.

pkra commented 8 years ago

> Is the aim for natural (for some definition of it) synthesized speech or just translating literally from the English Unicode glyph names?

Hard to say at this point. I'd imagine both. Cf. the example above: for all vs Allquantor (universal quantifier).

The other German link (Seitel's page) shows the other problem/opportunity: aligning with standards in other languages.

> I would imagine that the translations would differ for the two different aims

I think so, too.

Nikerabbit commented 8 years ago

For this to be considered in translatewiki.net, you should clearly define the aim (as per above) and the scope (all glyphs? only math? etc.), and figure out whether existing sources can be incorporated license-wise.

nemobis commented 8 years ago

> glyph names, e.g., "for all" in U+2200 (8704) | ∀ | FOR ALL | Allquantor

This is something that should be proposed to CLDR, IMO (with all the details Nikerabbit mentioned above). http://unicode.org/cldr/trac/newticket

The translations could then be used via any of the libraries which incorporate CLDR data, like all the ICU libraries. We do the same for language names, which are used via the "cldr" MediaWiki extension.

mathjax / MathJax-i18n

i18n of Unicode names #12