interscript / interscript-ruby

Interoperable script conversion systems (ISCS) with the `interscript` gem

Extract "spelling system" code for each system for ISO 24229 #541

Open ronaldtse opened 3 years ago

ronaldtse commented 3 years ago

Each Interscript map converts from one spelling system to another spelling system.

Originally, we assumed that these maps convert from script to script (as in ISO 15924 script codes), e.g. Arab to Latn, Cyrl to Latn. However, this assumption breaks down when we have systems that are Latn to Latn (e.g. a table of correspondences).

In addition, the script codes fail to recognize that Arab is a generic script class: it says nothing about the characters used in a particular language. For example, Farsi, standard Arabic and Urdu each use some different Arabic characters and are incompatible with each other. Similarly, the Cyrillic character sets differ by language; there are Cyrillic characters that are used by only a single language.

If we introduce the concept of stages, we (or rather, a computer) will be unable to keep track of "what sort of Cyrl" the output is.

There are 3 use cases we see:

  1. The Latn used by Estonian before the spelling reform should be differentiated from the Latn used after it. Similarly, Latn used by German (with diacritics and the SS) should be differentiated from Latn used by English.

  2. Some systems, such as sac-zho-Hans-Latn-1979 and sasm-mon-Mong-Latn-general-1978, produce Latn of the pinyin system. DIN systems produce output in German orthography.

  3. Some spelling systems are not meant to be pronounced, such as the output of ALA-LC systems, which can use obscure diacritics (that hardly anyone knows how to read correctly) to preserve information for reverse transliteration.
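
To illustrate use case 2, here is a minimal Ruby sketch that splits the two map codes quoted above into their parts. The field layout (authority, language, source script, target script, then a free-form identifier) is an assumption read off those two codes, and `SystemCode` and `parse_system_code` are hypothetical names, not part of the `interscript` gem.

```ruby
# Hypothetical helper, not part of the interscript gem.
# Assumed layout: authority-language-source_script-target_script-identifier
SystemCode = Struct.new(:authority, :language, :source_script, :target_script, :id)

def parse_system_code(code)
  authority, language, source, target, *rest = code.split("-")
  # The identifier may itself contain hyphens (e.g. "general-1978").
  SystemCode.new(authority, language, source, target, rest.join("-"))
end

codes = %w[sac-zho-Hans-Latn-1979 sasm-mon-Mong-Latn-general-1978]
parsed = codes.map { |code| parse_system_code(code) }

# Both maps report the same ISO 15924 target script, so the script code
# alone cannot say that the output follows the pinyin spelling system.
target_scripts = parsed.map(&:target_script).uniq
puts target_scripts.inspect
```

The only output information a script code carries here is `Latn`, which is the same value a map producing German or English orthography would carry.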

ISO 24229 will be updated to include the concept of "spelling systems", where each spelling system has a specified encoding. The scope of these encodings is limited to what interoperable processing needs (e.g. transliteration: if a system is not used in transliteration, there is no need to encode it).

This task is to assign a "spelling system code" for each Interscript map's input and output.
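
One possible shape for the result of this task, sketched in Ruby: a table that records an input and an output spelling system code per map. Every spelling system code below (`x-zho-hans`, `x-mon-mong`, `x-pinyin`) is a made-up placeholder, since the actual codes have not been defined, and `SPELLING_SYSTEMS`, `spelling_systems_for` and `chainable?` are hypothetical names rather than existing `interscript` API.

```ruby
# Placeholder assignments only: the "x-..." codes are invented for
# illustration and are not defined by ISO 24229.
SPELLING_SYSTEMS = {
  "sac-zho-Hans-Latn-1979"          => { input: "x-zho-hans", output: "x-pinyin" },
  "sasm-mon-Mong-Latn-general-1978" => { input: "x-mon-mong", output: "x-pinyin" },
}.freeze

# Look up the input/output spelling systems assigned to a map.
def spelling_systems_for(map_code)
  SPELLING_SYSTEMS.fetch(map_code) do
    raise KeyError, "no spelling system codes assigned to #{map_code}"
  end
end

# Two maps can run as consecutive stages only if the first map's output
# spelling system is the second map's input spelling system.
def chainable?(first_map, second_map)
  spelling_systems_for(first_map)[:output] == spelling_systems_for(second_map)[:input]
end

puts spelling_systems_for("sac-zho-Hans-Latn-1979")[:output]
puts chainable?("sac-zho-Hans-Latn-1979", "sasm-mon-Mong-Latn-general-1978")
```

With explicit codes on both sides of each map, a computer can check stage compatibility by comparing codes instead of guessing "what sort of Latn" a stage emits.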

ronaldtse commented 3 years ago

Even if two spelling systems have identical character sets, they may still need to be encoded as separate systems (e.g. English vs pinyin).
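
A small Ruby sketch of this point: identity comes from the spelling system code, not from the character set. The codes `x-english` and `x-pinyin-toneless` are hypothetical, and treating both systems as exactly the letters a to z is a deliberate simplification for illustration.

```ruby
require "set"

# Hypothetical model: a spelling system is identified by its code;
# the character set is only one of its attributes.
SpellingSystem = Struct.new(:code, :characters)

# Simplifying assumption: both systems use exactly the letters a-z.
basic_latin = ("a".."z").to_set

english = SpellingSystem.new("x-english", basic_latin)
pinyin  = SpellingSystem.new("x-pinyin-toneless", basic_latin)

same_characters = english.characters == pinyin.characters
same_system     = english.code == pinyin.code

puts "same characters: #{same_characters}, same system: #{same_system}"
```

Comparing character sets would wrongly treat the two systems as interchangeable; comparing codes keeps them apart.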

ronaldtse commented 3 years ago

Omniglot specifies 290 writing systems: https://omniglot.com/writing/

So the output set is limited.