Support material for Hangul filler

lifthrasiir commented 8 years ago

This is feedback on the Hangul filler section. It was originally posted on Reddit, but I guess you can alter the wording to incorporate the following:

U+3164 HANGUL FILLER is one of the stupidest choices made by character sets. Hangul is noted for its algorithmic construction, and Hangul charsets should ideally follow that. Unfortunately, the predominant methods for multibyte encoding were ISO 2022 and EUC, and both required a rather small repertoire of 94 × 94 = 8,836 characters [1], far fewer than the required 19 × 21 × 28 = 11,172 modern syllables.
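The arithmetic and the "algorithmic construction" can be sketched as follows. This is only an illustration: it borrows the index-to-code-point layout that Unicode uses for its precomposed syllables (starting at U+AC00), and the names `compose` and `decompose` are made up for the example.

```python
# Sizes quoted above.
ISO2022_CELLS = 94 * 94          # 8,836 positions in a 94 x 94 set
MODERN_SYLLABLES = 19 * 21 * 28  # 11,172 = initials x medials x (27 finals + "no final")

HANGUL_BASE = 0xAC00  # first precomposed syllable in Unicode


def compose(initial, medial, final=0):
    """Map (initial, medial, final) indices to a precomposed syllable."""
    return chr(HANGUL_BASE + (initial * 21 + medial) * 28 + final)


def decompose(syllable):
    """Inverse of compose(): recover the three indices from a syllable."""
    index = ord(syllable) - HANGUL_BASE
    return index // (21 * 28), index // 28 % 21, index % 28


print(ISO2022_CELLS, MODERN_SYLLABLES)  # 8836 11172
print(compose(0, 0))                    # first syllable, U+AC00
print(decompose("\uBDC1"))              # indices of U+BDC1 (the ㅂ + ㅞ + ㄺ syllable)
```

Because every syllable is just three small indices, a charset only needs the jamo tables plus this formula, which is why a flat 94 × 94 table is such a poor fit.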

The initial KS X 1001 charset, therefore, only contained 2,350 frequent syllables (plus 4,888 Chinese characters with some duplicates, themselves becoming another Unicode headache). Besides the fact that the remaining syllables were NOT supported, this resulted in a significant complexity burden for every piece of Hangul-supporting software, and there was confusion and contention between KS X 1001 and the less interoperable "compositional" (johab) encodings before Unicode. The standardization committee later acknowledged the charset's shortcoming, but only by adding four-letter (thus eight-byte) ad hoc combinations for all remaining syllables! The Hangul filler is a designator for such combinations, e.g. <filler>ㄱㅏ<filler> denotes 가 and <filler>ㅂㅞㄺ denotes 뷁 (not in KS X 1001 per se).

The Hangul filler arrived so late on the scene that it had virtually no support from the software industry. Sadly, the filler was there and Unicode had to accept it; technically it can be used to designate a letter (even though Unicode does not support the combinations), so the filler itself has to be considered a letter as well. What, the, hell.
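A quick way to see the "filler is a letter" consequence is to ask Python's standard `unicodedata` module. This is a small check rather than anything authoritative, and the exact data depends on the Unicode version bundled with your Python.

```python
import unicodedata

FILLER = "\u3164"

print(unicodedata.name(FILLER))            # HANGUL FILLER
print(unicodedata.category(FILLER))        # Lo (Letter, other)
print(FILLER.isalpha(), FILLER.isspace())  # counts as a letter, not as whitespace

# <filler>ㄱㅏ<filler> written with compatibility jamos: Unicode gives the
# sequence no special meaning, so canonical normalization leaves it alone.
make_up = FILLER + "\u3131\u314F" + FILLER
print(unicodedata.normalize("NFC", make_up) == make_up)
```

So the character renders as nothing, yet passes every "is this a letter?" test.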

[1] It is technically possible to use 94 × 94 × 94 = 830,584 characters with a three-byte encoding, but as far as I know no such charset was ever designed (and thus there is no real support either).

(Feel free to ask me about Hangul and more generally CJK support in Unicode.)

jagracey commented 8 years ago

This was really insightful! Thanks so much. I left CJK out because I personally have zero understanding of anything CJK, despite reading the reports, CodePoints.net, and Wikipedia.

Awesome-Unicode deserves a dedicated section on CJK in addition to an explanation of why the Hangul Filler is a thing (which you've done quite well).

Would you be able to contribute a section on CJK?

lifthrasiir commented 8 years ago

@jagracey I think I'm a terrible writer :P (even in my native tongue). I can list some interesting bits about CJK and Unicode, however:

Han unification, a large-scale merger of historically equivalent variants of Chinese characters. The problem is that historical equivalence does not mean actual equivalence; some names, for example, are always written in a particular variant. I think (I haven't verified this) it has something to do with the limited size of the BMP (these were among the first characters ever encoded in Unicode, when it was still confined to 16 bits). Of course there are tons of exceptions, mainly because many such characters were not unified in the source charsets, plus subsequent un-unifications. It is believed that variation selectors and the Ideographic Variation Database will finally solve this long-standing problem, though implementations (both fonts and software) are not yet widespread enough.
Hangul filler being an invisible "ordinary" letter, explained above.
Duplicate Chinese characters. There are several classes: some are plain bugs (U+FA0C and U+FA0D from Big5), some are just stupid (KS X 1001, mentioned above, has hundreds of duplicate characters differing only in their reading, which is very unreliable in practice). The Kangxi radicals are duplicated as well.
Chinese characters of unknown origin, most significantly those from JIS X 0208. One of them (彁) is doubly unknown, in that even how it got garbled is unknown.
Compatibility symbols. Very, very, very useful for optimizing tweets because CJK charsets traditionally had lots of "squared" letters (e.g. U+3392 encodes "MHz" in one character). Obviously there are also mistakes; Ken Lunde has described one particular example where the squared unit or abbreviation symbol does not correspond to anything in use due to a typo.
Normalization fail. Hangul syllables have two canonical forms in Unicode, one composed (e.g. U+AC00 가) and one decomposed (e.g. U+1100 ㄱ + U+1161 ㅏ). They are equivalent after normalization, but there is a significant flaw in the algorithm: an archaic Hangul syllable that has no single precomposed letter gets partially composed (e.g. U+1103 U+1172 U+11F0 듀ᇰ is archaic due to the final component, and the first two components are combined into U+B4C0 듀, resulting in two letters). As the Unicode algorithm cannot be corrected now, the result is a separate Hangul-friendly normalization algorithm (KS X 1026-1) for domestic use.
"Korean mess" (yes, it has its own term): Hangul is the only script in the entire Unicode history that has been moved. Hangul was not compositionally encoded before Unicode 2.0, and was located at U+3400..4DFF. As it became increasingly clear that the entirety of modern Hangul would eventually be encoded anyway, in no particular order, there were lengthy discussions and eventually Hangul got relocated to U+AC00..D7A3, this time with the algorithmic mapping. This is also why the primary CJK ideograph block starts at U+4E00, right after the original Hangul blocks. (A historical note: according to multiple accounts, the entire process was almost a split vote and confusing enough.)
We are missing Vietnam from the name "CJK (unified) ideograph", even though Vietnam had used Chinese characters in the past. Yes, we cannot change that now.
Did you know that GB 18030, a Chinese character encoding, is actually a pan-Unicode encoding similar to UTF-8? (Designed to be compatible with the legacy encoding GBK, of course.)
And of course, emoji are of Japanese origin. It is all the fault of three Japanese mobile carriers; blame them.

Probably more, but I cannot recall others right now.
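A few of the items above (the partial composition of archaic syllables, the "squared" compatibility symbols, and GB 18030 covering all of Unicode) can be poked at with Python's standard library. Treat this as a rough sketch: `codepoints` is a helper made up for the example, and the `gb18030` and `gbk` codecs are the ones shipped with CPython.

```python
import unicodedata


def codepoints(text):
    """Render a string as a list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


# Normalization: a modern syllable composes fully, an archaic one only partly.
modern = "\u1100\u1161"          # initial + medial
archaic = "\u1103\u1172\u11F0"   # initial + medial + archaic final
print(codepoints(unicodedata.normalize("NFC", modern)))   # U+AC00
print(codepoints(unicodedata.normalize("NFC", archaic)))  # U+B4C0 U+11F0

# Compatibility symbols: one "squared" character expands to three letters.
print(unicodedata.normalize("NFKC", "\u3392"))  # MHz

# GB 18030: every sample round-trips, and GBK-era characters keep their bytes.
for sample in ("A", "\u4E2D", "\uAC00", "\U0001F600"):
    encoded = sample.encode("gb18030")
    assert encoded.decode("gb18030") == sample
    print(codepoints(sample), len(encoded), "byte(s)")
print("\u4E2D".encode("gb18030") == "\u4E2D".encode("gbk"))
```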

nexusanphans commented 5 years ago

@lifthrasiir so, how many methods are available for typing Hangul? I only know two: precomposed syllables vs jamos.

lifthrasiir commented 5 years ago

@nexusanphans You cannot type precomposed syllables directly (there are 11,172 modern ones); it takes multiple keystrokes to complete one syllable. Hangul is simple enough not to require the dictionary-based complex IME necessary for Japanese or Chinese, but complex enough to allow many clever ideas for optimization. I'm aware of at least 30, and the most popular ones include:

A bipartite (두벌식) method, standardized as KS X 5002, has two sets of jamos (consonants and vowels) and composes a syllable as far as possible (e.g. "ㄷㅜㅂㅓㄹㅅㅣㄱ" -> "ㄷㅜ/ㅂㅓㄹ/ㅅㅣㄱ" -> "두벌식"). It is possible that a consonant is temporarily placed in the wrong syllable (the so-called jack-o'-lantern symptom [도깨비불 현상]). A variant of this layout that reduces the number of required keys is also in use on mobile phones.
Tripartite methods have three sets of jamos (initial, medial and final) and guarantee that a syllable is completed as soon as the final jamo is typed. They are slightly more complex because you now need to assign about 50% more keys, but this can be alleviated by moving infrequent jamos to inconvenient places (the first row, or positions that require shifting). There are multiple incarnations of tripartite methods, and essentially all modern OSes support 3-91 (also called tripartite final, "세벌식 최종") and 3-90 by default.
There are multiple competing mobile keyboards derived from the 3 by 4 keypad layout. Most make use of the fact that Hangul jamos were designed in groups (ㄱ -> ㅋ, ㄴ -> ㄷ -> ㅌ or ㄹ, etc.). Among them, Cheonjiin (천지인) and Naratgeul (나랏글) were standardized ad hoc through mandatory cross-licensing in the 2000s.
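To make the bipartite behaviour concrete, here is a deliberately simplified composer. It is a sketch under loud assumptions: it is not KS X 5002 or any real IME, it ignores compound vowels and compound finals (ㅘ, ㄺ and friends typed as two keys), and `compose_bipartite` is a name invented for this example. It does reproduce the "두벌식" segmentation above and the consonant that briefly sits in the wrong syllable.

```python
INITIALS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"                    # 19 initials
VOWELS = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"                  # 21 medials
FINALS = " ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ"  # index 0 = no final


def syllable(ini, med, fin=" "):
    """Build a precomposed syllable from compatibility jamos."""
    index = (INITIALS.index(ini) * 21 + VOWELS.index(med)) * 28 + FINALS.index(fin)
    return chr(0xAC00 + index)


def compose_bipartite(keys):
    """Greedy composition of a stream of consonant/vowel keystrokes."""
    out = []
    ini = med = fin = None

    def flush():
        nonlocal ini, med, fin
        if ini and med:
            out.append(syllable(ini, med, fin or " "))
        elif ini:
            out.append(ini)          # lone consonant stays a bare jamo
        ini = med = fin = None

    for ch in keys:
        if ch in INITIALS:
            if ini and med and not fin and ch in FINALS:
                fin = ch             # tentatively the final of this syllable
            else:
                flush()
                ini = ch
        elif ch in VOWELS:
            if ini and not med:
                med = ch
            elif ini and med and fin:
                carried, fin = fin, None
                flush()              # the final jumps to the next syllable
                ini, med = carried, ch
            else:
                flush()
                out.append(ch)       # lone vowel stays a bare jamo
        else:
            flush()
            out.append(ch)           # anything else passes through
    flush()
    return "".join(out)


print(compose_bipartite("ㄷㅜㅂ"))            # ㅂ is temporarily a final here
print(compose_bipartite("ㄷㅜㅂㅓㄹㅅㅣㄱ"))  # 두벌식
```

A tripartite layout avoids the carry-over branch entirely, because the key itself says whether a consonant is an initial or a final.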

nexusanphans commented 5 years ago

@lifthrasiir Thank you for your detailed answer, although what I meant to ask about was the low-level methods (i.e. at the level of Unicode code points), of which I only know two: precomposed syllables vs jamos (initial, medial, and final). The former uses the Hangul Syllables block (AC00–D7A3), while the latter uses the Hangul Jamo block along with two extended blocks. Precomposed syllables encode whole syllables, while jamos are analogous to Latin letters, but with separate initial and final versions to allow precise syllable breaking.

There is another block, however, that of Hangul Compatibility Jamo (3130–318F). Looking at this Wikipedia page, it seems to be used only for compatibility purpose with another, older encoding system (Unicode is too concerned with backward compatibility, IMO). It is supposed to behave like jamos but with no separation of initial vs final consonants, which may complicate things since such an arrangement can form ambiguous syllables.
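The three code-point-level spellings can be compared side by side with `unicodedata`. This is only a sketch using one sample syllable (한, U+D55C); the last line shows the ambiguity mentioned above, since NFKC has to guess and maps a lone compatibility consonant to its initial form.

```python
import unicodedata

precomposed = "\uD55C"                                   # Hangul Syllables block
conjoining = unicodedata.normalize("NFD", precomposed)   # Hangul Jamo block
compat = "\u314E\u314F\u3134"                            # Compatibility Jamo: ㅎ ㅏ ㄴ


def codepoints(text):
    """Render a string as a list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


print(codepoints(conjoining))  # U+1112 U+1161 U+11AB (initial, medial, final)

# Conjoining jamos are canonically equivalent to the precomposed syllable.
print(unicodedata.normalize("NFC", conjoining) == precomposed)  # True

# Compatibility jamos are not: NFC leaves them untouched.
print(unicodedata.normalize("NFC", compat) == compat)           # True

# NFKC maps them to conjoining jamos, but ㄴ becomes an *initial* (U+1102),
# so the result is a two-jamo syllable followed by a stray initial, not U+D55C.
print(codepoints(unicodedata.normalize("NFKC", compat)))        # U+D558 U+1102
```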

lifthrasiir commented 5 years ago

> Looking at this Wikipedia page, it seems to be used only for compatibility purpose with another, older encoding system (Unicode is too concerned with backward compatibility, IMO).

That was the intention, but in practice compatibility jamos are used everywhere. In other words, modern Korean IMEs do operate at the combining jamo level before the commit, but the candidate text is materialized as compatibility jamos when they can't combine with each other.

The motivating example would be the Korean equivalents of "lol" and "*sob*": "ㅋㅋㅋㅋㅋㅋㅋㅋㅋ" and "ㅠㅠ". They are not intended to compose, so "ㅋㅋㅋㅋ" followed by "ㅠㅠ", once written, should not compose into "ㅋㅋㅋ큐ㅠ" as combining jamos would. You can often see that composed form though, because people often don't signal that fact (by explicitly committing the text) to the IME.
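The difference is easy to reproduce with plain normalization. Note the assumption: NFC here is only a stand-in for "what combining jamos do when left next to each other", not a model of how any particular IME works internally.

```python
import unicodedata

# "ㅋㅋㅋㅋㅠㅠ" spelled with compatibility jamos (what ends up in the text).
compat = "\u314B" * 4 + "\u3160" * 2

# The same keystrokes spelled with combining (conjoining) jamos.
conjoining = "\u110F" * 4 + "\u1172" * 2

# Compatibility jamos are inert: normalization leaves them as six letters.
print(unicodedata.normalize("NFC", compat) == compat)  # True

# Combining jamos are not: the last initial grabs the first vowel.
composed = unicodedata.normalize("NFC", conjoining)
print(len(composed))                 # 5
print("\uD050" in composed)          # True, the composed syllable 큐
```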

jagracey / Awesome-Unicode

Support material for Hangul filler #4