jagracey / Awesome-Unicode

:joy: :ok_hand: A curated list of delightful Unicode tidbits, packages and resources.
https://git.io/Awesome-Unicode
Creative Commons Zero v1.0 Universal

Support material for Hangul filler #4

Open lifthrasiir opened 8 years ago

lifthrasiir commented 8 years ago

This is feedback on the Hangul filler section. It was originally posted on Reddit, but I guess you can alter the wording to incorporate the following:

U+3164 HANGUL FILLER is one of the stupidest choices made by character sets. Hangul is noted for its algorithmic construction, and Hangul charsets should ideally follow that. Unfortunately, the predominant methods for multibyte encoding were ISO 2022 and EUC, and both required a rather small repertoire of 94 × 94 = 8,836 characters [1], which is much less than the 19 × 21 × 28 = 11,172 modern syllables required.
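To make the arithmetic concrete, here is a minimal Python sketch. The counts come from the paragraph above; the `syllable` helper and its constant names are my own, and the formula it uses is the arrangement of the Unicode Hangul Syllables block (starting at U+AC00), not anything defined by KS X 1001.

```python
# One 94 x 94 ISO 2022 / EUC plane versus the number of modern syllables.
PLANE_SIZE = 94 * 94                      # 8,836 code positions
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28    # initials, medials, finals (incl. "no final")
SYLLABLE_COUNT = L_COUNT * V_COUNT * T_COUNT  # 11,172 modern syllables

S_BASE = 0xAC00  # first code point of the Unicode Hangul Syllables block


def syllable(l_index: int, v_index: int, t_index: int = 0) -> str:
    """Return the precomposed syllable for zero-based jamo indices."""
    if not (0 <= l_index < L_COUNT and 0 <= v_index < V_COUNT and 0 <= t_index < T_COUNT):
        raise ValueError("jamo index out of range")
    return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index)


# How many syllables do not fit even if the whole plane held nothing else.
print(PLANE_SIZE, SYLLABLE_COUNT, SYLLABLE_COUNT - PLANE_SIZE)
# First and last syllables of the algorithmic range.
print(syllable(0, 0), syllable(18, 20, 27))
```

The point is that a syllable is fully determined by three small indices, which is why a charset that enumerates only a subset of syllables feels so arbitrary.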

The initial KS X 1001 charset, therefore, only contained 2,350 frequent syllables (plus 4,888 Chinese characters with some duplicates, themselves becoming another Unicode headache). Notwithstanding the fact that the remaining syllables are NOT supported, this resulted in a significant complexity burden for every piece of Hangul-supporting software, and there was confusion and contention between KS X 1001 and the less interoperable "compositional" (johab) encodings before Unicode. The standardization committee later acknowledged the charset's shortcoming, but only by adding four-letter (thus eight-byte) ad-hoc combinations for all remaining syllables! The Hangul filler is a designator for such combinations, e.g. <filler>ㄱㅏ<filler> denotes 가 and <filler>ㅂㅞㄺ denotes 뷁 (not in KS X 1001 per se).
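A rough sketch of how such a four-letter combination maps to a syllable, written in Python over Unicode strings rather than raw EUC-KR bytes. The function name and the tables are illustrative assumptions of mine; it only handles the case described above (filler, initial, medial, then a final or a second filler) and maps the result onto the Unicode Hangul Syllables block.

```python
FILLER = "\u3164"  # HANGUL FILLER

# Compatibility jamo, listed in the index order of the Hangul Syllables block.
INITIALS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"                 # 19 initials
MEDIALS = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"               # 21 medials
FINALS = "ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ"   # 27 finals


def compose_filler_sequence(seq: str) -> str:
    """Turn <filler><initial><medial><final-or-filler> into one syllable."""
    if len(seq) != 4 or seq[0] != FILLER:
        raise ValueError("expected a filler followed by three letters")
    initial, medial, final = seq[1], seq[2], seq[3]
    l_index = INITIALS.index(initial)   # raises ValueError for a non-initial
    v_index = MEDIALS.index(medial)     # raises ValueError for a non-medial
    # A trailing filler stands for "no final consonant" (index 0).
    t_index = 0 if final == FILLER else FINALS.index(final) + 1
    return chr(0xAC00 + (l_index * 21 + v_index) * 28 + t_index)


print(compose_filler_sequence(FILLER + "ㄱㅏ" + FILLER))
print(compose_filler_sequence(FILLER + "ㅂㅞㄺ"))
```

In the byte encoding each of the four letters takes two bytes, which is where the eight bytes per syllable come from.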

The Hangul filler arrived so late on the scene that it had virtually no support from the software industry. Sadly, the filler was there and Unicode had to accept it; technically it can be used to designate a letter (even though Unicode does not support the combinations), so the filler itself has to be considered a letter as well. What, the, hell.

[1] It is technically possible to use 94 × 94 × 94 = 830,584 characters with a three-byte encoding, but as far as I know no such charset was ever designed (and thus there is no real support either).

(Feel free to ask me about Hangul and more generally CJK support in Unicode.)

jagracey commented 8 years ago

This was really insightful! Thanks so much. I left CJK out because I personally have zero understanding of anything CJK, despite reading the reports, CodePoints.net, and Wikipedia.

Awesome-Unicode deserves a dedicated section on CJK in addition to an explanation about why the Hangul Filler is a thing (which you've done quite well).

Would you be able to contribute a section on CJK?

lifthrasiir commented 8 years ago

@jagracey I think I'm a terrible writer :P (even in my native tongue). I can list some interesting bits about CJK and Unicode, however:

Probably more, but I cannot recall others right now.

nexusanphans commented 5 years ago

@lifthrasiir so, how many methods are there to type Hangul? I only know two: precomposed syllables vs jamos.

lifthrasiir commented 5 years ago

@nexusanphans You cannot directly type precomposed syllables (there are 11,172 modern ones); it takes multiple keystrokes to complete one syllable. Hangul is simple enough not to require the dictionary-based complex IME necessary for Japanese or Chinese, but complex enough to allow many clever ideas to optimize for. I'm aware of 30+ methods, and the most popular ones include:

nexusanphans commented 5 years ago

@lifthrasiir Thank you for your detailed answer, although what I meant to ask about was the low-level methods (i.e. at the level of Unicode code points), of which I only know two: precomposed syllables vs jamos (initial, medial, and final). The former uses the Hangul Syllables block (AC00–D7A3), while the latter uses the Hangul Jamo block along with two extended blocks. Precomposed syllables encode whole syllables, while jamos are analogous to Latin letters, but with separate initial and final versions to allow precise syllable breaking.
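As a small illustration of those two representations, Python's standard `unicodedata` module can convert between them through normalization (NFD splits a precomposed syllable into conjoining jamos, NFC recombines them). The example syllable is an arbitrary choice.

```python
import unicodedata

precomposed = "한"  # a single code point from the Hangul Syllables block
jamos = unicodedata.normalize("NFD", precomposed)  # conjoining jamo sequence

# Show the code points of both forms and the names of the individual jamos.
print([f"U+{ord(c):04X}" for c in precomposed])
print([f"U+{ord(c):04X}" for c in jamos])
print([unicodedata.name(c) for c in jamos])

# NFC goes the other way and recombines the jamos into a single syllable.
assert unicodedata.normalize("NFC", jamos) == precomposed
```

The character names make the initial/medial/final distinction visible: they read CHOSEONG, JUNGSEONG and JONGSEONG respectively.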

There is another block, however: Hangul Compatibility Jamo (3130–318F). Looking at this Wikipedia page, it seems to be used only for compatibility purposes with another, older encoding system (Unicode is too concerned with backward compatibility, IMO). It is supposed to behave like jamos but with no separation of initial vs final consonants, which may complicate things, since such an arrangement can form ambiguous syllables.

lifthrasiir commented 5 years ago

> Looking at this Wikipedia page, it seems to be used only for compatibility purpose with another, older encoding system (Unicode is too concerned with backward compatibility, IMO).

That was the intention, but in practice compatibility jamos are used everywhere. In other words, modern Korean IMEs do operate at the combining jamo level before the commit, but the candidate text is materialized as compatibility jamos when they can't combine with each other.

The motivating example would be the Korean equivalents of "lol" and "*sob*": "ㅋㅋㅋㅋㅋㅋㅋㅋㅋ" and "ㅠㅠ". They are not intended to compose, so "ㅋㅋㅋㅋ" followed by "ㅠㅠ", once written, should not compose into "ㅋㅋㅋ큐ㅠ" the way combining jamos would. You can often see that composed form though, because people often don't signal that fact to the IME (by explicitly committing the text).
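The difference can be reproduced outside an IME with Unicode normalization, which I use here only as a stand-in for the composition an IME performs; an actual IME works on keystrokes, not on NFC. A minimal Python sketch, with variable names of my own choosing:

```python
import unicodedata

# Compatibility jamo (U+3130-318F): how the "lol" + "sob" text is usually stored.
compat = "\u314B" * 4 + "\u3160" * 2        # ㅋㅋㅋㅋㅠㅠ
# The same letters as conjoining jamo: CHOSEONG KHIEUKH and JUNGSEONG YU.
conjoining = "\u110F" * 4 + "\u1172" * 2

compat_nfc = unicodedata.normalize("NFC", compat)
conjoining_nfc = unicodedata.normalize("NFC", conjoining)

# Compatibility jamo are left alone by canonical composition.
print(compat_nfc == compat)
# Conjoining jamo are not: the last initial and the first medial merge.
print([f"U+{ord(c):04X}" for c in conjoining_nfc])
```

With conjoining jamo, the fourth ㅋ and the first ㅠ merge into the single syllable 큐 (U+D050), which is the "ㅋㅋㅋ큐ㅠ" shape described above; the compatibility jamo string comes back unchanged.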