unicode-org / icu4x

Solving i18n for client-side and resource-constrained environments.
https://icu4x.unicode.org

Consider providing mapping to Standardized Variants for CJK Compatibility Ideographs #2886

Open hsivonen opened 1 year ago

hsivonen commented 1 year ago

Upon comparing the feature set of the unicode-normalization crate with the feature set of icu_normalizer, I discovered that unicode-normalization supports mapping CJK Compatibility Ideographs to Standardized Variants.

Unicode 15.0 says (page 932; PDF page 958):

> CJK Compatibility Ideographs. There are 1,002 standardized variation sequences for CJK compatibility ideographs. One sequence is defined for each CJK compatibility ideograph in the Unicode Standard. These sequences are defined to address a normalization issue for these ideographs.
>
> Implementations or users sometimes need a CJK compatibility ideograph to be distinct from its corresponding CJK unified ideograph. For example, a distinct glyphic form may be expected for a particular text. However, CJK compatibility ideographs have canonical equivalence mappings to their corresponding CJK unified ideograph, which means that such distinctions are lost whenever Unicode normalization is applied. Using the variation sequence preserves the distinction found in the original, non-normalized text, even when normalization is later applied.
>
> Because variation sequences are not affected by Unicode normalization, an implementation which uses the corresponding standardized variation sequence can safely maintain the intended distinction for that CJK compatibility ideograph, even in plain text.
>
> It is important to distinguish standardized variation sequences for CJK compatibility ideographs from the variation sequences that are registered in the Ideographic Variation Database (IVD). The former are normalization-stable representations of the CJK compatibility ideographs; they are defined in StandardizedVariants.txt, and there is precisely one variation sequence for each CJK compatibility ideograph. The latter are also stable under normalization, but correspond to implementation-specific glyphs in a registry entry.

Technically, icu_normalizer could support this mapping followed by NFD or this mapping followed by NFC by representing this mapping as a DecompositionSupplementV1 with an associated DecompositionTablesV1. Somewhat unfortunately, the mappings to two BMP characters would still be stored in DecompositionTablesV1, since the in-trie pairs are reserved for the case where the canonical combining class of the second character is non-zero. It might be worthwhile to consider if it would make sense to relax that invariant for supplements, which are never used with the collator. (The invariant is collator-motivated in the first place.)
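To make the shape of the feature concrete, here is a minimal, hypothetical sketch of the user-visible mapping — this is not icu_normalizer's actual API, and the function names are invented for illustration. It is seeded with the single documented entry from StandardizedVariants.txt for U+2F800, whose standardized variation sequence is <U+4E3D, U+FE00>; a real implementation would carry all 1,002 mappings as data rather than as a match arm.

```rust
/// Hypothetical sketch: return the standardized variation sequence
/// (base character plus variation selector) for a CJK compatibility
/// ideograph, or `None` for any other character. Only one documented
/// entry from StandardizedVariants.txt is included here; a real
/// implementation would look up all 1,002 mappings in its data.
fn to_standardized_variant(c: char) -> Option<[char; 2]> {
    match c {
        // CJK COMPATIBILITY IDEOGRAPH-2F800 ↔ <U+4E3D, U+FE00>
        '\u{2F800}' => Some(['\u{4E3D}', '\u{FE00}']),
        _ => None,
    }
}

/// Hypothetical sketch: rewrite a string so that CJK compatibility
/// ideographs become their standardized variation sequences, which
/// then survive NFC/NFD unchanged.
fn map_cjk_compat_variants(input: &str) -> String {
    let mut out = String::new();
    for c in input.chars() {
        match to_standardized_variant(c) {
            Some([base, vs]) => {
                out.push(base);
                out.push(vs);
            }
            None => out.push(c),
        }
    }
    out
}

fn main() {
    let mapped = map_cjk_compat_variants("\u{2F800}");
    // One compatibility ideograph becomes a two-scalar sequence.
    assert_eq!(mapped, "\u{4E3D}\u{FE00}");
    println!("{}", mapped.escape_unicode());
}
```

In an icu_normalizer-shaped design, this pass would run before (or be fused with) NFD/NFC, which is what representing it as a decomposition supplement would accomplish.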

As for use cases, I spot-checked the IRG sources of a handful of the compatibility characters. I saw one KP-source character. Other than that, the BMP ones that I happened to check were K-source, and the Plane 2 ones were T-source from the higher planes of CNS 11643. Given the usage ratio of Hangul vs. Hanja for the Korean language, and the higher planes of CNS 11643 being rare for Traditional Chinese, without proper domain expertise this feature seems to me more like a historical-text-relevant feature than a modern-text-relevant one, but I'd appreciate a characterization by someone with domain expertise.

Across GitHub, I found 3 users of this feature in unicode-normalization:

hsivonen commented 1 year ago

Oops. I missed the sentence "One sequence is defined for each CJK compatibility ideograph in the Unicode Standard." My spot-checking therefore doesn't make sense; instead, it makes sense to reason from compatibility ideographs in general.

hsivonen commented 1 year ago

It seems somewhat unfortunate that the same mechanism applies both to characters that KS X 1001 encoded multiple times due to different readings and to characters that source standards considered distinct by appearance.

For the former, today's practice seems to reject the notion of encoding same-looking characters multiple times for multiple readings: in particular, phonetic collations in CLDR admit one collation-sensitive reading per ideograph.

For the latter, a relevant question is: In what scenario does a normalization process actually encounter compatibility ideograph input in practice? Theory would say "after conversion to Unicode from EUC-TW", but EUC-TW didn't end up getting mainstream adoption. When it's important to maintain the distinction of character appearance, does the input to normalization already exist in variation sequence form, authored in a variation-sequence-aware Unicode environment, as opposed to existing in compatibility ideograph form (which wouldn't survive plain normalization)?