Open hsivonen opened 1 year ago
Oops. I missed the sentence "One sequence is defined for each CJK compatibility ideograph in the Unicode Standard." My spot-checking then doesn't make sense, and reasoning from compatibility ideographs in general makes sense.
It seems somewhat unfortunate that the same mechanism applies both to characters that KS X 1001 encoded multiple times due to different readings and to characters that source standards considered distinct by appearance.
For the former, it seems that today's practice rejects the notion of encoding same-looking characters multiple times, once per reading: In particular, phonetic collations in CLDR admit one collation-sensitive reading per ideograph.
For the latter, a relevant question is: In what scenario does a normalization process actually encounter Compatibility Ideograph input in practice? Theory would say "after conversion to Unicode from EUC-TW", but EUC-TW didn't end up getting mainstream adoption. When it's important to maintain the distinction of character appearance, does the input to normalization already exist in variation sequence form, authored in a variation sequence-aware Unicode environment, as opposed to existing in compatibility ideograph form (which wouldn't survive plain normalization)?
Upon comparing the feature set of the `unicode-normalization` crate with the feature set of `icu_normalizer`, I discovered that `unicode-normalization` supports mapping CJK Compatibility Ideographs to Standardized Variants.

Unicode 15.0 says (page 932; PDF page 958):
Technically, `icu_normalizer` could support this mapping followed by NFD or this mapping followed by NFC by representing this mapping as a `DecompositionSupplementV1` with an associated `DecompositionTablesV1`. Somewhat unfortunately, the mappings to two BMP characters would still be stored in `DecompositionTablesV1`, since the in-trie pairs are reserved for the case where the canonical combining class of the second character is non-zero. It might be worthwhile to consider if it would make sense to relax that invariant for supplements, which are never used with the collator. (The invariant is collator-motivated in the first place.)

As for use cases, I spot-checked the IRG source of a handful of the compatibility characters. I saw one KP-source character. Other than that, the BMP ones that I happened to check were K-source and the Plane 2 ones were T-source from the higher planes of CNS 11643. Given the usage ratio of Hangul vs. Hanja for the Korean language and the higher planes of CNS 11643 being rare for Traditional Chinese, without proper domain expertise, this feature seems to me more like a historical-text-relevant feature than a modern-text-relevant feature, but I'd appreciate a characterization by someone with domain expertise.
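To make the mapping under discussion concrete, here is a minimal, std-only Rust sketch of what "mapping CJK Compatibility Ideographs to Standardized Variants" means: each compatibility ideograph is replaced by its unified ideograph followed by a variation selector, so the appearance distinction survives a later NFD or NFC pass. The function names `compat_to_variant` and `map_compat_variants` are hypothetical, and the two table entries are only an illustrative subset of the sequences listed in the Unicode data file `StandardizedVariants.txt`. This is not how `unicode-normalization` or `icu_normalizer` store the data.

```rust
/// Hypothetical helper: returns the standardized variation sequence for a
/// CJK compatibility ideograph, or `None` if `c` is not in the table.
/// Only two illustrative entries are included; the full set is defined by
/// the Unicode Standard.
fn compat_to_variant(c: char) -> Option<[char; 2]> {
    match c {
        // U+F900 -> U+8C48 followed by VARIATION SELECTOR-1 (U+FE00)
        '\u{F900}' => Some(['\u{8C48}', '\u{FE00}']),
        // U+F901 -> U+66F4 followed by VARIATION SELECTOR-1 (U+FE00)
        '\u{F901}' => Some(['\u{66F4}', '\u{FE00}']),
        _ => None,
    }
}

/// Hypothetical helper: applies the mapping to every character of `input`,
/// passing all other characters through unchanged.
fn map_compat_variants(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for c in input.chars() {
        match compat_to_variant(c) {
            Some(pair) => out.extend(pair),
            None => out.push(c),
        }
    }
    out
}
```

Calling `map_compat_variants("a\u{F901}b")` yields `a`, U+66F4, U+FE00, `b`. Both output characters of each pair are in the BMP, and the variation selector has canonical combining class zero, which is the case that would not fit the in-trie pair representation described above. The crate appears to expose the real mapping as a `cjk_compat_variants()` adapter; treat that name as an assumption here.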
Across GitHub, I found 3 users of this feature in `unicode-normalization`:

`unicode-normalization` feature)