unicode-org / icu4x

Solving i18n for client-side and resource-constrained environments.
https://icu4x.unicode.org

Investigate making trie type small know about constant values for CJK Unified Ideographs and Hangul Syllables #5820

Open hsivonen opened 1 week ago

hsivonen commented 1 week ago

Trie type small and fast differ in how they handle BMP code points above 0xFFF. The characters in this range that can be very frequent in text (as opposed to e.g. General Punctuation) and that affect a particularly large number of users are CJK characters.

Of these, it is common for all CJK Unified Ideographs to have a single trie value in common and for all Hangul Syllables to have (another) single trie value in common.

Thus, if trie type small special-cased these two ranges after branching on the 0xFFF bound but before running the general above-0xFFF code, chances are that most of the performance benefit of trie type fast could be had with trie type small. (For Japanese, this would obviously only speed up kanji, not kana.)
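As a rough illustration of the branch order described above, here is a minimal sketch. The `TrieLookup` trait, `get_with_constants`, and `ToyTrie` are hypothetical names made up for this example and are not the ICU4X API. The range bounds are the CJK Unified Ideographs block (U+4E00..U+9FFF) and the assigned Hangul Syllables (U+AC00..U+D7A3).

```rust
use std::cell::Cell;

/// Highest code point handled by the fast path of trie type small.
const SMALL_FAST_MAX: u32 = 0xFFF;
const CJK_START: u32 = 0x4E00;
const CJK_END: u32 = 0x9FFF;
const HANGUL_START: u32 = 0xAC00;
const HANGUL_END: u32 = 0xD7A3;

/// Hypothetical stand-in for the two existing lookup paths of a small trie.
pub trait TrieLookup {
    /// Lookup for code points at or below 0xFFF.
    fn fast_lookup(&self, c: u32) -> u32;
    /// General (slower) lookup for code points above 0xFFF.
    fn general_lookup(&self, c: u32) -> u32;
}

/// Branch on the 0xFFF bound first, then on the two constant ranges,
/// and only then fall back to the general above-0xFFF code.
pub fn get_with_constants<T: TrieLookup>(
    trie: &T,
    c: u32,
    cjk_value: u32,
    hangul_value: u32,
) -> u32 {
    if c <= SMALL_FAST_MAX {
        return trie.fast_lookup(c);
    }
    if (CJK_START..=CJK_END).contains(&c) {
        return cjk_value;
    }
    if (HANGUL_START..=HANGUL_END).contains(&c) {
        return hangul_value;
    }
    trie.general_lookup(c)
}

/// Toy trie that counts how often the general path runs.
/// The returned values 7 and 9 are arbitrary placeholders.
pub struct ToyTrie {
    general_calls: Cell<u32>,
}

impl ToyTrie {
    pub fn new() -> Self {
        ToyTrie { general_calls: Cell::new(0) }
    }
    pub fn general_calls(&self) -> u32 {
        self.general_calls.get()
    }
}

impl TrieLookup for ToyTrie {
    fn fast_lookup(&self, _c: u32) -> u32 {
        7
    }
    fn general_lookup(&self, _c: u32) -> u32 {
        self.general_calls.set(self.general_calls.get() + 1);
        9
    }
}
```

With this ordering, a lookup for a kanji or a Hangul syllable never reaches `general_lookup`, while kana (for example U+3042) still does.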

Unfortunately, there isn't exactly one way this could go.

In the normalizer case, CJK Unified Ideographs get the default trie value of 0, but the range could be extended to also cover code points before and after the precise CJK Unified Ideographs range. Also, the caller code could statically provide the wanted trie value.

In the normalizer case, Hangul Syllables get the trie value of 1 and the caller could statically provide this value.

In the collator case, the trie is never queried for Hangul Syllables, so it's not worthwhile to branch on that range in the trie code. Also, CJK Unified Ideographs get a constant trie value in the root collation in the implicithan case, but 1) the exact value cannot be hard-coded in the caller (as it is chosen by the data builder and could change across CLDR versions) and 2) there isn't a constant for the whole block in the typical CJK tailorings (which should probably be built with trie type fast even if the root has trie type small) or in the unihan-mode root.
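One way to accommodate the differences between the normalizer and collator cases might be to let each caller describe its own constants. The sketch below is only an assumption about how that could look: `ConstantRanges`, `NormalizerRanges`, `CollatorRanges`, and `get_above_fff` are hypothetical names, not existing ICU4X items, and the general lookup is passed in as a closure to keep the example self-contained. The normalizer values 0 and 1 come from the description above; the collator value is a run-time field because it is chosen by the data builder.

```rust
const CJK_START: u32 = 0x4E00;
const CJK_END: u32 = 0x9FFF;
const HANGUL_START: u32 = 0xAC00;
const HANGUL_END: u32 = 0xD7A3;

/// Hypothetical trait: how a caller describes its constant ranges.
pub trait ConstantRanges {
    /// Value shared by all Hangul Syllables, or `None` if the caller
    /// never queries the trie for that range.
    const HANGUL_VALUE: Option<u32>;
    /// Value shared by all CJK Unified Ideographs, or `None` if the
    /// data has no single value for the whole block.
    fn cjk_value(&self) -> Option<u32>;
}

/// Normalizer: both values are statically known (0 and 1).
pub struct NormalizerRanges;

impl ConstantRanges for NormalizerRanges {
    const HANGUL_VALUE: Option<u32> = Some(1);
    fn cjk_value(&self) -> Option<u32> {
        Some(0)
    }
}

/// Collator: no Hangul branch. The ideograph value, if any, would be
/// read from data at load time rather than hard-coded.
pub struct CollatorRanges {
    pub implicit_han_value: Option<u32>,
}

impl ConstantRanges for CollatorRanges {
    const HANGUL_VALUE: Option<u32> = None;
    fn cjk_value(&self) -> Option<u32> {
        self.implicit_han_value
    }
}

/// Lookup for code points above 0xFFF (the 0xFFF branch is assumed to
/// have happened already).
pub fn get_above_fff<R: ConstantRanges>(
    ranges: &R,
    c: u32,
    general_lookup: impl Fn(u32) -> u32,
) -> u32 {
    if (CJK_START..=CJK_END).contains(&c) {
        if let Some(v) = ranges.cjk_value() {
            return v;
        }
    }
    // With `HANGUL_VALUE == None`, this check is on a compile-time
    // constant, so the optimizer can be expected to drop the branch.
    if let Some(v) = R::HANGUL_VALUE {
        if (HANGUL_START..=HANGUL_END).contains(&c) {
            return v;
        }
    }
    general_lookup(c)
}
```

This sketch keeps the ideograph range fixed. Extending it to cover code points before and after the block, as suggested for the normalizer, would mean making the bounds caller-provided as well.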

sffc commented 1 week ago

Working group discussion: