Investigate making trie type small know about constant values for CJK Unified Ideographs and Hangul Syllables

unicode-org / icu4x

Solving i18n for client-side and resource-constrained environments.

Other

1.38k stars 178 forks source link

Trie type small and fast differ in how they handle BMP code points above 0xFFF. The characters in this range that can be very frequent in text (as opposed to e.g. General Punctuation) and that affect a particularly large number of users are CJK characters.

Of these, it is common for all CJK Unified Ideographs to have a single trie value in common and for all Hangul Syllables to have (another) single trie value in common.

Thus, if trie type small special-cased these after branching on the 0xFFF bound but before running the general above-0xFFF code, chances are that most of the performance benefit of trie type fast could be had with trie type small (for Japanese this would obviously only speed up kanji but not kana).

Unfortunately, there isn't an exact one way of how this could go.

In the normalizer case, CJK Unified Ideographs get the default trie value of 0, but the range could be extended to also cover code points before and after the precise CJK Unified Ideographs range. Also, the caller code could statically provide the wanted trie value.

In the normalizer case, Hangul Syllables get the trie value of 1 and the caller could statically provide this value.

In the collator case, the trie is never queried for Hangul Syllables, so it's not worthwhile to branch on that range in the trie code. Also, CJK Unified Ideographs gets a constant trie value in the root collation in the implicithan case, but 1) the exact value cannot be hard-coded in the caller (as it is chosen by the data builder and could change across CLDR versions) and 2) there isn't a constant for the whole block in the typical CJK tailorings (which should probably be built with trie type fast even if the root has trie type small) or in the unihan-mode root.

Working group discussion:

@hsivonen The idea is that when we are looking up CJK ideographs or Hangul syllables, instead of having them in separate tries, we could combine them, and the lookup after the first branch to see if the code point is less than 0xFFF, and then the next branch could be to check whether it's in the Hangul range or not (which means it is otherwise a CJK ideograph), and that would potentially give the speed benefits of a fast trie, but for data that otherwise would need extra branching that a small trie requires.
@sffc This is is in the same category as normalize_str,in which after this new check/branch proposed, there's no more branching, correct?
@hsivonen No. We would need to include the combination of start, length, and value, both for range 1 and range 2. Maybe this would be an API hack. If we then wanted to take this to the Collator, then it would be an internal implementation detail.
@sffc OK so basically this is creating another special CodePointTrie API for Normalizer.
@hsivonen We could start with the hack for Normalizer first, yes.
@sffc It's not without cost. With the extra branching, there is extra code size and maintenance costs. It's not clear that the cost is worth it. You could just use trie type "fast".
@hsivonen If you look at the sizes for the NFK and UTS46 tries, there is a fairly substantial difference.

unicode-org / icu4x

Investigate making trie type small know about constant values for CJK Unified Ideographs and Hangul Syllables #5820