unicode-org / icu4x

Solving i18n for client-side and resource-constrained environments.
https://icu4x.unicode.org
Other
1.39k stars 180 forks source link

Apply transliterator data struct optimizations at datagen-time #3825

Open skius opened 1 year ago

skius commented 1 year ago

The current zero-copy format of transliterators allows multiple representations for the same transliterator (this is impossible to avoid in general), which means we can try and optimize certain metrics. This issue presents a few possible optimizations (and examples of that optimization in parentheses).

Variable inlining

Given a variable with an (encoded) definition of C bytes (i.e., in UTF-8 in the zero-copy format) that is used elsewhere n times, we have the following:

Thus, if 2 + 4*n + C > n * C, inlining this variable at all use-sites gives us a size reduction (and reduced indirection!) of 2 + 4*n (1-n)*C. Note that a greedy algorithm will already give us these benefits at little developer-cost, but a more sophisticated algorithm could perform even better in the case of variables being used within variables.

As a special case of the above computation, inlining variables whose encoded definition takes up 4 or less bytes is always an improvement because referring to the variable costs 4 bytes[^1].

Example
$ident = '$' [a-z] [a-z0-9]+ ;   # C = 9, 1 for $, 4 for [a-z], and 4 for ([a-z0-9]+)
$ident ',' > ;
'"' $ident '"' > ... ;

Cost of having the variable definition 2 + 4*2 + 9 = 19, cost of inlining: 2*9 = 18, therefore we would save a byte when inlining. However, adding a third use of $ident would lead to four added bytes when inlining vs. keeping the variable.


Next step: Wait until at an initial version of transliteration has landed and we have reference data struct sizes, then discuss priority of this/whether we want to do this even.

[^1]: Again under the assumption we use the private use plane, whose code points encode to 4 UTF-8 bytes

macchiati commented 1 year ago

Nice analysis