Apply transliterator data struct optimizations at datagen-time

The current zero-copy format of transliterators allows multiple representations for the same transliterator (this is impossible to avoid in general), which means we can try and optimize certain metrics. This issue presents a few possible optimizations (and examples of that optimization in parentheses).

[ ] Single-element UnicodeSets could be flattened+inlined into literals ([a] => 'a')
[ ] Deduplicate special constructs in the VarTable ([a-z] { x } [a-z] > X only stores [a-z] in VarTable once and reuses it)
[ ] Variable inlining (see below)

Variable inlining

Given a variable with an (encoded) definition of C bytes (i.e., in UTF-8 in the zero-copy format) that is used elsewhere n times, we have the following:

The cost of having a variable definition is 2 + 4*n + C bytes. 2 because we add an element to a VarTable VarZeroVec which requires a 16-bit index, 4*n because every use of the variable will refer to it using a private-use code point from the PUP (15), which is 4 bytes in UTF-8, and C because that's the size of the definition we will store in the VarTable.
The cost of inlining every use of the variable is n*C bytes. We copy the full definition of length C to all n uses of the variable.

Thus, if 2 + 4*n + C > n * C, inlining this variable at all use-sites gives us a size reduction (and reduced indirection!) of 2 + 4*n (1-n)*C. Note that a greedy algorithm will already give us these benefits at little developer-cost, but a more sophisticated algorithm could perform even better in the case of variables being used within variables.

As a special case of the above computation, inlining variables whose encoded definition takes up 4 or less bytes is always an improvement because referring to the variable costs 4 bytes[^1].

Example

$ident = '$' [a-z] [a-z0-9]+ ;   # C = 9, 1 for $, 4 for [a-z], and 4 for ([a-z0-9]+)
$ident ',' > ;
'"' $ident '"' > ... ;

Cost of having the variable definition 2 + 4*2 + 9 = 19, cost of inlining: 2*9 = 18, therefore we would save a byte when inlining. However, adding a third use of $ident would lead to four added bytes when inlining vs. keeping the variable.

Next step: Wait until at an initial version of transliteration has landed and we have reference data struct sizes, then discuss priority of this/whether we want to do this even.

[^1]: Again under the assumption we use the private use plane, whose code points encode to 4 UTF-8 bytes

unicode-org / icu4x

Apply transliterator data struct optimizations at datagen-time #3825

Variable inlining

Example