The current zero-copy format of transliterators allows multiple representations for the same transliterator (this is impossible to avoid in general), which means we can try to optimize certain metrics. This issue lists a few possible optimizations, each with an example in parentheses.
[ ] Single-element UnicodeSets could be flattened+inlined into literals ([a] => 'a')
[ ] Deduplicate special constructs in the VarTable ([a-z] { x } [a-z] > X only stores [a-z] in VarTable once and reuses it)
[ ] Variable inlining (see below)
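The deduplication item above could be sketched as a simple interner. This is a minimal illustration only: `Interner`, `entries`, and `index_of` are hypothetical names, not the real VarTable, and it assumes two constructs are equal exactly when their source text is identical.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a table of special constructs (not the real VarTable).
#[derive(Default)]
struct Interner {
    /// Each distinct construct, stored once.
    entries: Vec<String>,
    /// Maps a construct's text to its position in `entries`.
    index_of: HashMap<String, usize>,
}

impl Interner {
    /// Returns the index of `construct`, storing it only the first time it is seen.
    fn intern(&mut self, construct: &str) -> usize {
        if let Some(&idx) = self.index_of.get(construct) {
            return idx;
        }
        let idx = self.entries.len();
        self.entries.push(construct.to_owned());
        self.index_of.insert(construct.to_owned(), idx);
        idx
    }
}
```

With this sketch, both occurrences of [a-z] in [a-z] { x } [a-z] > X would resolve to the same index, so the set is stored once.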
Variable inlining
Given a variable with an (encoded) definition of C bytes (i.e., in UTF-8 in the zero-copy format) that is used elsewhere n times, we have the following:
The cost of having a variable definition is 2 + 4*n + C bytes: 2 because we add an element to a VarTable VarZeroVec, which requires a 16-bit index; 4*n because every use of the variable refers to it using a private-use code point from the private use plane (PUP, plane 15), which is 4 bytes in UTF-8; and C because that's the size of the definition we store in the VarTable.
The cost of inlining every use of the variable is n*C bytes. We copy the full definition of length C to all n uses of the variable.
Thus, if 2 + 4*n + C > n*C, inlining this variable at all use sites gives us a size reduction (and reduced indirection!) of 2 + 4*n + (1-n)*C bytes. Note that a greedy algorithm will already give us these benefits at little developer cost, but a more sophisticated algorithm could perform even better when variables are used within other variables.
As a special case of the above computation, inlining variables whose encoded definition takes up 4 or fewer bytes is always an improvement, because every reference to the variable already costs 4 bytes[^1], so n*C <= 4*n < 2 + 4*n + C.
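The cost comparison above can be written down as a small Rust sketch. The function names are made up for illustration, and the constants 2 and 4 simply encode the assumptions stated above (a 16-bit index per definition and a 4-byte private-use code point per reference).

```rust
/// Bytes needed to keep the variable: a 2-byte index entry,
/// 4 bytes per use, plus the `c`-byte definition itself.
fn keep_cost(n: u64, c: u64) -> u64 {
    2 + 4 * n + c
}

/// Bytes needed to copy the `c`-byte definition to all `n` use sites.
fn inline_cost(n: u64, c: u64) -> u64 {
    n * c
}

/// Bytes saved by inlining; negative when inlining makes the data larger.
fn inline_saving(n: u64, c: u64) -> i64 {
    keep_cost(n, c) as i64 - inline_cost(n, c) as i64
}

/// Greedy per-variable decision: inline only if it strictly shrinks the data.
fn should_inline(n: u64, c: u64) -> bool {
    inline_saving(n, c) > 0
}
```

A greedy pass could call `should_inline` once per variable. It would not account for variables nested inside other variables, where inlining one changes the C of another.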
Example
$ident = '$' [a-z] [a-z0-9]+ ; # C = 9, 1 for $, 4 for [a-z], and 4 for ([a-z0-9]+)
$ident ',' > ;
'"' $ident '"' > ... ;
Cost of having the variable definition: 2 + 4*2 + 9 = 19 bytes. Cost of inlining: 2*9 = 18 bytes. Inlining would therefore save one byte. However, with a third use of $ident, inlining would cost four bytes more than keeping the variable (3*9 = 27 vs. 2 + 4*3 + 9 = 23).
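The break-even point in this example can be computed directly. Rearranging n*C < 2 + 4*n + C gives n*(C - 4) < C + 2, so for C > 4 there is a largest n at which inlining still wins. The helper below is a hypothetical sketch under the same 2-byte and 4-byte assumptions as above.

```rust
/// Largest number of uses `n` for which inlining a `c`-byte definition is
/// strictly smaller than keeping the variable, i.e. n*c < 2 + 4*n + c.
/// Returns `None` when c <= 4, where inlining wins for every n.
fn max_uses_for_inlining(c: u64) -> Option<u64> {
    if c <= 4 {
        return None;
    }
    // n*(c - 4) < c + 2 over the integers is n*(c - 4) <= c + 1,
    // so the largest valid n is floor((c + 1) / (c - 4)).
    Some((c + 1) / (c - 4))
}
```

For the $ident example (C = 9) this yields 2, matching the observation that a third use tips the balance toward keeping the variable.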
Next step: Wait until an initial version of transliteration has landed and we have reference data struct sizes, then discuss the priority of this and whether we want to do it at all.
[^1]: Again under the assumption we use the private use plane, whose code points encode to 4 UTF-8 bytes