unicode-org / icu4x

Solving i18n for client-side and resource-constrained environments.
https://icu4x.unicode.org

Consider supporting 1, 2, 4, and 24-bit trie values #4670

Open hsivonen opened 6 months ago

hsivonen commented 6 months ago

The trie builder always operates on 32-bit values and can then narrow the main backing array value to 8 or 16 bits at serialization time.

We already use a byte array as unaligned backing storage. We should consider slightly extending how an index-based read maps to the backing byte array, so that more compact value widths are supported:

If the byte array had one extra byte at the end, we could use 32-bit unaligned loads to read 24-bit values (masking off the highest 8 bits) without going out of bounds. See also #4669.
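A minimal sketch of the 24-bit read described above, not ICU4X's actual implementation. The function name `read_u24` is hypothetical, and the sketch assumes little-endian storage with 3 bytes per value plus one trailing padding byte. It assembles the 32-bit word with `u32::from_le_bytes` rather than a raw unaligned pointer load, which is the safe-Rust equivalent:

```rust
/// Hypothetical helper: reads the 24-bit value at `index` from `bytes`.
///
/// Assumption: values are stored little-endian, 3 bytes each, and the
/// array ends with one extra padding byte so that a 4-byte read of the
/// last value stays in bounds.
pub fn read_u24(bytes: &[u8], index: usize) -> Option<u32> {
    let start = index.checked_mul(3)?;
    let end = start.checked_add(4)?;
    // Bounds-checked 4-byte window; returns None past the padded end.
    let s = bytes.get(start..end)?;
    // Assemble a 32-bit little-endian word, then mask off the highest
    // 8 bits, which belong to the next value (or to the padding byte).
    Some(u32::from_le_bytes([s[0], s[1], s[2], s[3]]) & 0x00FF_FFFF)
}
```

With this layout, the padding byte is what lets the last value be read through the same 4-byte path as every other value, with no special case.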

For 1, 2, and 4-bit values, we could shift and mask the index to read sub-byte fields from an array whose byte length is 1/8, 1/4, or 1/2 of what it would be with 8 bits as the narrowest value width.
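The shift-and-mask read could look roughly like the following sketch. The name `read_packed` and the bit order (value 0 in the least significant bits of each byte) are assumptions made for illustration; the issue does not specify a layout:

```rust
/// Hypothetical helper: reads the `bits`-wide value at `index` from a
/// packed byte array. Only widths of 1, 2, and 4 bits are accepted.
///
/// Assumption: values are packed starting from the least significant
/// bits of each byte.
pub fn read_packed(bytes: &[u8], index: usize, bits: u32) -> Option<u8> {
    if !matches!(bits, 1 | 2 | 4) {
        return None;
    }
    // Number of values stored in each byte: 8, 4, or 2.
    let per_byte = (8 / bits) as usize;
    // The quotient selects the byte; get() bounds-checks the access.
    let byte = *bytes.get(index / per_byte)?;
    // The remainder selects the field within that byte.
    let shift = (index % per_byte) as u32 * bits;
    let mask = (1u8 << bits) - 1;
    Some((byte >> shift) & mask)
}
```

Because `per_byte` is always a power of two, the division and remainder reduce to a shift and a mask on the index, matching the description above.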

1 bit is useful for accessing a binary property faster than from a fragmented inversion list. 2 bits are useful for bundling two co-occurring binary properties. 4 bits are useful for enumerated properties with few distinct values, e.g. Joining_Type. 24 bits are useful for scalar values.

sffc commented 1 month ago

Some thoughts:

sffc commented 1 month ago

I'll put this in the 2.0 milestone, but it isn't super-high priority and it could slip to 3.0.