unicode-org / icu4x

Solving i18n for client-side and resource-constrained environments.
https://icu4x.unicode.org

Split numbering systems out of decimal data #5822

Closed Manishearth closed 6 days ago

Manishearth commented 1 week ago

Fixes https://github.com/unicode-org/icu4x/issues/5818

Before:

 decimal/symbols@2, <lookup>, 1316B, 252 identifiers
 decimal/symbols@2, <total>, 4308B, 2436B, 49 unique payloads

After:

 decimal/digits@1, <lookup>, 207B, 27 identifiers
 decimal/digits@1, <total>, 1080B, 1060B, 27 unique payloads
 decimal/symbols@2, <lookup>, 804B, 184 identifiers
 decimal/symbols@2, <total>, 1749B, 881B, 31 unique payloads

Saving ~1.5kB, a good third of the data size. A lot of the wins are just in deduplication.
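For reference, the ~1.5kB figure matches the first byte column of the `<total>` rows above. The sketch below only redoes that arithmetic; it assumes that column is the one being compared, and the helper names are made up for illustration.

```rust
/// Bytes saved when one key's total is replaced by the totals of several keys.
/// Hypothetical helper, not part of ICU4X.
pub fn saved_bytes(before: u32, after: &[u32]) -> u32 {
    let after_sum: u32 = after.iter().sum();
    before - after_sum
}

/// Saving as a whole-number percentage of the original size (rounded down).
pub fn saved_percent(before: u32, after: &[u32]) -> u32 {
    saved_bytes(before, after) * 100 / before
}
```

With the numbers from this comment, `saved_bytes(4308, &[1080, 1749])` is 1479 and `saved_percent(4308, &[1080, 1749])` is 34, which lines up with "a good third".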

I'm also going to try moving the tinystr into the VarZeroVec and seeing what happens.

I may also try and store the digits more compactly as an enum { Sequential(char), Many(ZeroVec<char>) }. A downside of this is that the Sequential case would need UTF8 validation every time, though we could make it so that that's just the wire format and we expand to a digit array on data load.
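A minimal sketch of that idea, assuming the compact enum is only the wire format and gets expanded to a ten-digit array once at load time. The names `DigitsWire` and `expand` are hypothetical, and `Vec<char>` stands in for `ZeroVec<char>` so the example needs nothing beyond the standard library.

```rust
/// Hypothetical wire format mirroring the enum floated above.
#[derive(Debug, Clone, PartialEq)]
pub enum DigitsWire {
    /// Ten consecutive code points starting at this zero digit.
    Sequential(char),
    /// Explicit digits; this sketch assumes exactly ten are expected.
    Many(Vec<char>),
}

/// Expand the wire format into a digit array, checking validity once.
/// Returns None if the data cannot describe ten valid digits.
pub fn expand(wire: &DigitsWire) -> Option<[char; 10]> {
    let mut out = ['0'; 10];
    match wire {
        DigitsWire::Sequential(zero) => {
            let base = *zero as u32;
            for (i, slot) in out.iter_mut().enumerate() {
                // from_u32 rejects surrogates and out-of-range values.
                *slot = char::from_u32(base + i as u32)?;
            }
        }
        DigitsWire::Many(digits) => {
            if digits.len() != 10 {
                return None;
            }
            out.copy_from_slice(digits);
        }
    }
    Some(out)
}
```

For example, `expand(&DigitsWire::Sequential('\u{0660}'))` yields the Arabic-Indic digits U+0660 through U+0669. Because the check runs in `expand`, formatting code that reads the resulting array would not need to revalidate anything per use.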

Todo: add configurability for this.

@sffc In the long run CompactDecimal / etc should also be using this data. In that case, should we just always generate all known decimal systems? How would the unification work across keys?

Manishearth commented 1 week ago

Digits becomes much larger in "all" mode. I've added code for that but not hooked it in yet.

 decimal/digits@1, <lookup>, 550B, 77 identifiers
 decimal/digits@1, <total>, 3080B, 3420B, 77 unique payloads

sffc commented 1 week ago

> @sffc In the long run CompactDecimal / etc should also be using this data. In that case, should we just always generate all known decimal systems? How would the unification work across keys?

I don't really understand the question? CompactDecimalFormatter depends on FixedDecimalFormatter and so it should already be using the new data markers.

Manishearth commented 1 week ago

Ah, we lose out on some of our wins when we handle the fact that the symbols data can differ for a given locale between numbering systems.

 decimal/symbols@2, <lookup>, 1316B, 252 identifiers
 decimal/symbols@2, <total>, 2740B, 1368B, 49 unique payloads