unicode-org / icu4x

Solving i18n for client-side and resource-constrained environments.
https://icu4x.unicode.org

Baked data is bigger than postcard data #5429

Open sffc opened 3 months ago

sffc commented 3 months ago

I computed fingerprints.csv based on both baked_size and postcard_size.

Baked is equal in size or bigger than postcard for every data marker. A selection of the biggest offenders by overall size or percentage:

| Marker Path | Postcard Size | Baked Size | Growth |
|---|---:|---:|---:|
| list/or@2 | 3780B | 31520B | 734% |
| plurals/ranges@1 | 64B | 524B | 719% |
| percent/essentials@1 | 415B | 2945B | 610% |
| decimal/symbols@1 | 2040B | 8905B | 337% |
| relativetime/narrow/second@1 | 7678B | 26032B | 239% |
| units/displaynames@1 | 963404B | 2062721B | 114% |
| currency/extended@1 | 1908533B | 3134831B | 64.3% |
| displaynames/languages@1 | 1521724B | 1557050B | 2.32% |
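
For reference, the Growth column looks like the relative increase of baked over postcard, (baked - postcard) / postcard, rounded to three significant figures. That formula is inferred from the numbers rather than stated anywhere in this thread, and the function name `growth_percent` is made up for illustration. A minimal sketch that recomputes a few rows:

```rust
/// Relative growth of the baked size over the postcard size, in percent.
/// Assumption: this is how the Growth column was derived.
fn growth_percent(postcard_bytes: u64, baked_bytes: u64) -> f64 {
    (baked_bytes as f64 - postcard_bytes as f64) / postcard_bytes as f64 * 100.0
}

fn main() {
    // (marker, postcard bytes, baked bytes), copied from the table.
    let rows = [
        ("list/or@2", 3780, 31520),
        ("plurals/ranges@1", 64, 524),
        ("displaynames/languages@1", 1521724, 1557050),
    ];
    for (marker, postcard, baked) in rows {
        println!("{marker}: {:.2}%", growth_percent(postcard, baked));
    }
}
```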

The good news is that many of these keys will be improved under #5230 or #5379.

Should we do anything?

@robertbastian @Manishearth

robertbastian commented 3 months ago

For list, the explanation is that the data struct contains 10 Cows (4 patterns of 1 Cow each, plus two conditions of 3 Cows each), but it usually only encodes tiny strings (", ", " and ", etc.). The "unit" and "and" list data don't show up because the Spanish/Hebrew regexes equalise things between baked and postcard. So it's a similar problem to the decimal formatter.
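
To make the overhead concrete, here is a rough sketch comparing the two costs for ten tiny strings. It assumes each field is a plain `Cow<'static, str>` and that the serialized form is a one-byte length prefix plus the UTF-8 bytes (which is how postcard encodes short strings). The ten string values and both helper names are made up, and the real list data struct is not reproduced here.

```rust
use std::borrow::Cow;
use std::mem::size_of;

/// Rough in-memory cost of one field: the Cow itself plus the bytes it points to.
fn in_memory_bytes(s: &Cow<'static, str>) -> usize {
    size_of::<Cow<'static, str>>() + s.len()
}

/// Rough serialized cost: a one-byte length prefix plus the UTF-8 bytes.
/// Only valid for strings shorter than 128 bytes, where the varint length fits in one byte.
fn length_prefixed_bytes(s: &str) -> usize {
    assert!(s.len() < 128);
    1 + s.len()
}

fn main() {
    // Ten placeholder strings standing in for the ten Cows of one list pattern set.
    let fields: Vec<Cow<'static, str>> =
        [", ", ", ", ", or ", " or ", ", ", ", ", ", or ", " or ", "", ""]
            .into_iter()
            .map(Cow::Borrowed)
            .collect();

    let in_memory: usize = fields.iter().map(in_memory_bytes).sum();
    let serialized: usize = fields.iter().map(|s| length_prefixed_bytes(s)).sum();
    println!("in-memory estimate: {in_memory} bytes, serialized estimate: {serialized} bytes");
}
```

The fixed per-field cost of the `Cow` dominates when the payload is only a few bytes, which is the effect described above.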

Manishearth commented 3 months ago

"baked size" here is the size of the .rs file, yes?

robertbastian commented 3 months ago

no, an estimate for in-memory size, ignoring &'static deduplication

zbraniecki commented 3 months ago

in-memory size means actual PSS cost? Can we also calculate on-disk binary size impact?

robertbastian commented 3 months ago

If you tell me what PSS is I might be able to answer this

sffc commented 3 months ago

list/and@1 also showed up, but I didn't take the time to copy it into the table; I tried to include a representative cross-section in the OP.

"baked size" refers to the in-memory size based on the bake_size, which is core::mem::size_of plus borrows_size.

These numbers are roughly reflective of what happens when I compile ICU4X with the compiled_data feature versus when I build Postcard data with icu4x-datagen. compiled_data produces a larger binary than no-default-features with postcard.
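
As an illustration of the "size_of plus borrows_size" estimate mentioned above, here is a minimal sketch. The trait `BakedSizeEstimate` and the struct `Example` are hypothetical stand-ins written for this comment; they are not the databake API, and the real implementation may differ.

```rust
use core::mem::size_of;

/// Hypothetical stand-in for the estimate described above; not the databake API.
trait BakedSizeEstimate: Sized {
    /// Bytes reachable through borrows (string contents, slice elements, ...).
    fn borrows_size(&self) -> usize;

    /// Inline size of the value plus everything it borrows.
    fn bake_size(&self) -> usize {
        size_of::<Self>() + self.borrows_size()
    }
}

impl BakedSizeEstimate for u8 {
    fn borrows_size(&self) -> usize {
        0 // plain integers borrow nothing
    }
}

impl BakedSizeEstimate for &'static str {
    fn borrows_size(&self) -> usize {
        self.len()
    }
}

impl<T: BakedSizeEstimate + 'static> BakedSizeEstimate for &'static [T] {
    fn borrows_size(&self) -> usize {
        // Each element contributes its inline size and its own borrows.
        self.iter().map(|item| item.bake_size()).sum()
    }
}

/// Made-up data struct with two borrowed fields.
struct Example {
    separator: &'static str,
    codes: &'static [u8],
}

impl BakedSizeEstimate for Example {
    fn borrows_size(&self) -> usize {
        self.separator.borrows_size() + self.codes.borrows_size()
    }
}

fn main() {
    let data = Example { separator: ", ", codes: &[1, 2, 3] };
    println!("estimated baked size: {} bytes", data.bake_size());
}
```

Note that this sketch counts every borrow separately, so two fields pointing at the same static data are counted twice. That matches the "ignoring &'static deduplication" caveat earlier in the thread.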