Closed Manishearth closed 1 month ago
@sffc and I discussed this a bunch, in the context of fixing https://github.com/unicode-org/icu4x/issues/3766 and https://github.com/unicode-org/icu4x/issues/3761. Fixing those involves adding more data to datetime anyway, and we don't want to bump the data to V2 for that without doing it right.
The rough proposal we had was that we have the following main symbols keys:
The symbols keys use an auxiliary key (https://github.com/unicode-org/icu4x/issues/3632) to store the eight-way length distinction (abbreviated, narrow, short, wide) × (format, standalone). The current fallbacking between them will be performed either at datagen or via carefully done auxiliary key fallback (essentially, ensure that `und` is always empty for aux keys). See #3867.
Numeric becomes an optional auxiliary key (like another type of length) for calendars that have special formatting for numeric months (with leap year patterns), days, etc. We attempt to load it during construction but do not error if it is not found. We do not store leap year patterns for calendars that generate leap year names in a pattern-based way (Chinese, Dangi).
For the rare pattern that needs multiple lengths to format something, we can store additional loaded data in an Option on the DateTimeFormat.
Finally, `lengths` would be as we have today, but they may also include a numbering system hint/override (e.g. `hanidec`/`hanidays`). This may potentially be per-field[^1], which may mean we potentially load multiple number formatters. Currently the overrides are `hanidec`, `d=hanidays`, `hebr`, `M=romanlow`, and `y=jpanyear`. Since we don't have RBNF yet, I would recommend we just hardcode an enum for now and hardcode these numbering systems; it's not too hard to implement these in code and I think it's okay to do for such a small set.
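A minimal sketch of what hardcoding could look like, covering only two of the five overrides. The names `HardcodedNumbering`, `format_number`, and `roman_lower` are hypothetical and not ICU4X API, and the Roman numeral routine is a plain greedy algorithm assumed to be adequate for month numbers rather than a port of the CLDR RBNF rules.

```rust
/// Hypothetical enum of hardcoded numbering systems (illustrative, not ICU4X API).
/// `hanidays`, `hebr`, and `jpanyear` are left out of this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum HardcodedNumbering {
    Hanidec,
    Romanlow,
}

/// Positional Han decimal digits used by the `hanidec` numbering system.
const HANIDEC_DIGITS: [char; 10] = ['〇', '一', '二', '三', '四', '五', '六', '七', '八', '九'];

/// Formats `n` with the given hardcoded numbering system.
pub fn format_number(system: HardcodedNumbering, n: u32) -> String {
    match system {
        // Replace each ASCII decimal digit with its Han counterpart.
        HardcodedNumbering::Hanidec => n
            .to_string()
            .chars()
            .map(|c| HANIDEC_DIGITS[c.to_digit(10).unwrap() as usize])
            .collect(),
        HardcodedNumbering::Romanlow => roman_lower(n),
    }
}

/// Lowercase Roman numerals by greedy subtraction; returns an empty string for 0.
fn roman_lower(mut n: u32) -> String {
    const TABLE: [(u32, &str); 13] = [
        (1000, "m"), (900, "cm"), (500, "d"), (400, "cd"),
        (100, "c"), (90, "xc"), (50, "l"), (40, "xl"),
        (10, "x"), (9, "ix"), (5, "v"), (4, "iv"), (1, "i"),
    ];
    let mut out = String::new();
    for &(value, symbol) in TABLE.iter() {
        while n >= value {
            out.push_str(symbol);
            n -= value;
        }
    }
    out
}
```

With this sketch, `format_number(HardcodedNumbering::Romanlow, 12)` yields `xii`, which is the shape an `M=romanlow` month would take.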
cc @eggrobin who has thought about this a bit in the context of skeleta.
[^1]: E.g. in Chinese date formatting it is common to use hanidec or Latin for the year, hanidays for the day, and hans (spelled out Han) for the months. ICU4C currently handles this by using `d=hanidays` in the `dateFormats.[length].numbers` key and using month symbols to mimic `hans`.
cc @zbraniecki @robertbastian
Also, our plan for https://github.com/unicode-org/icu4x/issues/3766 and https://github.com/unicode-org/icu4x/issues/3761 for 1.3 is to just let it slip and document the Chinese calendar as being a preview calendar when it comes to formatting. We can clean up the placeholders and use mostly-correct placeholders instead.
Discussed `M` and `MMM` a bit; we can handle this with options. So this is the final design. However, this is not a design we can do for 1.3. I don't want to block 1.3 on this. Which means we have some short-term solutions:
Conclusion: implement option 2 for 1.3.
LGTM: @manishearth @sffc @echeran (no strong opinion: @robertbastian, @skius)
Discussion between @sffc and me on whether we should use aux keys or regular keys for lengths. We didn't dive too deep into the hour cycle part since that is something that can be more easily tweaked later (whereas the lengths are pervasive).
The main benefit of using separate (regular) keys is that they enable more build time slicing: if you know in advance what lengths you'll need, you can slice things appropriately. However, since most ways of interacting with this will be via skeletons or overall lengths, this becomes a bit less easy to do with the layers of indirection. We could potentially design a highly typed API in which datagen generates traits linking skeletons to keys, but this feels like overkill. It seems like the main win is only if the user can specify exactly what lengths they want.
Separate keys also have the advantage of being slightly smaller in databake (though not blob), because instead of storing a massive locale lookup array it can store a much smaller lookup array that is deduplicated across keys (especially if we choose to resolve length fallback during datagen).
On the other hand, aux keys are cleaner (we don't end up with hundreds of symbols keys) and easier to deal with. In the long run we can experiment with various horizontal fallback options (see discussion in https://github.com/unicode-org/icu4x/issues/3867). There may also be options for optimization in the future by passing around binary search hints.
One major benefit of aux keys is that users can slice them out if they would like (we can do a very simple fallback algorithm in our code to handle this: if you don't find long, go check out medium, etc.).
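The "very simple fallback algorithm" could look roughly like the sketch below. The `Length` enum, the particular fallback order in `fallback_chain`, and the `HashMap` standing in for a data provider are all assumptions made for illustration; the thread does not fix an order.

```rust
use std::collections::HashMap;

/// Hypothetical symbol lengths (illustrative, not ICU4X API).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Length {
    Wide,
    Abbreviated,
    Short,
    Narrow,
}

/// Assumed fallback order: the requested length first, then progressively
/// more likely-to-exist lengths. The real order would be a design decision.
fn fallback_chain(requested: Length) -> &'static [Length] {
    match requested {
        Length::Wide => &[Length::Wide, Length::Abbreviated],
        Length::Abbreviated => &[Length::Abbreviated, Length::Wide],
        Length::Short => &[Length::Short, Length::Abbreviated, Length::Wide],
        Length::Narrow => &[Length::Narrow, Length::Abbreviated, Length::Wide],
    }
}

/// Returns the first length in the chain that is present in `store`,
/// together with its symbols. `store` models data after the user has
/// sliced out some aux keys.
pub fn load_symbols<'a>(
    store: &'a HashMap<Length, Vec<&'static str>>,
    requested: Length,
) -> Option<(Length, &'a [&'static str])> {
    fallback_chain(requested)
        .iter()
        .find_map(|len| store.get(len).map(|v| (*len, v.as_slice())))
}
```

If only abbreviated data was kept, a request for `Length::Narrow` resolves to the abbreviated symbols instead of failing.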
We decided to go with aux keys for now. We may measure things later and see if there are other benefits.
Listing out aux keys for each thing:
(a/n/s/w = abbr/narrow/short/wide, f/s = format/standalone)
Given that standalone is the more rare one I would recommend having key names be stuff like `-x-a` and `-x-as` (i.e. "format" is implicit). Keeps it short, and lets us easily add standalone keys in the future for stuff like days where we don't have any usage right now.
Thought: we could use a digit corresponding to the number of symbols in https://unicode.org/reports/tr35/tr35-dates.html#Date_Field_Symbol_Table, like:

- `-x-3` = abbreviated
- `-x-4` = wide
- `-x-5` = narrow

And standalone could be

- `-x-3s` = abbreviated
- `-x-4s` = wide
- `-x-5s` = narrow

or maybe

- `-x-f3` = format abbreviated
- `-x-f4` = format wide
- `-x-f5` = format narrow
- `-x-s3` = standalone abbreviated
- `-x-s4` = standalone wide
- `-x-s5` = standalone narrow

Makes sense. My instinct is to let format be the "default" because in some cases there is no data for standalone and we can save space by hardcoding that assumption in ICU4X and datagen (but tweaking it in a backcompat way if it changes).
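As a sketch of the "digit plus optional `s`" variant with format as the implicit default, the mapping between (length, context) and the aux subtag could be written as below. The type and function names are hypothetical, and only the three lengths listed above are covered.

```rust
/// Hypothetical length and context types (illustrative, not ICU4X API).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Length {
    Abbreviated,
    Wide,
    Narrow,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Context {
    Format,
    Standalone,
}

/// Builds the aux subtag: `-x-<digit>` for format, `-x-<digit>s` for standalone.
pub fn aux_subtag(length: Length, context: Context) -> String {
    let digit = match length {
        Length::Abbreviated => '3',
        Length::Wide => '4',
        Length::Narrow => '5',
    };
    let mut subtag = String::from("-x-");
    subtag.push(digit);
    if context == Context::Standalone {
        subtag.push('s');
    }
    subtag
}

/// Parses a subtag produced by `aux_subtag`; returns `None` for anything else.
pub fn parse_aux_subtag(subtag: &str) -> Option<(Length, Context)> {
    let rest = subtag.strip_prefix("-x-")?;
    // A trailing `s` marks standalone; its absence means format.
    let (digit, context) = match rest.strip_suffix('s') {
        Some(d) => (d, Context::Standalone),
        None => (rest, Context::Format),
    };
    let length = match digit {
        "3" => Length::Abbreviated,
        "4" => Length::Wide,
        "5" => Length::Narrow,
        _ => return None,
    };
    Some((length, context))
}
```

Keeping format implicit means a calendar with no standalone data never needs an `s` subtag at all, which is the space saving argued for above.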
The current design for DTF integration is that we load one of each type of field needed (one month symbol, etc).
If a pattern needs multiple fields, we can later add in a `Map<Field, Box<dyn Any>>` situation for storing extra fields.
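A rough sketch of that `Map<Field, Box<dyn Any>>` idea, using `std::collections::HashMap` and `std::any::Any`. The `Field`, `MonthSymbols`, and `ExtraSymbols` types are hypothetical placeholders, not the real formatter internals.

```rust
use std::any::Any;
use std::collections::HashMap;

/// Hypothetical field identifier (illustrative, not ICU4X API).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Field {
    Month,
    Weekday,
    Year,
}

/// Hypothetical payload type for extra month symbols.
pub struct MonthSymbols(pub Vec<String>);

/// Type-erased storage for the rare pattern that needs a second set of symbols.
#[derive(Default)]
pub struct ExtraSymbols {
    map: HashMap<Field, Box<dyn Any>>,
}

impl ExtraSymbols {
    /// Stores `data` for `field`, replacing any previous entry.
    pub fn insert<T: Any>(&mut self, field: Field, data: T) {
        self.map.insert(field, Box::new(data));
    }

    /// Returns the entry for `field` if it exists and has type `T`.
    pub fn get<T: Any>(&self, field: Field) -> Option<&T> {
        self.map.get(&field)?.downcast_ref::<T>()
    }
}
```

The common path pays nothing for this: the map stays empty unless a pattern actually asks for more than one length of the same field.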
Some initial numbers of postcard with different fallback modes. Number in parentheses is the point in the postcard file at which the sorted locale lookup VarZeroVec ends and the data table begins.
Key | Postcard, Runtime | Postcard, Hybrid | Postcard, Data Only |
---|---|---|---|
datetime/gregory/datesymbols@1 | 186558 (0x869) | 190722 (0x15b5) | 184405 |
datetime/symbols/gregory/years@1 | 30101 (0x2058) | 49129 (0x60ac) | 21821 |
datetime/symbols/gregory/months@1 | 105222 (0x4de1) | 141505 (0xc936) | 85285 |
datetime/symbols/weekdays@1 | 76988 (0x5b47) | 129035 (0x10c42) | 53621 |
The sum of the data only size of the three split keys is 160727, which is smaller than the 184405 in the single combined key. However, since the split keys require more locale lookup tables, the overall size is a bit larger. We are investigating ways to reduce the size of the locale lookup tables (e.g. #2699).
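The arithmetic behind that comparison can be checked with a few lines. The numbers are copied from the Runtime and Data Only columns of the table; the constant and function names are made up for this sketch.

```rust
/// (key, postcard size in runtime mode, postcard size data only), from the table.
pub const COMBINED: (&str, u32, u32) = ("datetime/gregory/datesymbols@1", 186558, 184405);
pub const SPLIT: [(&str, u32, u32); 3] = [
    ("datetime/symbols/gregory/years@1", 30101, 21821),
    ("datetime/symbols/gregory/months@1", 105222, 85285),
    ("datetime/symbols/weekdays@1", 76988, 53621),
];

/// Bytes spent on the locale lookup table: total size minus data-only size.
pub fn lookup_overhead(total: u32, data_only: u32) -> u32 {
    total - data_only
}

/// Returns (sum of runtime sizes, sum of data-only sizes) for the split keys.
pub fn split_totals() -> (u32, u32) {
    let total: u32 = SPLIT.iter().map(|k| k.1).sum();
    let data_only: u32 = SPLIT.iter().map(|k| k.2).sum();
    (total, data_only)
}
```

The split keys come to 212311 bytes in runtime mode against 186558 for the combined key, even though their data-only sum (160727) is smaller. For the Runtime column, the difference between total and data-only size equals the offset shown in parentheses, for example 105222 − 85285 = 19937 = 0x4de1 for months.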
Example command line to generate one cell in the table: `cargo run --release --bin icu4x-datagen -- --format blob --locales full --keys "datetime/symbols/gregory/months@1" -f runtime-manual`
Very initial estimates for the impact of ZeroTrie on the postcard locale lookup table size, based on the strings in the compiled_data files (not the same set of locales as in the previous post):
Key | VZV, Runtime | ZT, Runtime |
---|---|---|
datetime/gregory/datesymbols@1 | 889 | 831 |
datetime/symbols/gregory/years@1 | 3935 | 2730 |
datetime/symbols/gregory/months@1 | 10223 | 5104 |
datetime/symbols/weekdays@1 | 10595 | 5096 |
So the bigger the VZV the bigger the win, with about a 50% win for the larger ones. If we project these ratios back to the full data set above, we stand to save something on the order of 25 kB in the sum of the split keys data size, which would bring the total split key size (runtime fallback mode, including lookup tables) down to just about the same as the combined key size.
I missed something in https://github.com/unicode-org/icu4x/issues/3865#issuecomment-1773976272. The lookup table is not only a VZV of locale strings; it is also an FZV of a mapping from the VZV index to the data blob index. With ZeroTrie we do not need that extra index-to-index table. If you include the extra table, the total lookup table size is about 15-20% higher than estimated. This means we should be able to cut an additional 5 kB by moving to ZeroTrie.
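To make the two-table layout concrete, here is a simplified model of it using plain `Vec`s in place of the zero-copy VZV and FZV. The `LocaleLookup` struct is a hypothetical stand-in and does not reflect the actual blob schema types.

```rust
/// Simplified stand-in for the current lookup table (not the real blob schema).
pub struct LocaleLookup {
    /// Sorted locale strings; plays the role of the VZV.
    pub locales: Vec<&'static str>,
    /// Parallel table mapping each locale position to a data blob index;
    /// plays the role of the FZV. Several locales may share one blob.
    pub blob_indices: Vec<u32>,
}

impl LocaleLookup {
    /// Two steps: binary search the sorted strings, then follow the index table.
    pub fn get(&self, locale: &str) -> Option<u32> {
        let pos = self
            .locales
            .binary_search_by(|probe| (*probe).cmp(locale))
            .ok()?;
        self.blob_indices.get(pos).copied()
    }
}
```

A trie that stores the blob index as the value at each key makes the second table unnecessary, which is where the additional saving described above comes from.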
I implemented a ZeroTrie version of BlobSchema in #4207. Results for Gregorian, runtime fallback, and all locales:
Data Key | Postcard Size |
---|---|
datesymbols | 185248 |
months | 90017 |
weekdays | 58578 |
years | 24893 |
The new keys are 173488 bytes total, now including locale lookup metadata, smaller than the combined key. 😃
> Numeric becomes an optional auxiliary key (like another type of length) for calendars that have special formatting for numeric months (with leap year patterns), days, etc. We attempt to load it during construction but do not error if it is not found. We do not store leap year patterns for calendars that generate leap year names in a pattern-based way (Chinese, Dangi).

Update: For months we're going to shove leap month formatting info onto the existing months key; hopefully it doesn't change data size.
For pattern numeric overrides we have a couple potential designs.
Some information before we dive in. Currently there are only a couple numeric overrides in use:
$ rg _numbers --no-filename | sort -u
"_numbers": "d=hanidays"
"_numbers": "hanidec"
"_numbers": "hebr"
"_numbers": "M=romanlow"
"_numbers": "y=jpanyear"
and they're only found for dateFormats (and skeletons):
The design I'm thinking about is one where we have a `datetime/patterns/<cal>/date/numeric@1` key with the same 4-6 aux subtags; however, it is sparse. It contains a single enum that has variants for all known CLDR numeric overrides, which can be expanded as needed (can even contain variants representing things like `d=hanidec,M=romanlow` if it ever comes to it).
For a calendar-locale-length combination, if CLDR has a `_numbers` key (note that CLDR sometimes has `_numbers` for only a subset of lengths!) then we generate this data for that length; otherwise we do not. `icu_datetime` will attempt to load `day/numeric@` for the corresponding length and resolved locale [^1] for the already loaded data, but not be perturbed if it can't find it.
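The sparse enum and the "generate only where CLDR has `_numbers`" rule could be sketched as follows. `NumericOverride`, `from_cldr`, and `overrides_for_lengths` are hypothetical names; the five string values are the ones found by the `rg` search above.

```rust
/// Hypothetical enum with one variant per known CLDR `_numbers` value.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NumericOverride {
    /// `hanidec` (applies to all numeric fields)
    Hanidec,
    /// `hebr` (applies to all numeric fields)
    Hebr,
    /// `d=hanidays`
    DayHanidays,
    /// `M=romanlow`
    MonthRomanlow,
    /// `y=jpanyear`
    YearJpanyear,
}

impl NumericOverride {
    /// Maps a raw CLDR `_numbers` string to a variant; unknown strings give `None`.
    pub fn from_cldr(numbers: &str) -> Option<Self> {
        match numbers {
            "hanidec" => Some(Self::Hanidec),
            "hebr" => Some(Self::Hebr),
            "d=hanidays" => Some(Self::DayHanidays),
            "M=romanlow" => Some(Self::MonthRomanlow),
            "y=jpanyear" => Some(Self::YearJpanyear),
            _ => None,
        }
    }
}

/// Datagen-side sketch: keep only the lengths that carry a `_numbers` value.
/// Unknown values are silently skipped here; real datagen would likely error.
pub fn overrides_for_lengths<'a>(
    entries: &[(&'a str, Option<&str>)],
) -> Vec<(&'a str, NumericOverride)> {
    entries
        .iter()
        .filter_map(|(length, numbers)| {
            let parsed = NumericOverride::from_cldr((*numbers)?)?;
            Some((*length, parsed))
        })
        .collect()
}
```

Lengths without an override produce no entry at all, which is what makes the key sparse.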
@sffc has an alternate solution: We treat these as new field values. In other words, `rU年MMMd` + `d=hanidays` is treated as something like `rU年MMM{d=hanidays}`, perhaps serialized as `rU年MMMdddddddddd` or something similar (start at a large number and add variants upwards).
We do have plenty of space in `FieldLength` to store this; we could have `FieldLength::Sixteen` onwards be our internal things, or alternatively have it be `FieldLength::CustomNumeric(CustomNumericFormat)` (and have it map to a number between 16 and 127).
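A sketch of the second option, where custom numeric formats occupy the 16 to 127 range of the serialized length. The `FieldLength` below is a deliberately simplified stand-in (a raw repeat count plus the custom variant), not the actual ICU4X enum, and the variants and discriminants of `CustomNumericFormat` are assumptions.

```rust
/// Hypothetical set of custom numeric formats with assumed discriminants.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CustomNumericFormat {
    Hanidec = 0,
    Hanidays = 1,
    Hebr = 2,
    Romanlow = 3,
    Jpanyear = 4,
}

impl CustomNumericFormat {
    fn from_u8(value: u8) -> Option<Self> {
        match value {
            0 => Some(Self::Hanidec),
            1 => Some(Self::Hanidays),
            2 => Some(Self::Hebr),
            3 => Some(Self::Romanlow),
            4 => Some(Self::Jpanyear),
            _ => None,
        }
    }
}

/// Simplified stand-in for `FieldLength` (not the real ICU4X type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FieldLength {
    /// Ordinary repeat count, 1..=15 in this sketch; `MMM` would be `Count(3)`.
    Count(u8),
    /// Internal-only length meaning "numeric, with this numbering override".
    CustomNumeric(CustomNumericFormat),
}

/// First serialized value reserved for custom numeric formats.
const CUSTOM_BASE: u8 = 16;

impl FieldLength {
    /// Serialized index: counts map to themselves, custom formats to 16 and up.
    pub fn to_idx(self) -> u8 {
        match self {
            FieldLength::Count(n) => n,
            FieldLength::CustomNumeric(format) => CUSTOM_BASE + format as u8,
        }
    }

    /// Inverse of `to_idx`; unassigned or out-of-range values give `None`.
    pub fn from_idx(idx: u8) -> Option<Self> {
        match idx {
            1..=15 => Some(FieldLength::Count(idx)),
            16..=127 => CustomNumericFormat::from_u8(idx - CUSTOM_BASE)
                .map(FieldLength::CustomNumeric),
            _ => None,
        }
    }
}
```

Because the override rides along inside the pattern's existing field encoding, no extra data key or lookup is needed at format time.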
[^1] There's a slight wrinkle here in case of `FallbackMode::Runtime`: if, say, zh-SG uses the same date patterns as zh-CN but also overrides the number format (or chooses not to!) the resolved locale after fallback will not work right. This may need special handling in the runtime mode code.
After convincing myself that we do have plenty of space I'm comfortable going with @sffc's option, since it doesn't increase data size at all except in JSON mode.
If people wish to come up with alternate serialization schemes like the `rU年MMM{d=hanidays}` or `rU年MMMd{hanidays}` we are open to suggestions!
> `datetime/patterns/<cal>/date/numeric@1`

What goes into this patterns key? Do you mean a symbol key like `datetime/symbols/<cal>/months/numeric@1`?
sorry, yes, symbols
> `icu_datetime` will attempt to load `day/numeric@` for the corresponding length and resolved locale

The other problem with this is that fallback isn't free. We should do things like this for less-common cases where we can improve code clarity, but numeric formatting of days, months, and years is a very hot path and I don't want to waste time doing a locale fallback.
Yeah that's a good point
Split out numeric symbols stuff in https://github.com/unicode-org/icu4x/issues/4242
This is done in neo.
DateSymbols is giant and has a lot of things inside it, only a fraction of which actually gets used once a formatter has been constructed.
We should split this type along day/month/year lines, as well as along pattern length lines. (And provide a compatibility path for pre-2.0 V1 data, as usual.)