Split DateSymbols data

Manishearth commented 1 year ago

DateSymbols is giant and has a lot of things inside it, only a fraction of which actually gets used once a formatter has been constructed.

We should split this type along day/month/year lines, as well as along pattern length lines. (And provide a compatibility path for pre-2.0 V1 data, as usual.)

Manishearth commented 1 year ago

@sffc and I discussed this a bunch, in the context of fixing https://github.com/unicode-org/icu4x/issues/3766 and https://github.com/unicode-org/icu4x/issues/3761, which involves adding more data to datetime anyway, and we don't want to introduce V2 data for that without doing it right.

The rough proposal we had was that we have the following main symbols keys:

Years
- Either era symbols or cyclic year symbols. It does not make much sense for a calendar to have both, but if it does we can add a third variant
Months
- month symbols
Weekdays
Days (maybe):
- we can store day names as well if we end up having patterns for day names like those in Chinese or Hindu
- worth sketching out, should not be a part of the MVP, can be added retroactively in the future
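
As a rough illustration of the "either era symbols or cyclic year symbols" idea, the Years payload could be modeled as an enum. This is only a sketch: `YearSymbols` and its method are hypothetical names, and a real ICU4X data struct would use zero-copy types rather than `Vec<String>`.

```rust
/// Hypothetical payload for a "years" symbols key: a calendar has either
/// era names or cyclic year names (a third variant could be added later).
#[derive(Debug, PartialEq)]
enum YearSymbols {
    /// Era names, indexed from 0.
    Eras(Vec<String>),
    /// Names for a repeating cycle of years, addressed by 1-based position.
    Cyclic(Vec<String>),
}

impl YearSymbols {
    /// Looks up a name. `index` is a 0-based era index for `Eras` and a
    /// 1-based position in the cycle for `Cyclic`.
    fn name(&self, index: usize) -> Option<&str> {
        match self {
            YearSymbols::Eras(names) => names.get(index).map(String::as_str),
            YearSymbols::Cyclic(names) => index
                .checked_sub(1)
                .and_then(|i| names.get(i))
                .map(String::as_str),
        }
    }
}

fn main() {
    let eras = YearSymbols::Eras(vec!["BCE".into(), "CE".into()]);
    println!("{:?}", eras.name(1));
    // First two names of the 60-year cycle, for illustration only.
    let cyclic = YearSymbols::Cyclic(vec!["甲子".into(), "乙丑".into()]);
    println!("{:?}", cyclic.name(1));
}
```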

The symbols keys use an auxiliary key (https://github.com/unicode-org/icu4x/issues/3632) to store the eight-way length distinction (abbreviated, narrow, short, wide) × (format, standalone). The current fallbacking between them will be performed either at datagen or via carefully done auxiliary key fallback (essentially, ensure that und is always empty for aux keys). See #3867.

Numeric becomes an optional auxiliary key (like another type of length) for calendars that have special formatting for numeric months (with leap year patterns), days, etc. We attempt to load it during construction but do not error if it is not found. We do not store leap year patterns for calendars that generate leap year names in a pattern based way (Chinese, Dangi).
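
A minimal sketch of the "attempt to load, but do not error if missing" behaviour for the numeric aux key. Everything here (`ToyProvider`, `MonthData`, the `"wide"`/`"numeric"` aux names, the locale strings) is a hypothetical stand-in, not the ICU4X data provider API.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
enum LoadError {
    Missing,
}

/// Toy stand-in for a data provider: maps (locale, aux key) to a payload.
struct ToyProvider(HashMap<(&'static str, &'static str), &'static str>);

impl ToyProvider {
    fn load(&self, locale: &'static str, aux: &'static str) -> Result<&'static str, LoadError> {
        self.0.get(&(locale, aux)).copied().ok_or(LoadError::Missing)
    }
}

#[derive(Debug, PartialEq)]
struct MonthData {
    wide: &'static str,
    numeric: Option<&'static str>,
}

fn load_months(provider: &ToyProvider, locale: &'static str) -> Result<MonthData, LoadError> {
    // The requested length is required: a miss is a real error.
    let wide = provider.load(locale, "wide")?;
    // "numeric" is optional: a miss only means there is no special numeric data.
    let numeric = provider.load(locale, "numeric").ok();
    Ok(MonthData { wide, numeric })
}

fn sample_provider() -> ToyProvider {
    let mut map = HashMap::new();
    map.insert(("loc-a", "wide"), "wide names A");
    map.insert(("loc-a", "numeric"), "numeric names A");
    map.insert(("loc-b", "wide"), "wide names B");
    ToyProvider(map)
}

fn main() {
    let provider = sample_provider();
    println!("{:?}", load_months(&provider, "loc-a"));
    println!("{:?}", load_months(&provider, "loc-b"));
    println!("{:?}", load_months(&provider, "loc-c"));
}
```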

For the rare pattern that needs multiple lengths to format something, we can store additional loaded data in an Option on the DateTimeFormat.

Finally, lengths would be as we have today, but they may also include a numbering system hint/override (e.g. hanidec/hanidays). This may potentially be per-field[^1], which may mean we potentially load multiple number formatters. Currently the overrides are hanidec, d=hanidays, hebr, M=romanlow, and y=jpanyear. Since we don't have RBNF yet, I would recommend we just hardcode an enum for now and hardcode these numbering systems; it's not too hard to implement these in code and I think it's okay to do for such a small set.

cc @eggrobin who has thought about this a bit in the context of skeleta.

[^1]: E.g. in Chinese date formatting it is common to use hanidec or Latin for the year, hanidays for the day, and hans (spelled out Han) for the months. ICU4C currently handles this by using d=hanidays in the dateFormats.[length].numbers key and using month symbols to mimic hans.

Manishearth commented 1 year ago

cc @zbraniecki @robertbastian

Manishearth commented 1 year ago

Also our plan for https://github.com/unicode-org/icu4x/issues/3766 and https://github.com/unicode-org/icu4x/issues/3761 for 1.3 is to just let it slip and document the Chinese calendar as being a preview calendar when it comes to formatting. We can clean up the placeholders and use mostly-correct placeholders instead.

Manishearth commented 1 year ago

Discussed a bit

@Manishearth - Leap month display is a bit of a mess; we can't use a string table because of numbering systems. A similar thing for cyclic years is that you need to pick from the 60 different year names, so we need to add more data (the 60 names). For leap months, either we need to include a pattern, or we need to include special data for numeric months. Right now we have a single DateTimeSymbols object, plus DateLengths. The longer term solution, documented in #3865, is that we split the date symbols data into smaller pieces: Years, which is either cyclic year symbols or era symbols (there is not currently a calendar that uses both, but we can support it later if needed); Months; Weekdays; and a Days key for day name formatting if we want to support that (ICU4C doesn't support it as far as we know). On top of this, we'll use aux keys to store the 8-way length distinction: Narrow, Abbreviated, Short, Wide, between Format and Standalone. We can load them via aux key fallback. Numeric becomes an additional data key that doesn't need to exist. We don't need to store leap year patterns then. There's a potential situation where you need multiple lengths to format a single field, like if a pattern contains both M and MMM; we can handle this with options. So this is the final design. However, this is not a design we can do for 1.3. I don't want to block 1.3 on this. Which means we have some short-term solutions:
1. Create new keys for leap months and cyclic years. The keys will be short-lived. We'll need to change the code again next time.
2. Hard-code the data for Chinese and Korean as a best-effort, and say that this functionality is preview.
3. Say that these calendars are unstable and we hide them.
@Manishearth - The main reason not to do 1 is that we have a better plan. Is there really value for doing the haphazard thing in 1.3? I think probably no.
@echeran - So that means that you'll get Chinese characters in the Chinese calendar even if that's not your locale?
@Manishearth - Yeah, which I think is okay. It's understandable, not ugly, and mostly correct for the main users.
@sffc - I think we should prioritize the good solution in 1.4 because we have users who need Chinese calendar. I want to get 1.3 out the door ASAP because it has things including compiled data. If we don't have time for the good solution in 1.4, then we could implement option 1 in 1.4.
@robertbastian - We need to write good docs that basically say that these calendars are experimental.

Conclusion: implement option 2 for 1.3.

LGTM: @manishearth @sffc @echeran (no strong opinion: @robertbastian, @skius)

Manishearth commented 1 year ago

Discussion between @sffc and me on whether we should use aux keys or regular keys for lengths. We didn't dive too deep into the hour cycle part since that is something that can be more easily tweaked later (whereas the lengths are pervasive).

The main benefit of using separate (regular) keys is that they enable more build time slicing: if you know in advance what lengths you'll need, you can slice things appropriately. However, since most ways of interacting with this will be via skeletons or overall lengths, this becomes a bit less easy to do with the layers of indirection. We could potentially design a highly typed API where datagen generates traits linking skeletons to keys, but this feels like overkill. It seems like the main win is only if the user can specify exactly what lengths they want.

Separate keys also have the advantage of being slightly smaller in databake (though not blob), because instead of storing a massive locale lookup array it can store a much smaller lookup array that is deduplicated across keys (especially if we choose to resolve length fallback during datagen).

On the other hand, aux keys are cleaner (we don't end up with hundreds of symbols keys) and easier to deal with. In the long run we can experiment with various horizontal fallback options (see discussion in https://github.com/unicode-org/icu4x/issues/3867). There may also be options for optimization in the future by passing around binary search hints.

One major benefit of aux keys is that users can slice them out if they would like (we can do a very simple fallback algorithm in our code to handle this: if you don't find long, go check out medium, etc.).
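
The "if you don't find long, go check out medium" idea could look roughly like the sketch below. The `Length` enum, the exact chain (full, then long, then medium, then short), and the map-based storage are assumptions made for illustration; the thread does not pin down the chain.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Length {
    Short,
    Medium,
    Long,
    Full,
}

/// Next length to try when `len` has been sliced out (assumed chain).
fn fallback(len: Length) -> Option<Length> {
    match len {
        Length::Full => Some(Length::Long),
        Length::Long => Some(Length::Medium),
        Length::Medium => Some(Length::Short),
        Length::Short => None,
    }
}

/// Returns the first payload found along the chain, with the length it came from.
fn load_with_fallback<'a>(
    data: &'a BTreeMap<Length, &'a str>,
    requested: Length,
) -> Option<(Length, &'a str)> {
    let mut current = Some(requested);
    while let Some(len) = current {
        if let Some(payload) = data.get(&len) {
            return Some((len, *payload));
        }
        current = fallback(len);
    }
    None
}

fn main() {
    // Pretend the user sliced out everything except the medium data.
    let mut data = BTreeMap::new();
    data.insert(Length::Medium, "medium payload");
    println!("{:?}", load_with_fallback(&data, Length::Long));
    println!("{:?}", load_with_fallback(&data, Length::Short));
}
```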

We decided to go with aux keys for now. We may measure things later and see if there are other benefits.

Manishearth commented 1 year ago

Listing out aux keys for each thing:

(a/n/s/w = abbr/narrow/short/wide, f/s = format/standalone)

Months: a/n/w × f/s
- Special key "numeric". We probably want a separate key for this, can be done later.
Weekdays: a/n/s/w × f/s
Quarters: a/n/w × f/s
(cyclic) Days: a/n/w × f/s
- I don't actually see anyone customizing standalone days.
Eras: names / abbr / narrow. @sffc is there any reason we should not just call "names" "wide" instead?
- They're called "wide" in the symbol table

Given that standalone is the rarer one, I would recommend having key names be stuff like -x-a and -x-as (i.e. "format" is implicit). Keeps it short, and lets us easily add standalone keys in the future for stuff like days where we don't have any usage right now.

sffc commented 1 year ago

Thought: we could use a digit corresponding to the number of symbols in https://unicode.org/reports/tr35/tr35-dates.html#Date_Field_Symbol_Table, like:

-x-3 = abbreviated
-x-4 = wide
-x-5 = narrow

And standalone could be

-x-3s = abbreviated
-x-4s = wide
-x-5s = narrow

or maybe

-x-f3 = format abbreviated
-x-f4 = format wide
-x-f5 = format narrow
-x-s3 = standalone abbreviated
-x-s4 = standalone wide
-x-s5 = standalone narrow

Manishearth commented 1 year ago

Makes sense. My instinct is to let format be the "default" because in some cases there is no data for standalone and we can save space by hardcoding that assumption in ICU4X and datagen (but tweaking it in a backcompat way if it changes)
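
A sketch of how the digit-based subtags with an implicit "format" default might be generated. The function and enum names are hypothetical; the digits 3/4/5 come from the proposal above, while 6 for "short" is an assumption based on the six-letter form in the TR35 Date Field Symbol Table.

```rust
#[derive(Debug, Clone, Copy)]
enum Length {
    Abbreviated,
    Wide,
    Narrow,
    Short,
}

#[derive(Debug, Clone, Copy)]
enum Context {
    Format,
    Standalone,
}

/// Builds the aux subtag: the digit is the number of pattern letters,
/// and only standalone gets a suffix ("format" is the implicit default).
fn aux_subtag(length: Length, context: Context) -> String {
    let digit = match length {
        Length::Abbreviated => 3,
        Length::Wide => 4,
        Length::Narrow => 5,
        Length::Short => 6, // assumption: not listed in the proposal above
    };
    let suffix = match context {
        Context::Format => "",
        Context::Standalone => "s",
    };
    format!("-x-{}{}", digit, suffix)
}

fn main() {
    for length in [Length::Abbreviated, Length::Wide, Length::Narrow, Length::Short] {
        for context in [Context::Format, Context::Standalone] {
            println!("{:?} {:?} => {}", length, context, aux_subtag(length, context));
        }
    }
}
```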

Manishearth commented 1 year ago

The current design for DTF integration is that we load one of each type of field needed (one month symbol, etc).

If a pattern needs multiple fields, we can later add in a Map<Field, Box<dyn Any>> situation for storing extra fields.
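
For reference, a `Map<Field, Box<dyn Any>>` for extra symbols could be sketched as follows. `Field`, `MonthSymbols`, and `ExtraSymbols` are hypothetical names; this only shows the type-erased storage and downcast pattern, not how DTF would actually be wired up.

```rust
use std::any::Any;
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum Field {
    Month,
    Weekday,
}

/// Example payload type for one field.
struct MonthSymbols(Vec<&'static str>);

/// Type-erased side storage for symbols beyond the one-per-field default.
struct ExtraSymbols {
    map: HashMap<Field, Box<dyn Any>>,
}

impl ExtraSymbols {
    fn new() -> Self {
        Self { map: HashMap::new() }
    }

    fn insert<T: Any>(&mut self, field: Field, value: T) {
        self.map.insert(field, Box::new(value));
    }

    /// Returns the stored value if the field is present and has type `T`.
    fn get<T: Any>(&self, field: Field) -> Option<&T> {
        self.map.get(&field)?.downcast_ref::<T>()
    }
}

fn main() {
    let mut extras = ExtraSymbols::new();
    extras.insert(Field::Month, MonthSymbols(vec!["Jan", "Feb"]));
    let months: Option<&MonthSymbols> = extras.get(Field::Month);
    println!("{:?}", months.map(|m| m.0.len()));
    println!("{}", extras.get::<MonthSymbols>(Field::Weekday).is_none());
}
```

The cost of this design is a runtime type check on every access, which is presumably acceptable if multi-length patterns are rare.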

sffc commented 1 year ago

Some initial postcard size numbers (in bytes) with different fallback modes. The number in parentheses is the point in the postcard file at which the sorted locale lookup VarZeroVec ends and the data table begins.

| Key | Postcard, Runtime | Postcard, Hybrid | Postcard, Data Only |
|---|---|---|---|
| datetime/gregory/datesymbols@1 | 186558 (0x869) | 190722 (0x15b5) | 184405 |
| datetime/symbols/gregory/years@1 | 30101 (0x2058) | 49129 (0x60ac) | 21821 |
| datetime/symbols/gregory/months@1 | 105222 (0x4de1) | 141505 (0xc936) | 85285 |
| datetime/symbols/weekdays@1 | 76988 (0x5b47) | 129035 (0x10c42) | 53621 |

The sum of the data only size of the three split keys is 160727, which is smaller than the 184405 in the single combined key. However, since the split keys require more locale lookup tables, the overall size is a bit larger. We are investigating ways to reduce the size of the locale lookup tables (e.g. #2699).

Example command line to generate one cell in the table: cargo run --release --bin icu4x-datagen -- --format blob --locales full --keys "datetime/symbols/gregory/months@1" -f runtime-manual
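
The arithmetic behind the comparison can be checked with a few lines of Rust. The byte counts are copied from the table above; the constant and function names are just for illustration.

```rust
/// (key, runtime-mode postcard size, data-only size) for the split keys.
const SPLIT_KEYS: [(&str, u32, u32); 3] = [
    ("years", 30101, 21821),
    ("months", 105222, 85285),
    ("weekdays", 76988, 53621),
];

/// (runtime-mode postcard size, data-only size) for the combined key.
const COMBINED: (u32, u32) = (186558, 184405);

/// Sums the split keys: (runtime total, data-only total).
fn totals() -> (u32, u32) {
    let runtime = SPLIT_KEYS.iter().map(|k| k.1).sum();
    let data_only = SPLIT_KEYS.iter().map(|k| k.2).sum();
    (runtime, data_only)
}

fn main() {
    let (runtime, data_only) = totals();
    println!("split, data only:       {}", data_only);
    println!("split, runtime:         {}", runtime);
    println!("split lookup overhead:  {}", runtime - data_only);
    println!("combined, runtime:      {}", COMBINED.0);
    println!("combined lookup overhead: {}", COMBINED.0 - COMBINED.1);
}
```

The per-key difference between the runtime and data-only sizes matches the hex offsets in the table (for example 30101 - 21821 = 8280 = 0x2058), which is consistent with the lookup tables accounting for the extra size.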

sffc commented 1 year ago

Very initial estimates for the impact of ZeroTrie on the postcard locale lookup table size, based on the strings in the compiled_data files (not the same set of locales as in the previous post):

| Key | VZV, Runtime | ZT, Runtime |
|---|---|---|
| datetime/gregory/datesymbols@1 | 889 | 831 |
| datetime/symbols/gregory/years@1 | 3935 | 2730 |
| datetime/symbols/gregory/months@1 | 10223 | 5104 |
| datetime/symbols/weekdays@1 | 10595 | 5096 |

So the bigger the VZV the bigger the win, with about a 50% win for the larger ones. If we project these ratios back to the full data set above, we stand to save something on the order of 25 kB in the sum of the split keys data size, which would bring the total split key size (runtime fallback mode, including lookup tables) down to just about the same as the combined key size.
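
One way to reproduce the projection, under the assumption that the full-data lookup table size of each split key is the hex offset from the earlier table (0x2058, 0x4de1, 0x5b47) and that the ZT/VZV ratio measured on compiled_data carries over unchanged:

```rust
/// (key, full-data lookup table bytes, VZV bytes, ZeroTrie bytes).
/// The first number is the hex offset from the earlier postcard table;
/// the other two come from the compiled_data estimate.
const KEYS: [(&str, f64, f64, f64); 3] = [
    ("years", 8280.0, 3935.0, 2730.0),
    ("months", 19937.0, 10223.0, 5104.0),
    ("weekdays", 23367.0, 10595.0, 5096.0),
];

/// Projected bytes saved if each lookup table shrinks by its ZT/VZV ratio.
fn projected_savings() -> f64 {
    KEYS.iter()
        .map(|&(_, full, vzv, zt)| full * (1.0 - zt / vzv))
        .sum()
}

fn main() {
    for &(name, full, vzv, zt) in KEYS.iter() {
        println!("{}: saves about {:.0} bytes", name, full * (1.0 - zt / vzv));
    }
    println!("total: about {:.0} bytes", projected_savings());
}
```

Under these assumptions the total lands on the order of 25 kB, in line with the estimate above.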

sffc commented 1 year ago

I missed something in https://github.com/unicode-org/icu4x/issues/3865#issuecomment-1773976272. The lookup table is not only a VZV of locale strings; it is also a FZV of a mapping from the VZV index to the data blob index. With ZeroTrie we do not need that extra index-to-index table. If you include the extra table, the total lookup table size is about 15-20% higher than estimated. This means we should be able to cut an additional 5 kB by moving to ZeroTrie.

sffc commented 1 year ago

I implemented a ZeroTrie version of BlobSchema in #4207. Results for Gregorian, runtime fallback, and all locales:

| Data Key | Postcard Size |
|---|---|
| datesymbols | 185248 |
| months | 90017 |
| weekdays | 58578 |
| years | 24893 |

The new keys are 173488 bytes total, now including locale lookup metadata, which is smaller than the combined key (185248 bytes). 😃

Manishearth commented 1 year ago

> Numeric becomes an optional auxiliary key (like another type of length) for calendars that have special formatting for numeric months (with leap year patterns), days, etc. We attempt to load it during construction but do not error if it is not found. We do not store leap year patterns for calendars that generate leap year names in a pattern based way (Chinese, Dangi).

Update: For months we're going to shove leap month formatting info onto the existing months key; hopefully it doesn't change data size.

For pattern numeric overrides we have a couple potential designs.

Some information before we dive in. Currently there are only a couple numeric overrides in use:

$ rg _numbers --no-filename | sort -u
                "_numbers": "d=hanidays"
                "_numbers": "hanidec"
                "_numbers": "hebr"
                "_numbers": "M=romanlow"
                "_numbers": "y=jpanyear"

and they're only found for dateFormats (and skeletons):

Keys that currently use numeric overrides

```text $ rg -l "_numbers" | xargs jq -c 'paths | select(.[-1] == "_numbers")' ["main","haw","dates","calendars","roc","dateFormats","short","_numbers"] ["main","haw","dates","calendars","roc","dateSkeletons","short","_numbers"] ["main","haw","dates","calendars","coptic","dateFormats","short","_numbers"] ["main","haw","dates","calendars","coptic","dateSkeletons","short","_numbers"] ["main","ja","dates","calendars","japanese","dateFormats","full","_numbers"] ["main","ja","dates","calendars","japanese","dateFormats","long","_numbers"] ["main","ja","dates","calendars","japanese","dateFormats","medium","_numbers"] ["main","ja","dates","calendars","japanese","dateSkeletons","medium","_numbers"] ["main","ja","dates","calendars","dangi","dateFormats","full","_numbers"] ["main","ja","dates","calendars","dangi","dateFormats","long","_numbers"] ["main","ja","dates","calendars","dangi","dateFormats","medium","_numbers"] ["main","ja","dates","calendars","dangi","dateSkeletons","full","_numbers"] ["main","ja","dates","calendars","dangi","dateSkeletons","long","_numbers"] ["main","ja","dates","calendars","dangi","dateSkeletons","medium","_numbers"] ["main","zh-Hans-SG","dates","calendars","dangi","dateFormats","full","_numbers"] ["main","zh-Hans-SG","dates","calendars","dangi","dateFormats","long","_numbers"] ["main","zh-Hans-SG","dates","calendars","dangi","dateFormats","medium","_numbers"] ["main","zh-Hans-SG","dates","calendars","dangi","dateSkeletons","full","_numbers"] ["main","zh-Hans-SG","dates","calendars","dangi","dateSkeletons","long","_numbers"] ["main","zh-Hans-SG","dates","calendars","dangi","dateSkeletons","medium","_numbers"] ["main","zh-Hans-MO","dates","calendars","dangi","dateFormats","full","_numbers"] ["main","zh-Hans-MO","dates","calendars","dangi","dateFormats","long","_numbers"] ["main","zh-Hans-MO","dates","calendars","dangi","dateFormats","medium","_numbers"] ["main","zh-Hans-MO","dates","calendars","dangi","dateSkeletons","full","_numbers"] 
["main","zh-Hans-MO","dates","calendars","dangi","dateSkeletons","long","_numbers"] ["main","zh-Hans-MO","dates","calendars","dangi","dateSkeletons","medium","_numbers"] ["main","yue","dates","calendars","dangi","dateSkeletons","full","_numbers"] ["main","yue","dates","calendars","dangi","dateSkeletons","long","_numbers"] ["main","zh-Hans-HK","dates","calendars","dangi","dateFormats","full","_numbers"] ["main","zh-Hans-HK","dates","calendars","dangi","dateFormats","long","_numbers"] ["main","zh-Hans-HK","dates","calendars","dangi","dateFormats","medium","_numbers"] ["main","zh-Hans-HK","dates","calendars","dangi","dateSkeletons","full","_numbers"] ["main","zh-Hans-HK","dates","calendars","dangi","dateSkeletons","long","_numbers"] ["main","zh-Hans-HK","dates","calendars","dangi","dateSkeletons","medium","_numbers"] ["main","yue-Hant","dates","calendars","dangi","dateSkeletons","full","_numbers"] ["main","yue-Hant","dates","calendars","dangi","dateSkeletons","long","_numbers"] ["main","zh-Hans-SG","dates","calendars","chinese","dateFormats","full","_numbers"] ["main","zh-Hans-SG","dates","calendars","chinese","dateFormats","long","_numbers"] ["main","zh-Hans-SG","dates","calendars","chinese","dateFormats","medium","_numbers"] ["main","zh-Hans-SG","dates","calendars","chinese","dateSkeletons","full","_numbers"] ["main","zh-Hans-SG","dates","calendars","chinese","dateSkeletons","long","_numbers"] ["main","zh-Hans-SG","dates","calendars","chinese","dateSkeletons","medium","_numbers"] ["main","zh-Hant","dates","calendars","chinese","dateFormats","full","_numbers"] ["main","zh-Hant","dates","calendars","chinese","dateFormats","long","_numbers"] ["main","zh-Hant","dates","calendars","chinese","dateFormats","medium","_numbers"] ["main","zh-Hant","dates","calendars","chinese","dateSkeletons","full","_numbers"] ["main","zh-Hant","dates","calendars","chinese","dateSkeletons","long","_numbers"] ["main","ja","dates","calendars","chinese","dateFormats","full","_numbers"] 
["main","ja","dates","calendars","chinese","dateFormats","long","_numbers"] ["main","ja","dates","calendars","chinese","dateFormats","medium","_numbers"] ["main","ja","dates","calendars","chinese","dateSkeletons","full","_numbers"] ["main","ja","dates","calendars","chinese","dateSkeletons","long","_numbers"] ["main","ja","dates","calendars","chinese","dateSkeletons","medium","_numbers"] ["main","haw","dates","calendars","indian","dateFormats","short","_numbers"] ["main","haw","dates","calendars","indian","dateSkeletons","short","_numbers"] ["main","zh-Hans","dates","calendars","dangi","dateFormats","full","_numbers"] ["main","zh-Hans","dates","calendars","dangi","dateFormats","long","_numbers"] ["main","zh-Hans","dates","calendars","dangi","dateFormats","medium","_numbers"] ["main","zh-Hans","dates","calendars","dangi","dateSkeletons","full","_numbers"] ["main","zh-Hans","dates","calendars","dangi","dateSkeletons","long","_numbers"] ["main","zh-Hans-MO","dates","calendars","chinese","dateFormats","full","_numbers"] ["main","zh-Hans-MO","dates","calendars","chinese","dateFormats","long","_numbers"] ["main","zh-Hans-MO","dates","calendars","chinese","dateFormats","medium","_numbers"] ["main","zh-Hans-MO","dates","calendars","chinese","dateSkeletons","full","_numbers"] ["main","zh-Hans-MO","dates","calendars","chinese","dateSkeletons","long","_numbers"] ["main","zh-Hans-MO","dates","calendars","chinese","dateSkeletons","medium","_numbers"] ["main","yue","dates","calendars","chinese","dateFormats","full","_numbers"] ["main","yue","dates","calendars","chinese","dateFormats","long","_numbers"] ["main","yue","dates","calendars","chinese","dateFormats","medium","_numbers"] ["main","yue","dates","calendars","chinese","dateSkeletons","full","_numbers"] ["main","yue","dates","calendars","chinese","dateSkeletons","long","_numbers"] ["main","zh","dates","calendars","dangi","dateFormats","full","_numbers"] ["main","zh","dates","calendars","dangi","dateFormats","long","_numbers"] 
["main","zh","dates","calendars","dangi","dateFormats","medium","_numbers"] ["main","zh","dates","calendars","dangi","dateSkeletons","full","_numbers"] ["main","zh","dates","calendars","dangi","dateSkeletons","long","_numbers"] ["main","zh-Hans-HK","dates","calendars","chinese","dateFormats","full","_numbers"] ["main","zh-Hans-HK","dates","calendars","chinese","dateFormats","long","_numbers"] ["main","zh-Hans-HK","dates","calendars","chinese","dateFormats","medium","_numbers"] ["main","zh-Hans-HK","dates","calendars","chinese","dateSkeletons","full","_numbers"] ["main","zh-Hans-HK","dates","calendars","chinese","dateSkeletons","long","_numbers"] ["main","zh-Hans-HK","dates","calendars","chinese","dateSkeletons","medium","_numbers"] ["main","yue-Hant","dates","calendars","chinese","dateFormats","full","_numbers"] ["main","yue-Hant","dates","calendars","chinese","dateFormats","long","_numbers"] ["main","yue-Hant","dates","calendars","chinese","dateFormats","medium","_numbers"] ["main","yue-Hant","dates","calendars","chinese","dateSkeletons","full","_numbers"] ["main","yue-Hant","dates","calendars","chinese","dateSkeletons","long","_numbers"] ["main","zh-Hans","dates","calendars","chinese","dateFormats","full","_numbers"] ["main","zh-Hans","dates","calendars","chinese","dateFormats","long","_numbers"] ["main","zh-Hans","dates","calendars","chinese","dateFormats","medium","_numbers"] ["main","zh-Hans","dates","calendars","chinese","dateSkeletons","full","_numbers"] ["main","zh-Hans","dates","calendars","chinese","dateSkeletons","long","_numbers"] ["main","yue-Hans","dates","calendars","chinese","dateFormats","full","_numbers"] ["main","yue-Hans","dates","calendars","chinese","dateFormats","long","_numbers"] ["main","yue-Hans","dates","calendars","chinese","dateFormats","medium","_numbers"] ["main","yue-Hans","dates","calendars","chinese","dateSkeletons","full","_numbers"] ["main","yue-Hans","dates","calendars","chinese","dateSkeletons","long","_numbers"] 
["main","zh","dates","calendars","chinese","dateFormats","full","_numbers"] ["main","zh","dates","calendars","chinese","dateFormats","long","_numbers"] ["main","zh","dates","calendars","chinese","dateFormats","medium","_numbers"] ["main","zh","dates","calendars","chinese","dateSkeletons","full","_numbers"] ["main","zh","dates","calendars","chinese","dateSkeletons","long","_numbers"] ["main","ja","dates","calendars","japanese","dateFormats","full","_numbers"] ["main","ja","dates","calendars","japanese","dateFormats","long","_numbers"] ["main","ja","dates","calendars","japanese","dateFormats","medium","_numbers"] ["main","ja","dates","calendars","japanese","dateSkeletons","medium","_numbers"] ["main","zh-Hans-SG","dates","calendars","dangi","dateFormats","full","_numbers"] ["main","zh-Hans-SG","dates","calendars","dangi","dateFormats","long","_numbers"] ["main","zh-Hans-SG","dates","calendars","dangi","dateFormats","medium","_numbers"] ["main","zh-Hans-SG","dates","calendars","dangi","dateSkeletons","full","_numbers"] ["main","zh-Hans-SG","dates","calendars","dangi","dateSkeletons","long","_numbers"] ["main","zh-Hans-SG","dates","calendars","dangi","dateSkeletons","medium","_numbers"] ["main","ja","dates","calendars","dangi","dateFormats","full","_numbers"] ["main","ja","dates","calendars","dangi","dateFormats","long","_numbers"] ["main","ja","dates","calendars","dangi","dateFormats","medium","_numbers"] ["main","ja","dates","calendars","dangi","dateSkeletons","full","_numbers"] ["main","ja","dates","calendars","dangi","dateSkeletons","long","_numbers"] ["main","ja","dates","calendars","dangi","dateSkeletons","medium","_numbers"] ["main","haw","dates","calendars","buddhist","dateFormats","short","_numbers"] ["main","haw","dates","calendars","buddhist","dateSkeletons","short","_numbers"] ["main","haw","dates","calendars","persian","dateFormats","short","_numbers"] ["main","haw","dates","calendars","persian","dateSkeletons","short","_numbers"] 
["main","haw","dates","calendars","japanese","dateFormats","short","_numbers"] ["main","haw","dates","calendars","japanese","dateSkeletons","short","_numbers"] ["main","haw","dates","calendars","ethiopic-amete-alem","dateFormats","short","_numbers"] ["main","haw","dates","calendars","ethiopic-amete-alem","dateSkeletons","short","_numbers"] ["main","haw","dates","calendars","ethiopic","dateFormats","short","_numbers"] ["main","haw","dates","calendars","ethiopic","dateSkeletons","short","_numbers"] ["main","zh-Hans-MO","dates","calendars","dangi","dateFormats","full","_numbers"] ["main","zh-Hans-MO","dates","calendars","dangi","dateFormats","long","_numbers"] ["main","zh-Hans-MO","dates","calendars","dangi","dateFormats","medium","_numbers"] ["main","zh-Hans-MO","dates","calendars","dangi","dateSkeletons","full","_numbers"] ["main","zh-Hans-MO","dates","calendars","dangi","dateSkeletons","long","_numbers"] ["main","zh-Hans-MO","dates","calendars","dangi","dateSkeletons","medium","_numbers"] ["main","yue","dates","calendars","dangi","dateSkeletons","full","_numbers"] ["main","yue","dates","calendars","dangi","dateSkeletons","long","_numbers"] ["main","zh-Hans-HK","dates","calendars","dangi","dateFormats","full","_numbers"] ["main","zh-Hans-HK","dates","calendars","dangi","dateFormats","long","_numbers"] ["main","zh-Hans-HK","dates","calendars","dangi","dateFormats","medium","_numbers"] ["main","zh-Hans-HK","dates","calendars","dangi","dateSkeletons","full","_numbers"] ["main","zh-Hans-HK","dates","calendars","dangi","dateSkeletons","long","_numbers"] ["main","zh-Hans-HK","dates","calendars","dangi","dateSkeletons","medium","_numbers"] ["main","yue-Hant","dates","calendars","dangi","dateSkeletons","full","_numbers"] ["main","yue-Hant","dates","calendars","dangi","dateSkeletons","long","_numbers"] ["main","zh-Hans","dates","calendars","dangi","dateFormats","full","_numbers"] ["main","zh-Hans","dates","calendars","dangi","dateFormats","long","_numbers"] 
["main","zh-Hans","dates","calendars","dangi","dateFormats","medium","_numbers"] ["main","zh-Hans","dates","calendars","dangi","dateSkeletons","full","_numbers"] ["main","zh-Hans","dates","calendars","dangi","dateSkeletons","long","_numbers"] ["main","zh","dates","calendars","dangi","dateFormats","full","_numbers"] ["main","zh","dates","calendars","dangi","dateFormats","long","_numbers"] ["main","zh","dates","calendars","dangi","dateFormats","medium","_numbers"] ["main","zh","dates","calendars","dangi","dateSkeletons","full","_numbers"] ["main","zh","dates","calendars","dangi","dateSkeletons","long","_numbers"] ["main","haw","dates","calendars","generic","dateFormats","short","_numbers"] ["main","haw","dates","calendars","generic","dateSkeletons","short","_numbers"] ["main","haw","dates","calendars","gregorian","dateFormats","short","_numbers"] ["main","haw","dates","calendars","gregorian","dateSkeletons","short","_numbers"] ["main","haw","dates","calendars","hebrew","dateFormats","short","_numbers"] ["main","haw","dates","calendars","hebrew","dateSkeletons","short","_numbers"] ["main","he","dates","calendars","hebrew","dateFormats","full","_numbers"] ["main","he","dates","calendars","hebrew","dateFormats","long","_numbers"] ["main","he","dates","calendars","hebrew","dateFormats","medium","_numbers"] ["main","he","dates","calendars","hebrew","dateFormats","short","_numbers"] ["main","he","dates","calendars","hebrew","dateSkeletons","full","_numbers"] ["main","he","dates","calendars","hebrew","dateSkeletons","long","_numbers"] ["main","he","dates","calendars","hebrew","dateSkeletons","medium","_numbers"] ["main","he","dates","calendars","hebrew","dateSkeletons","short","_numbers"] ["main","yi","dates","calendars","hebrew","dateFormats","full","_numbers"] ["main","yi","dates","calendars","hebrew","dateFormats","long","_numbers"] ["main","yi","dates","calendars","hebrew","dateFormats","medium","_numbers"] 
["main","yi","dates","calendars","hebrew","dateFormats","short","_numbers"] ["main","yi","dates","calendars","hebrew","dateSkeletons","full","_numbers"] ["main","yi","dates","calendars","hebrew","dateSkeletons","long","_numbers"] ["main","yi","dates","calendars","hebrew","dateSkeletons","medium","_numbers"] ["main","yi","dates","calendars","hebrew","dateSkeletons","short","_numbers"] ["main","he","dates","calendars","hebrew","dateFormats","full","_numbers"] ["main","he","dates","calendars","hebrew","dateFormats","long","_numbers"] ["main","he","dates","calendars","hebrew","dateFormats","medium","_numbers"] ["main","he","dates","calendars","hebrew","dateFormats","short","_numbers"] ["main","he","dates","calendars","hebrew","dateSkeletons","full","_numbers"] ["main","he","dates","calendars","hebrew","dateSkeletons","long","_numbers"] ["main","he","dates","calendars","hebrew","dateSkeletons","medium","_numbers"] ["main","he","dates","calendars","hebrew","dateSkeletons","short","_numbers"] ["main","zh-Hans-SG","dates","calendars","chinese","dateFormats","full","_numbers"] ["main","zh-Hans-SG","dates","calendars","chinese","dateFormats","long","_numbers"] ["main","zh-Hans-SG","dates","calendars","chinese","dateFormats","medium","_numbers"] ["main","zh-Hans-SG","dates","calendars","chinese","dateSkeletons","full","_numbers"] ["main","zh-Hans-SG","dates","calendars","chinese","dateSkeletons","long","_numbers"] ["main","zh-Hans-SG","dates","calendars","chinese","dateSkeletons","medium","_numbers"] ["main","zh-Hant","dates","calendars","chinese","dateFormats","full","_numbers"] ["main","zh-Hant","dates","calendars","chinese","dateFormats","long","_numbers"] ["main","zh-Hant","dates","calendars","chinese","dateFormats","medium","_numbers"] ["main","zh-Hant","dates","calendars","chinese","dateSkeletons","full","_numbers"] ["main","zh-Hant","dates","calendars","chinese","dateSkeletons","long","_numbers"] 
["main","ja","dates","calendars","chinese","dateFormats","full","_numbers"] ["main","ja","dates","calendars","chinese","dateFormats","long","_numbers"] ["main","ja","dates","calendars","chinese","dateFormats","medium","_numbers"] ["main","ja","dates","calendars","chinese","dateSkeletons","full","_numbers"] ["main","ja","dates","calendars","chinese","dateSkeletons","long","_numbers"] ["main","ja","dates","calendars","chinese","dateSkeletons","medium","_numbers"] ["main","zh-Hans-MO","dates","calendars","chinese","dateFormats","full","_numbers"] ["main","zh-Hans-MO","dates","calendars","chinese","dateFormats","long","_numbers"] ["main","zh-Hans-MO","dates","calendars","chinese","dateFormats","medium","_numbers"] ["main","zh-Hans-MO","dates","calendars","chinese","dateSkeletons","full","_numbers"] ["main","zh-Hans-MO","dates","calendars","chinese","dateSkeletons","long","_numbers"] ["main","zh-Hans-MO","dates","calendars","chinese","dateSkeletons","medium","_numbers"] ["main","yue","dates","calendars","chinese","dateFormats","full","_numbers"] ["main","yue","dates","calendars","chinese","dateFormats","long","_numbers"] ["main","yue","dates","calendars","chinese","dateFormats","medium","_numbers"] ["main","yue","dates","calendars","chinese","dateSkeletons","full","_numbers"] ["main","yue","dates","calendars","chinese","dateSkeletons","long","_numbers"] ["main","yue-Hant","dates","calendars","chinese","dateFormats","full","_numbers"] ["main","yue-Hant","dates","calendars","chinese","dateFormats","long","_numbers"] ["main","yue-Hant","dates","calendars","chinese","dateFormats","medium","_numbers"] ["main","yue-Hant","dates","calendars","chinese","dateSkeletons","full","_numbers"] ["main","yue-Hant","dates","calendars","chinese","dateSkeletons","long","_numbers"] ["main","zh-Hans-HK","dates","calendars","chinese","dateFormats","full","_numbers"] ["main","zh-Hans-HK","dates","calendars","chinese","dateFormats","long","_numbers"] 
["main","zh-Hans-HK","dates","calendars","chinese","dateFormats","medium","_numbers"] ["main","zh-Hans-HK","dates","calendars","chinese","dateSkeletons","full","_numbers"] ["main","zh-Hans-HK","dates","calendars","chinese","dateSkeletons","long","_numbers"] ["main","zh-Hans-HK","dates","calendars","chinese","dateSkeletons","medium","_numbers"] ["main","haw","dates","calendars","islamic","dateFormats","short","_numbers"] ["main","haw","dates","calendars","islamic","dateSkeletons","short","_numbers"] ["main","haw","dates","calendars","islamic-civil","dateFormats","short","_numbers"] ["main","haw","dates","calendars","islamic-civil","dateSkeletons","short","_numbers"] ["main","haw","dates","calendars","islamic-rgsa","dateFormats","short","_numbers"] ["main","haw","dates","calendars","islamic-rgsa","dateSkeletons","short","_numbers"] ["main","haw","dates","calendars","islamic-tbla","dateFormats","short","_numbers"] ["main","haw","dates","calendars","islamic-tbla","dateSkeletons","short","_numbers"] ["main","haw","dates","calendars","islamic-umalqura","dateFormats","short","_numbers"] ["main","haw","dates","calendars","islamic-umalqura","dateSkeletons","short","_numbers"] ["main","zh-Hans","dates","calendars","chinese","dateFormats","full","_numbers"] ["main","zh-Hans","dates","calendars","chinese","dateFormats","long","_numbers"] ["main","zh-Hans","dates","calendars","chinese","dateFormats","medium","_numbers"] ["main","zh-Hans","dates","calendars","chinese","dateSkeletons","full","_numbers"] ["main","zh-Hans","dates","calendars","chinese","dateSkeletons","long","_numbers"] ["main","yue-Hans","dates","calendars","chinese","dateFormats","full","_numbers"] ["main","yue-Hans","dates","calendars","chinese","dateFormats","long","_numbers"] ["main","yue-Hans","dates","calendars","chinese","dateFormats","medium","_numbers"] ["main","yue-Hans","dates","calendars","chinese","dateSkeletons","full","_numbers"] 
["main","yue-Hans","dates","calendars","chinese","dateSkeletons","long","_numbers"] ["main","zh","dates","calendars","chinese","dateFormats","full","_numbers"] ["main","zh","dates","calendars","chinese","dateFormats","long","_numbers"] ["main","zh","dates","calendars","chinese","dateFormats","medium","_numbers"] ["main","zh","dates","calendars","chinese","dateSkeletons","full","_numbers"] ["main","zh","dates","calendars","chinese","dateSkeletons","long","_numbers"] ```

The design I'm thinking about is one where we have a datetime/patterns/<cal>/date/numeric@1 key with the same 4-6 aux subtags; however, it is sparse. It contains a single enum with variants for all known CLDR numeric overrides, which can be expanded as needed (it can even contain variants representing things like d=hanidec,M=romanlow if it ever comes to it).
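To make the "single enum" idea concrete, here is a minimal sketch. `NumericOverride` and `from_cldr` are hypothetical names, not ICU4X types, and the override strings are spelled the way they appear in this thread rather than checked against CLDR.

```rust
/// Hypothetical closed set of numeric overrides (illustration only).
/// New variants would be appended as CLDR grows new `_numbers` values.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NumericOverride {
    /// `d=hanidays`: only the day field uses the override.
    DayHanidays,
    /// `d=hanidec,M=romanlow`: the combined case mentioned in the comment.
    DayHanidecMonthRomanlow,
}

impl NumericOverride {
    /// Maps a `_numbers` string to a variant.
    /// Returns `None` for anything the enum does not know about, so a
    /// datagen step could fail loudly instead of silently dropping data.
    pub fn from_cldr(numbers: &str) -> Option<Self> {
        match numbers {
            "d=hanidays" => Some(Self::DayHanidays),
            "d=hanidec,M=romanlow" => Some(Self::DayHanidecMonthRomanlow),
            _ => None,
        }
    }
}
```

Because the set is closed, the stored payload is a single small discriminant rather than a string.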

For a calendar-locale-length combination, if CLDR has a _numbers key (note that CLDR sometimes has _numbers for only a subset of lengths!) then we generate this data for that length; otherwise we do not. icu_datetime will attempt to load day/numeric@ for the corresponding length and resolved locale [^1] for the already loaded data, but not be perturbed if it can't find it.
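A rough sketch of the "load it if present, carry on if not" behaviour, using a toy in-memory map instead of a real data provider. All names here (`SparseNumericData`, `OverrideId`, `Numbering`, `sample_data`) are made up for illustration. The one sample entry mirrors the listing above, where haw only has `_numbers` for the short length of the islamic calendar. The stored id is a placeholder, since the listing does not show the actual override value.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Length { Full, Long, Medium, Short }

/// Placeholder for whatever the numeric key would actually store.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct OverrideId(pub u8);

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Numbering {
    /// No override data was found; format digits the default way.
    Default,
    /// Sparse override data exists for this combination.
    Override(OverrideId),
}

/// Toy stand-in for the sparse key: only combinations that have a
/// `_numbers` entry in CLDR get an entry here.
#[derive(Default)]
pub struct SparseNumericData {
    entries: HashMap<(String, String, Length), OverrideId>,
}

impl SparseNumericData {
    pub fn insert(&mut self, calendar: &str, locale: &str, length: Length, id: OverrideId) {
        self.entries
            .insert((calendar.to_string(), locale.to_string(), length), id);
    }

    /// A missing entry is not an error; it just means "no override".
    pub fn resolve(&self, calendar: &str, locale: &str, length: Length) -> Numbering {
        match self
            .entries
            .get(&(calendar.to_string(), locale.to_string(), length))
        {
            Some(id) => Numbering::Override(*id),
            None => Numbering::Default,
        }
    }
}

/// Sample data: haw + islamic has an override for `short` only.
pub fn sample_data() -> SparseNumericData {
    let mut data = SparseNumericData::default();
    data.insert("islamic", "haw", Length::Short, OverrideId(0));
    data
}
```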

@sffc has an alternate solution: We treat these as new field values. In other words, rU年MMMd + d=hanidays is treated as something like rU年MMM{d=hanidays}, perhaps serialized as rU年MMMdddddddddd or something similar (start at a large number and add variants upwards).

We do have plenty of space in FieldLength to store this: we could have FieldLength::Sixteen onwards be our internal things, or alternatively have it be FieldLength::CustomNumeric(CustomNumericFormat) (and have it map to a number between 16 and 127).
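A minimal sketch of how the 16..=127 range could be carved out, assuming a simplified `FieldLength`. The real ICU4X enum has more variants; `Plain`, `to_idx`, `from_idx`, and the `Hanidays` variant are hypothetical names used only to show the mapping.

```rust
/// Hypothetical set of custom numeric formats, numbered from 0.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u8)]
pub enum CustomNumericFormat {
    Hanidays = 0,
}

impl CustomNumericFormat {
    fn from_offset(offset: u8) -> Option<Self> {
        match offset {
            0 => Some(Self::Hanidays),
            _ => None,
        }
    }
}

/// Simplified stand-in for FieldLength: ordinary repeat counts plus
/// the proposed custom-numeric variant.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FieldLength {
    /// An ordinary pattern length (1..=15 repeated letters).
    Plain(u8),
    CustomNumeric(CustomNumericFormat),
}

pub const CUSTOM_START: u8 = 16;
pub const CUSTOM_END: u8 = 127;

impl FieldLength {
    /// Serializes to the repeat count stored in the pattern.
    pub fn to_idx(self) -> Option<u8> {
        match self {
            FieldLength::Plain(n) if (1..CUSTOM_START).contains(&n) => Some(n),
            FieldLength::Plain(_) => None,
            FieldLength::CustomNumeric(f) => Some(CUSTOM_START + f as u8),
        }
    }

    /// Parses a repeat count; unknown custom values are rejected.
    pub fn from_idx(idx: u8) -> Option<Self> {
        match idx {
            1..=15 => Some(FieldLength::Plain(idx)),
            CUSTOM_START..=CUSTOM_END => {
                CustomNumericFormat::from_offset(idx - CUSTOM_START)
                    .map(FieldLength::CustomNumeric)
            }
            _ => None,
        }
    }
}
```

Under this mapping the day field in the rU年MMM{d=hanidays} example would be stored with length 16, and new formats take the next free number upwards.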

[^1] There's a slight wrinkle here in case of FallbackMode::Runtime: if, say, zh-SG uses the same date patterns as zh-CN but also overrides the number format (or chooses not to!) the resolved locale after fallback will not work right. This may need special handling in the runtime mode code.

Manishearth commented 1 year ago

After convincing myself that we do have plenty of space, I'm comfortable going with @sffc's option, since it doesn't increase data size at all except in JSON mode.

If people wish to come up with alternate serialization schemes like the rU年MMM{d=hanidays} or rU年MMMd{hanidays} we are open to suggestions!

sffc commented 1 year ago

> datetime/patterns/<cal>/date/numeric@1

What goes into this patterns key? Do you mean a symbol key like datetime/symbols/<cal>/months/numeric@1?

Manishearth commented 1 year ago

sorry, yes, symbols

sffc commented 1 year ago

> icu_datetime will attempt to load day/numeric@ for the corresponding length and resolved locale

The other problem with this is that fallback isn't free. We should do things like this for less-common cases where we can improve code clarity, but numeric formatting of days, months, and years is a very hot path and I don't want to waste time doing a locale fallback.
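To illustrate why a miss is not free, here is a deliberately simplified fallback walk that counts lookups. It only strips trailing subtags and then tries und. Real ICU4X fallback is more involved, so treat `fallback_chain` and `lookups_until_hit` as hypothetical helpers.

```rust
use std::collections::HashSet;

/// Simplified chain: drop the last subtag each step, then end at "und".
pub fn fallback_chain(locale: &str) -> Vec<String> {
    let mut chain = Vec::new();
    let mut current = locale;
    loop {
        chain.push(current.to_string());
        match current.rfind('-') {
            Some(pos) => current = &current[..pos],
            None => break,
        }
    }
    if chain.last().map(String::as_str) != Some("und") {
        chain.push("und".to_string());
    }
    chain
}

/// Returns the locale that had data (if any) and how many lookups it took.
pub fn lookups_until_hit(locale: &str, available: &HashSet<&str>) -> (Option<String>, usize) {
    let mut lookups = 0;
    for candidate in fallback_chain(locale) {
        lookups += 1;
        if available.contains(candidate.as_str()) {
            return (Some(candidate), lookups);
        }
    }
    (None, lookups)
}
```

In this model a locale with no override data at all pays for the whole chain on every construction, which is the cost being objected to for the common numeric case.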

Manishearth commented 1 year ago

Yeah that's a good point

Manishearth commented 1 year ago

Split out numeric symbols stuff in https://github.com/unicode-org/icu4x/issues/4242

Manishearth commented 1 month ago

This is done in neo.

unicode-org / icu4x

Split DateSymbols data #3865