unicode-org / icu4x

Solving i18n for client-side and resource-constrained environments.
https://icu4x.unicode.org
Other
1.38k stars 178 forks source link

Create an Ideal Components Bag / Skeleton for DateTimeFormat #1317

Open gregtatum opened 3 years ago

gregtatum commented 3 years ago

This is a meta issue to track implementing the "ideal components bag" as laid out in the DateTimeFormat Deep Dive 2021-10-01 design document. Originally there was some discussion to have this replace the current components bag, but it is to be implemented alongside the existing components bag. A better name can be bikeshed if needed.

The following need to be completed.

sffc commented 2 years ago

@gregtatum will provide mentorship.

zbraniecki commented 2 years ago

I'm spreading the word about this issue looking for candidates.

More details:

Description: Currently, DateTimeFormat has two ways to select the right format, both of them are imperfect. We believe we have a balanced novel solution that, once implemented, will become the foundational use of the DateTimeFormat. Scope: We believe that the initial implementation should take one person several (2-3) months to implement. Hopefully in time for ICU4X 1.0. Mentorship: This project is well staffed on the mentorship side with @gregtatum from Mozilla, @sffc from Google and @zbraniecki from Amazon willing to invest time to mentor the engineer who'll pick it. How to start: If you are interested in the project, comment in this issue or join unicode-org.slack.com #icu4x and we'll get you on-ramped.

ozghimire commented 2 years ago

I'm interested to work on this issue.

zbraniecki commented 2 years ago

@gregtatum are you still open to mentor?

pdogr commented 2 years ago

If this issue is still open, I'm definitely interested to work on this.

gregtatum commented 2 years ago

@ozghimire Great! How would you prefer to get started? There is a document linked above outlining the strategy which should discuss how to get things going. I would suggest starting with #1318. I will fill in more details on that issue.

@pdogr I think ozghimire is taking the first step on this to move it forward, and it's hard to parallelize this initial step, but there will probably be work to help out on around the issues. You could take another DateTimeFormat issue to get onboarded. I'm sure there will be opportunities to help in the short term. #1581 would be a good bug to onboard with if you wanted to take it.

randomicon00 commented 2 years ago

Hello @gregtatum, are still looking for contributors?

bdarcus commented 1 year ago

Not sure if you're looking for feedback, but if there's way you could improve the user-friendliness of the config, that would be really helpful. I'm a Rust newbie, and was pretty confused on how to use this feature. So if I do something like this, I get errors about a long list of missing fields:

    let const DAYMONTH = components::Bag {
        year: Numeric,
        month: Short,
    }

Here's the equivalent in JS:

const date = new Date(Date.UTC(2012, 11, 20, 3, 0, 0));

const options1 = {
  month: "long",
  day: "numeric",
};

const options2 = {
  month: "short",
  day: "numeric",
};

const options3 = {
  month: "numeric",
  day: "numeric",
};

console.log("Option 1:")
console.log(date.toLocaleString("de-DE", options1));
console.log(date.toLocaleString("en-US", options1));
console.log(date.toLocaleString("es-PE", options1));
console.log("")
console.log("Option 2:")
console.log(date.toLocaleString("de-DE", options2));
console.log(date.toLocaleString("en-US", options2));
console.log(date.toLocaleString("es-PE", options2));
console.log("")
console.log("Option 3:")
console.log(date.toLocaleString("de-DE", options3));
console.log(date.toLocaleString("en-US", options3));
console.log(date.toLocaleString("es-PE", options3));

... which generates:

> "Option 1:"
> "19. Dezember"
> "December 19"
> "19 de diciembre"
> ""
> "Option 2:"
> "19. Dez."
> "Dec 19"
> "19 dic."
> ""
> "Option 3:"
> "19.12."
> "12/19"
> "19/12"
sffc commented 1 year ago

@bdarcus Until https://github.com/rust-lang/rust/issues/70564 lands, you need to create a mutable components bag and then set your fields on it.

let mut components_bag = components::Bag::default();
components_bag.year = components::Numeric::TwoDigit;
components_bag.month = components::Month::Short;
Manishearth commented 1 year ago

@sffc, @eggrobin, and I discussed @eggrobin's WIP skeleton work at https://github.com/unicode-org/icu4x/compare/main...eggrobin:icu4x:%CF%83%CE%BA%CE%B5%CE%BB%CE%B5%CF%84%CE%AC (https://github.com/unicode-org/icu4x/commit/716672be5bb7b6d3adf136d14fdeaf40c196b982 + previous commits)

The general plan moving forward is that skeleta will be represented using @eggrobin's design which essentially boils down to an enum for (day, time, date, datetime) + a length, plus additional timezone stuff. This makes for 12 possible time components, 9 day components, 8 non-day date components (so 17 total date components), and 12 × 17 combined DateTime components, with three lengths for each.

There is a fallback algorithm that CLDR uses, which is implemented in ICU4X as get_best_available_format_pattern. We move this to datagen and perform the simpler subset of the fallback algorithm that falls back between lengths. In other words, we always generate data for each of the 12/17 skeleta and use the fallback algorithm to find suitable replacement data when not present, but we do not necessarily generate data for each of the lengths.

For the data model, we can store Date/Time/DateTime as separate keys, with the first three having a data model of:

/// For Date/Time only, not datetime
struct PackedSkeletonData<'data> {
   pub indices: ZeroVec<'data, SkeletonDataIndex>, // len = 12 for time, 17 for date
   pub patterns: VarZeroVec<'data, PatternPluralsULE>,
}

// conceptually:
// {
//   has_long: bool,
//   has_medium: bool,
//   has_short: bool,
//   index: u16, // index of first pattern (long if present, else medium, else short)
// }
#[derive(ULE)]
struct SkeletonDataIndex(u16);

struct DateTimeSkeletons<'data> {
   // will typically be small, there are only a couple special cases like E B h m
   map: ZeroMap<'data, Skeleton, PatternPluralsULE>, 
}

For date or time lookup, based on the skeleton we index into the indices array and perform fallback on the available lengths in the metadata. The data is stored contiguously as [long?, medium?, short?] so we can calculate its index by offsetting from the base index, and then fetching.

For datetime lookup, we first index into the DateTimeSkeletons map, and if not present, we then go fetch the individual date and time data and glue them together using the glue from the datetime lengths data.

Manishearth commented 1 year ago

When we fix this we should also fix https://github.com/unicode-org/icu4x/issues/3762

sffc commented 1 year ago
Key:
[all have full]
- has_long
- has_medium
- has_short
[all have other]
- has_zero
- has_one
- has_two
- has_few
- has_many

Or another model:

[all have full]
- has_long
- has_medium
- has_short
- has_six_plurals

Or make a model that stores different sets of plurals in only 2 bits.

@sffc and @Manishearth to work on this after finishing neo symbols.

sffc commented 8 months ago

https://github.com/unicode-org/icu4x/issues/1317#issuecomment-1623963015 is a design for how to store skeletons in the data file, but it doesn't directly address the question of knowing ahead of time which names and name lengths to include.

With semantic skeleta, are any of the following invariants true (across all locales and calendars)?

  1. If the skeleton does not have Weekday, then the pattern does not have Weekday.
  2. If the skeleton has a short Weekday, then the pattern has a short Weekday.
  3. If the skeleton has a long Weekday, then the pattern has a long Weekday.
  4. If the skeleton does not have Month, then the pattern does not have Month.
  5. If the skeleton has a numeric Month, then the pattern has a numeric Month. (trick question! I know this one happens to be false in the Hebrew calendar)
  6. If the skeleton has a short spellout Month, then the pattern has a short spellout Month
  7. If the skeleton has a long spellout Month, then the patterh has a long spellout Month

And similar for Era, Day Period, and Time Zone.

Depending on which of these invariants work out, we should be able to have static analysis of a skeleton to produce an auto-sliced data bundle.

sffc commented 8 months ago

My understanding from @eggrobin on the above questions is:

  1. Including Day-of-Month could imply including Weekday, because they are both ways of representing specific dates
  2. Including Month does not imply including Weekday, because a weekday represents a specific date, not a month
  3. Can't make any guarantees about the width of the fields
sffc commented 4 months ago

Notes from this topic in the ICU4X-TC meeting on 2024-07-11:

https://docs.google.com/presentation/d/1qXxBv4DVnqfBSpGt9ikVQLk9M0LX65O9lWvDo0pH9SU/edit#slide=id.p

Statement seeking consensus:

  1. The ICU4X-TC approves the overall design of the ICU4X Rust code implementing semantic skeleta.
  2. If CLDR-TC approves semantic skeleta for LDML 46, the ICU4X-TC approves replacing the existing date time formatting classes with the new semantic skeleta formatting classes in the ICU4X 2.0 release.
sffc commented 3 months ago

Working on the combined date/time pattern overrides.

All classical skeleta across all locales and calendars in CLDR (as a comma-separated list):

Bh,Bhm,Bhms,d,E,EBhm,EBhms,Ed,Ehm,Ehm-alt-ascii,EHm,Ehms,Ehms-alt-ascii,EHms,Gy,GyMd,GyMMM,GyMMMd,GyMMMEd,h,h-alt-ascii,H,hm,hm-alt-ascii,Hm,hms,hms-alt-ascii,Hms,hmsv,hmsv-alt-ascii,Hmsv,hmv,hmv-alt-ascii,Hmv,M,Md,MEd,MMM,MMMd,MMMEd,MMMMd,MMMMW-count-one,MMMMW-count-other,ms,y,yM,yMd,yMEd,yMMM,yMMMd,yMMMEd,yMMMM,yQQQ,yQQQQ,yw-count-one,yw-count-other,MMMMEd,MMdd,MMMMW-count-zero,MMMMW-count-two,MMMMW-count-few,MMMMW-count-many,yMM,yw-count-zero,yw-count-two,yw-count-few,yw-count-many,GyMMMM,GyMMMMd,GyMMMMEd,MMMM,MMMMdd,yMMMMd,yMMMMEd,MMd,hmsvvvv,Hmsvvvv,hmvvvv,Hmvvvv,yQ,yMMdd,GyMMMEEEEd,MMMEEEEd,MMMMEEEEd,yMMMEEEEd,yMMMMEEEEd,Md-alt-variant,MEd-alt-variant,MMdd-alt-variant,yMd-alt-variant,yMEd-alt-variant,MMMdd,mmss,HHmmZ,yMMMMccccd,EEEEd,MEEEEd,yMEEEEd,HHmmss,Mdd,Hmm,HHmm,GyM,yyyy,yyyyM,yyyyMd,yyyyMEd,yyyyMMM,yyyyMMMd,yyyyMMMEd,yyyyMMMM,yyyyQQQ,yyyyQQQQ,yyyyMMMMd,yyyyMMMMEd,yyyyMM,yyyyMMMEEEEd,yyyyMMMMEEEEd,yyyyMd-alt-variant,yyyyMEd-alt-variant,yyyyMMMMccccd,yyyyMMdd,yyyyMEEEEd,GyMEEEEd,UM,UMd,UMMM,UMMMd,UMd-alt-variant,HmZ

Of those, the ones that span date/time/zone are:

EBhm EBhms Ehm EHm Ehms EHms hmsv Hmsv hmv Hmv hmsvvvv Hmsvvvv hmvvvv Hmvvvv HHmmZ HmZ

In order to keep a fixed set of auxiliary keys known at compile time, I will proceed to hard-code and generate those sets. I will store them only when they differ from the glue-based pattern. I will also open an issue to add a test that will inform us if any more such skeleta get added in the future.

sffc commented 2 months ago

I'm going to delete the benches for the old code soon, but here is a snapshot of how they currently compare, running on the same data but with different APIs:

datetime/zoned_datetime_overview
                        time:   [32.527 µs 32.563 µs 32.616 µs]
Found 7 outliers among 100 measurements (7.00%)
  5 (5.00%) high mild
  2 (2.00%) high severe

datetime/neoneo/datetime_zoned_datetime_overview
                        time:   [32.503 µs 32.565 µs 32.635 µs]
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

datetime/datetime_components
                        time:   [365.04 µs 365.44 µs 365.89 µs]
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high severe

datetime/neoneo/datetime_components
                        time:   [322.48 µs 323.54 µs 325.27 µs]
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) high mild
  3 (3.00%) high severe

datetime/datetime_lengths
                        time:   [36.500 µs 36.572 µs 36.669 µs]
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high severe

datetime/neoneo/datetime_lengths
                        time:   [38.260 µs 38.286 µs 38.312 µs]

In words, peformance on lengths and lengths_with_zones is basically identical, and performance on components is improved from 365 to 323 (a 12% improvement).

The primary metrics that my work has been focused on are memory use and binary/data size, so it's nice to see that we could achieve those while also getting a small improvement on benchmark performance as well.

sffc commented 2 months ago

Some quick binary size analysis after migrating tui to neo datetime:

Compilation: cargo +nightly-2024-07-23 build -Z build-std=std,panic_abort -Z build-std-features=panic_immediate_abort --target x86_64-unknown-linux-gnu --profile=release-opt-size --examples --workspace --features serde --exclude icu_provider_export --exclude icu_capi

Binary size breakdown of tui (including data):

1.x DateTime Neo DateTime
Total Binary Size 4,661,520 1,598,032
Code Size 104,401 92,683
Data Size (.rodata) 2,933,616 1,199,840
Data Size (.data.rel.ro) 1,030,024 146,344

How I calculated this: I opened both binaries in IDA Pro and measured the length of the segments.

What is .rodata vs .data.rel.ro: I gather that .rodata are the pure read-only binary data such as static strings (and I see some human-readable strings in that section), whereas .data.rel.ro are the stack frames of static variables, i.e. the baked data structs.

Why the numbers don't add up: the ELF contains some sections which are neither pure code nor pure data. The biggest section not included in the numbers above is used for Relocation.

The reduced data is because previously the binary was including data it didn't need for formatting, such as weekday names and certain types of time zone names, and because the new pattern layout is more efficient. I'm also pleased to see the reduced code size, which is probably due to less branching overall.

sffc commented 2 months ago

Discussion with @robertbastian (mostly obsolete given #5538):

Suggestion for 2.0 FFI:

enum DateFieldSets { }
enum CalendarPeriodFieldSets { }
enum TimeFieldSets { }
enum ZoneFieldSets { }

struct CompositeFieldSet {
    date: Option<DateFieldSets>,
    // ...
}

struct DateTimeOptionsV1 {
    length: Option<Length>,
    year_style: Option<YearStyle>,
    fractional_second_digits: Option<FractionalSecondDigits>,
    // ...
}

// ALL NEW!
opaque DateTimeFormatter {
    pub fn create(&Locale, CompositeFieldSet, DateTimeOptionsV1) -> Result<>
}

// ALL NEW!
opaque GregorianDateTimeFormatter {
    pub fn create(&Locale, CompositeFieldSet, DateTimeOptionsV1) -> Result<>
}

opaque DateFormatter { ... } // icu::datetime::Formatter<YMD>

opaque GregorianDateFormatter { ... } // icu::datetime::Formatter<YMD>

opaque TimeFormatter { ... } // icu::datetime::Formatter<TimeComponents>

Types removed:

Note: I want to add more field-set-specific types, but @Manishearth suggests doing this after we have better namespacing post-2.0, and I agree.

sffc commented 1 month ago

How should I organize the code in the repo? I know this is banal, but I need to do something. What we currently have is a bit of a mess, and we need to move things anyway as part of the big rename, so I may as well move them into good places.

Currently, inside components/datetime/src:

Path Description Visibility
fields/*.rs Field and its related types Public
format/datetime.rs Core formatting logic Private
format/neo.rs DateTimeNames, DateTimePatternFormatter Private with public re-exported types
options/*.rs 1.x formatter options Datagen-only except for the HourCycle type, which we could replace with the one from icu_preferences
pattern/*.rs Internal pattern types and logic doc(hidden)
provider/calendar/*.rs 1.x data structs Mix of public and datagen
provider/neo/*.rs 2.x data structs Public
provider/packed_pattern.rs Packed pattern data struct, recently landed Public
provider/time_zones.rs Time zone data structs Public
raw/neo.rs DateTimeZonePatternSelectionData and its related types Private
skeleton/*.rs Classical skeleton code Datagen
calendar.rs CldrCalendar and related traits and trait impls Private with public re-exported types
error.rs MismatchedCalendarError Public
external_loaders.rs Fixed decimal formatter and calendar data loader helpers Private
helpers.rs size_test macro Private
input.rs ExtractedInput Private
lib.rs Nothing defined, only re-exports Public
neo_marker.rs Declarations and definitions of 2.x traits; field set markers Public
neo_pattern.rs DateTimePattern Public
neo_serde.rs Serde impls for 2.x things Private
neo_skeleton.rs Enums and structs for semantic skeleta Public
neo.rs Main 2.x formatter types Public
time_zone.rs Time zone formatting Private
tz_registry.rs Time zone format registry: mapping between semantic time zones, resolved time zones, and field-based time zones Private

All re-exports are from the root unless otherwise specified.

What I think I want to move:

Destination File Name Stuff to move inside Visibility
names.rs DateTimeNames Private with public re-exported types
dt_pattern.rs DateTimePattern, DateTimePatternFormatter Private with public re-exported types
raw.rs What is currently raw/neo.rs Private
error.rs All error enums throughout the crate Public/private as needed
fieldset.rs Field set markers Public
scaffolding/*.rs 2.x formatting traits. Also move CldrCalendar and friends in here Public, but nothing in this module should show up in normal usage of the API
skeleton_impl/*.rs Rename of skeleton/*.rs Datagen
skeleton.rs Enums and structs for semantic skeleta Public
formatter.rs Main 2.x formatter types Private with public re-exported types

Note: I want everything to have exactly 1 place where it is exported.

Thoughts/approval? @Manishearth @robertbastian

Manishearth commented 1 month ago

This seems fine! I think we should be organized but it's flexible and we don't have to get it perfect right now. Something vaguely sensible is enough for me!

sffc commented 1 month ago

Type naming discussion:

Points brought up:

Discussion:

Conclusion:

Agreed: @sffc @manishearth @robertbastian

sffc commented 3 weeks ago

Review with Zibi:

fieldset!([year, month, day])::medium() => YMD::medium()

DateTimePattern verbiage:

Original: Most clients should use DateTimeFormatter instead of directly formatting with patterns.

[DateTimePattern] forgoes most internationalization functionality of the datetime crate. It assumes that the pattern is already localized for the customer's locale. Most clients should use [DateTimeFormatter] instead of directly formatting with patterns.

Type exports:

icu::datetime::pattern::DateTimeNames
icu::datetime::pattern::DateTimePattern
icu::datetime::pattern::DateTimePatternFormatter

On the filesystem:

Errors:

Manishearth commented 3 weeks ago

fieldset!([year, month, day])::medium() => YMD::medium()

Let's not use macros in type position, they don't work in every possible type position and this gets annoying quickly. I think it's fine to provide such a macro but having a fallback is good.

sffc commented 3 weeks ago

Notes from brief discussion with @Manishearth:

The user-facing field set related types can be put into 4 buckets

  1. Compile-time field sets such as struct YMD { options }
  2. Runtime field sets such as enum DateFieldSet { YMD, ... }
  3. Runtime skeletons such as struct DateSkeleton { field_set: DateFieldSet, options }
  4. The options themselves, such as Alignment or FractionalSecondDigits

Where should these all go?

My original idea (Option 1):

  1. icu::datetime::fieldset::YMD
  2. icu::datetime::skeleton::DateFieldSet
  3. icu::datetime::skeleton::DateSkeleton
  4. icu::datetime::skeleton::Alignment

One that @Manishearth suggested (Option 2):

  1. icu::datetime::fieldset::YMD
  2. icu::datetime::fieldset::runtime::DateFieldSet
  3. icu::datetime::skeleton::DateSkeleton
  4. icu::datetime::options::Alignment

Here's another one that might be good (Option 3):

  1. icu::datetime::fieldset::YMD
  2. icu::datetime::fieldset::DateFieldSet
  3. icu::datetime::options::DateFieldSetWithOptions
  4. icu::datetime::options::Alignment

I'm not sure about options::DateFieldSetWithOptions. I could still put it at skeleton::DateSkeleton. But skeleton sounds more important, but in the ICU4X world, it's this thing that most people shouldn't generally always be using.

It also occurs to me that skeleton and scaffold are kind-of similar words, but they mean different things.

Manishearth commented 3 weeks ago

To add some points, personally I think the ideal situation is that the fieldsets, runtime fieldsets, skeleta are all in their own modules, containing nothing but those types, combiner types (Combo), and potentially other modules. Exactly how that is achieved can be done in multiple ways, with fieldset and fieldset::runtime or fieldset and fieldset_runtime, and with skeletons being a submodule of fieldsets or options or something. No strong opinion there.

My vision is that each of these (except for options) can have strong documentation about the usage of these things that the rest of the crate can link to.

Manishearth commented 3 weeks ago

@sffc's current thinking:

@Manishearth But then we have two dimensions in the same module:

(Date, Time, DateTime, ZonedDateTime)(Skeleton, Fieldset)

@sffc We could reduce it to one type, like this:

// mod fieldset_dynamic
enum DateFieldSet {
    YMD(fieldset::YMD),
    MD(fieldset::MD),
}

@Manishearth But currently it's nice to write:

let skeleton = NeoDateSkeleton {YearMonthDay, YearStyle::whatever};

and this gets a bit more complicated on construction? depends on what the user patterns would be like.

Proposal:

LGTM: @sffc @Manishearth

sffc commented 2 weeks ago

I made a proposal to switch around how time fields are handled:

https://docs.google.com/document/d/1SkxoitlCFiQ_KGW3dmRk7lbumGd_N1rfYJLMlkiXvh4/edit?tab=t.0

I started implementing this in ICU4X. My idea:

  1. Change the API to reflect the new enum
  2. Change the time data payloads to contain 3 patterns, variants of Hour, Hour+Minute, and Hour+Minute+Second using the same mechanism that we have for distinguishing the three year styles
  3. Reduce down to 3 time data payloads: default hour cycle, h12, and h24
  4. Reduce down to a single time field, but otherwise keep the current traits working as they are

Then, I will update #5761 to switch around the dynamic field sets as discussed previously.

One small caveat: I realized that overlap patterns already use the variants for year style. However, none of our overlap patterns currently contain the year field, so I'm currently adding a debug assertion and changing the variants there to be for time precision instead.

Does this sound okay @Manishearth?

Manishearth commented 2 weeks ago

:+1: