Open gregtatum opened 3 years ago
@gregtatum will provide mentorship.
I'm spreading the word about this issue looking for candidates.
More details:
Description
: Currently, DateTimeFormat has two ways to select the right format, both of them are imperfect. We believe we have a balanced novel solution that, once implemented, will become the foundational use of the DateTimeFormat.
Scope
: We believe that the initial implementation should take one person several (2-3) months to implement. Hopefully in time for ICU4X 1.0.
Mentorship
: This project is well staffed on the mentorship side with @gregtatum from Mozilla, @sffc from Google and @zbraniecki from Amazon willing to invest time to mentor the engineer who'll pick it.
How to start
: If you are interested in the project, comment in this issue or join unicode-org.slack.com #icu4x and we'll get you on-ramped.
I'm interested to work on this issue.
@gregtatum are you still open to mentor?
If this issue is still open, I'm definitely interested to work on this.
@ozghimire Great! How would you prefer to get started? There is a document linked above outlining the strategy which should discuss how to get things going. I would suggest starting with #1318. I will fill in more details on that issue.
@pdogr I think ozghimire is taking the first step on this to move it forward, and it's hard to parallelize this initial step, but there will probably be work to help out on around the issues. You could take another DateTimeFormat issue to get onboarded. I'm sure there will be opportunities to help in the short term. #1581 would be a good bug to onboard with if you wanted to take it.
Hello @gregtatum, are still looking for contributors?
Not sure if you're looking for feedback, but if there's way you could improve the user-friendliness of the config, that would be really helpful. I'm a Rust newbie, and was pretty confused on how to use this feature. So if I do something like this, I get errors about a long list of missing fields:
let const DAYMONTH = components::Bag {
year: Numeric,
month: Short,
}
Here's the equivalent in JS:
const date = new Date(Date.UTC(2012, 11, 20, 3, 0, 0));
const options1 = {
month: "long",
day: "numeric",
};
const options2 = {
month: "short",
day: "numeric",
};
const options3 = {
month: "numeric",
day: "numeric",
};
console.log("Option 1:")
console.log(date.toLocaleString("de-DE", options1));
console.log(date.toLocaleString("en-US", options1));
console.log(date.toLocaleString("es-PE", options1));
console.log("")
console.log("Option 2:")
console.log(date.toLocaleString("de-DE", options2));
console.log(date.toLocaleString("en-US", options2));
console.log(date.toLocaleString("es-PE", options2));
console.log("")
console.log("Option 3:")
console.log(date.toLocaleString("de-DE", options3));
console.log(date.toLocaleString("en-US", options3));
console.log(date.toLocaleString("es-PE", options3));
... which generates:
> "Option 1:"
> "19. Dezember"
> "December 19"
> "19 de diciembre"
> ""
> "Option 2:"
> "19. Dez."
> "Dec 19"
> "19 dic."
> ""
> "Option 3:"
> "19.12."
> "12/19"
> "19/12"
@bdarcus Until https://github.com/rust-lang/rust/issues/70564 lands, you need to create a mutable components bag and then set your fields on it.
let mut components_bag = components::Bag::default();
components_bag.year = components::Numeric::TwoDigit;
components_bag.month = components::Month::Short;
@sffc, @eggrobin, and I discussed @eggrobin's WIP skeleton work at https://github.com/unicode-org/icu4x/compare/main...eggrobin:icu4x:%CF%83%CE%BA%CE%B5%CE%BB%CE%B5%CF%84%CE%AC (https://github.com/unicode-org/icu4x/commit/716672be5bb7b6d3adf136d14fdeaf40c196b982 + previous commits)
The general plan moving forward is that skeleta will be represented using @eggrobin's design which essentially boils down to an enum for (day, time, date, datetime) + a length, plus additional timezone stuff. This makes for 12 possible time components, 9 day components, 8 non-day date components (so 17 total date components), and 12 × 17 combined DateTime components, with three lengths for each.
There is a fallback algorithm that CLDR uses, which is implemented in ICU4X as get_best_available_format_pattern
. We move this to datagen and perform the simpler subset of the fallback algorithm that falls back between lengths. In other words, we always generate data for each of the 12/17 skeleta and use the fallback algorithm to find suitable replacement data when not present, but we do not necessarily generate data for each of the lengths.
For the data model, we can store Date/Time/DateTime as separate keys, with the first three having a data model of:
/// For Date/Time only, not datetime
struct PackedSkeletonData<'data> {
pub indices: ZeroVec<'data, SkeletonDataIndex>, // len = 12 for time, 17 for date
pub patterns: VarZeroVec<'data, PatternPluralsULE>,
}
// conceptually:
// {
// has_long: bool,
// has_medium: bool,
// has_short: bool,
// index: u16, // index of first pattern (long if present, else medium, else short)
// }
#[derive(ULE)]
struct SkeletonDataIndex(u16);
struct DateTimeSkeletons<'data> {
// will typically be small, there are only a couple special cases like E B h m
map: ZeroMap<'data, Skeleton, PatternPluralsULE>,
}
For date or time lookup, based on the skeleton we index into the indices
array and perform fallback on the available lengths in the metadata. The data is stored contiguously as [long?, medium?, short?]
so we can calculate its index by offsetting from the base index, and then fetching.
For datetime lookup, we first index into the DateTimeSkeletons
map, and if not present, we then go fetch the individual date and time data and glue them together using the glue from the datetime lengths data.
When we fix this we should also fix https://github.com/unicode-org/icu4x/issues/3762
[12][K][V][K2][V2]
Key:
[all have full]
- has_long
- has_medium
- has_short
[all have other]
- has_zero
- has_one
- has_two
- has_few
- has_many
Or another model:
[all have full]
- has_long
- has_medium
- has_short
- has_six_plurals
Or make a model that stores different sets of plurals in only 2 bits.
@sffc and @Manishearth to work on this after finishing neo symbols.
https://github.com/unicode-org/icu4x/issues/1317#issuecomment-1623963015 is a design for how to store skeletons in the data file, but it doesn't directly address the question of knowing ahead of time which names and name lengths to include.
With semantic skeleta, are any of the following invariants true (across all locales and calendars)?
And similar for Era, Day Period, and Time Zone.
Depending on which of these invariants work out, we should be able to have static analysis of a skeleton to produce an auto-sliced data bundle.
My understanding from @eggrobin on the above questions is:
Notes from this topic in the ICU4X-TC meeting on 2024-07-11:
https://docs.google.com/presentation/d/1qXxBv4DVnqfBSpGt9ikVQLk9M0LX65O9lWvDo0pH9SU/edit#slide=id.p
LLLL
, and NeoFormatter will get that. It also defines standalone weekday (maybe c?).Short
says: A short date; typically numeric, as in “1/1/2000”Statement seeking consensus:
Working on the combined date/time pattern overrides.
All classical skeleta across all locales and calendars in CLDR (as a comma-separated list):
Bh,Bhm,Bhms,d,E,EBhm,EBhms,Ed,Ehm,Ehm-alt-ascii,EHm,Ehms,Ehms-alt-ascii,EHms,Gy,GyMd,GyMMM,GyMMMd,GyMMMEd,h,h-alt-ascii,H,hm,hm-alt-ascii,Hm,hms,hms-alt-ascii,Hms,hmsv,hmsv-alt-ascii,Hmsv,hmv,hmv-alt-ascii,Hmv,M,Md,MEd,MMM,MMMd,MMMEd,MMMMd,MMMMW-count-one,MMMMW-count-other,ms,y,yM,yMd,yMEd,yMMM,yMMMd,yMMMEd,yMMMM,yQQQ,yQQQQ,yw-count-one,yw-count-other,MMMMEd,MMdd,MMMMW-count-zero,MMMMW-count-two,MMMMW-count-few,MMMMW-count-many,yMM,yw-count-zero,yw-count-two,yw-count-few,yw-count-many,GyMMMM,GyMMMMd,GyMMMMEd,MMMM,MMMMdd,yMMMMd,yMMMMEd,MMd,hmsvvvv,Hmsvvvv,hmvvvv,Hmvvvv,yQ,yMMdd,GyMMMEEEEd,MMMEEEEd,MMMMEEEEd,yMMMEEEEd,yMMMMEEEEd,Md-alt-variant,MEd-alt-variant,MMdd-alt-variant,yMd-alt-variant,yMEd-alt-variant,MMMdd,mmss,HHmmZ,yMMMMccccd,EEEEd,MEEEEd,yMEEEEd,HHmmss,Mdd,Hmm,HHmm,GyM,yyyy,yyyyM,yyyyMd,yyyyMEd,yyyyMMM,yyyyMMMd,yyyyMMMEd,yyyyMMMM,yyyyQQQ,yyyyQQQQ,yyyyMMMMd,yyyyMMMMEd,yyyyMM,yyyyMMMEEEEd,yyyyMMMMEEEEd,yyyyMd-alt-variant,yyyyMEd-alt-variant,yyyyMMMMccccd,yyyyMMdd,yyyyMEEEEd,GyMEEEEd,UM,UMd,UMMM,UMMMd,UMd-alt-variant,HmZ
Of those, the ones that span date/time/zone are:
EBhm EBhms Ehm EHm Ehms EHms hmsv Hmsv hmv Hmv hmsvvvv Hmsvvvv hmvvvv Hmvvvv HHmmZ HmZ
In order to keep a fixed set of auxiliary keys known at compile time, I will proceed to hard-code and generate those sets. I will store them only when they differ from the glue-based pattern. I will also open an issue to add a test that will inform us if any more such skeleta get added in the future.
I'm going to delete the benches for the old code soon, but here is a snapshot of how they currently compare, running on the same data but with different APIs:
datetime/zoned_datetime_overview
time: [32.527 µs 32.563 µs 32.616 µs]
Found 7 outliers among 100 measurements (7.00%)
5 (5.00%) high mild
2 (2.00%) high severe
datetime/neoneo/datetime_zoned_datetime_overview
time: [32.503 µs 32.565 µs 32.635 µs]
Found 3 outliers among 100 measurements (3.00%)
3 (3.00%) high mild
datetime/datetime_components
time: [365.04 µs 365.44 µs 365.89 µs]
Found 2 outliers among 100 measurements (2.00%)
2 (2.00%) high severe
datetime/neoneo/datetime_components
time: [322.48 µs 323.54 µs 325.27 µs]
Found 7 outliers among 100 measurements (7.00%)
4 (4.00%) high mild
3 (3.00%) high severe
datetime/datetime_lengths
time: [36.500 µs 36.572 µs 36.669 µs]
Found 3 outliers among 100 measurements (3.00%)
3 (3.00%) high severe
datetime/neoneo/datetime_lengths
time: [38.260 µs 38.286 µs 38.312 µs]
In words, peformance on lengths
and lengths_with_zones
is basically identical, and performance on components
is improved from 365 to 323 (a 12% improvement).
The primary metrics that my work has been focused on are memory use and binary/data size, so it's nice to see that we could achieve those while also getting a small improvement on benchmark performance as well.
Some quick binary size analysis after migrating tui
to neo datetime:
Compilation: cargo +nightly-2024-07-23 build -Z build-std=std,panic_abort -Z build-std-features=panic_immediate_abort --target x86_64-unknown-linux-gnu --profile=release-opt-size --examples --workspace --features serde --exclude icu_provider_export --exclude icu_capi
Binary size breakdown of tui
(including data):
1.x DateTime | Neo DateTime | |
---|---|---|
Total Binary Size | 4,661,520 | 1,598,032 |
Code Size | 104,401 | 92,683 |
Data Size (.rodata) | 2,933,616 | 1,199,840 |
Data Size (.data.rel.ro) | 1,030,024 | 146,344 |
How I calculated this: I opened both binaries in IDA Pro and measured the length of the segments.
What is .rodata vs .data.rel.ro: I gather that .rodata
are the pure read-only binary data such as static strings (and I see some human-readable strings in that section), whereas .data.rel.ro
are the stack frames of static variables, i.e. the baked data structs.
Why the numbers don't add up: the ELF contains some sections which are neither pure code nor pure data. The biggest section not included in the numbers above is used for Relocation.
The reduced data is because previously the binary was including data it didn't need for formatting, such as weekday names and certain types of time zone names, and because the new pattern layout is more efficient. I'm also pleased to see the reduced code size, which is probably due to less branching overall.
Discussion with @robertbastian (mostly obsolete given #5538):
Suggestion for 2.0 FFI:
enum DateFieldSets { }
enum CalendarPeriodFieldSets { }
enum TimeFieldSets { }
enum ZoneFieldSets { }
struct CompositeFieldSet {
date: Option<DateFieldSets>,
// ...
}
struct DateTimeOptionsV1 {
length: Option<Length>,
year_style: Option<YearStyle>,
fractional_second_digits: Option<FractionalSecondDigits>,
// ...
}
// ALL NEW!
opaque DateTimeFormatter {
pub fn create(&Locale, CompositeFieldSet, DateTimeOptionsV1) -> Result<>
}
// ALL NEW!
opaque GregorianDateTimeFormatter {
pub fn create(&Locale, CompositeFieldSet, DateTimeOptionsV1) -> Result<>
}
opaque DateFormatter { ... } // icu::datetime::Formatter<YMD>
opaque GregorianDateFormatter { ... } // icu::datetime::Formatter<YMD>
opaque TimeFormatter { ... } // icu::datetime::Formatter<TimeComponents>
Types removed:
Note: I want to add more field-set-specific types, but @Manishearth suggests doing this after we have better namespacing post-2.0, and I agree.
How should I organize the code in the repo? I know this is banal, but I need to do something. What we currently have is a bit of a mess, and we need to move things anyway as part of the big rename, so I may as well move them into good places.
Currently, inside components/datetime/src:
Path | Description | Visibility |
---|---|---|
fields/*.rs | Field and its related types | Public |
format/datetime.rs | Core formatting logic | Private |
format/neo.rs | DateTimeNames, DateTimePatternFormatter | Private with public re-exported types |
options/*.rs | 1.x formatter options | Datagen-only except for the HourCycle type, which we could replace with the one from icu_preferences |
pattern/*.rs | Internal pattern types and logic | doc(hidden) |
provider/calendar/*.rs | 1.x data structs | Mix of public and datagen |
provider/neo/*.rs | 2.x data structs | Public |
provider/packed_pattern.rs | Packed pattern data struct, recently landed | Public |
provider/time_zones.rs | Time zone data structs | Public |
raw/neo.rs | DateTimeZonePatternSelectionData and its related types | Private |
skeleton/*.rs | Classical skeleton code | Datagen |
calendar.rs | CldrCalendar and related traits and trait impls | Private with public re-exported types |
error.rs | MismatchedCalendarError | Public |
external_loaders.rs | Fixed decimal formatter and calendar data loader helpers | Private |
helpers.rs | size_test macro | Private |
input.rs | ExtractedInput | Private |
lib.rs | Nothing defined, only re-exports | Public |
neo_marker.rs | Declarations and definitions of 2.x traits; field set markers | Public |
neo_pattern.rs | DateTimePattern | Public |
neo_serde.rs | Serde impls for 2.x things | Private |
neo_skeleton.rs | Enums and structs for semantic skeleta | Public |
neo.rs | Main 2.x formatter types | Public |
time_zone.rs | Time zone formatting | Private |
tz_registry.rs | Time zone format registry: mapping between semantic time zones, resolved time zones, and field-based time zones | Private |
All re-exports are from the root unless otherwise specified.
What I think I want to move:
Destination File Name | Stuff to move inside | Visibility |
---|---|---|
names.rs | DateTimeNames | Private with public re-exported types |
dt_pattern.rs | DateTimePattern, DateTimePatternFormatter | Private with public re-exported types |
raw.rs | What is currently raw/neo.rs | Private |
error.rs | All error enums throughout the crate | Public/private as needed |
fieldset.rs | Field set markers | Public |
scaffolding/*.rs | 2.x formatting traits. Also move CldrCalendar and friends in here | Public, but nothing in this module should show up in normal usage of the API |
skeleton_impl/*.rs | Rename of skeleton/*.rs | Datagen |
skeleton.rs | Enums and structs for semantic skeleta | Public |
formatter.rs | Main 2.x formatter types | Private with public re-exported types |
Note: I want everything to have exactly 1 place where it is exported.
Thoughts/approval? @Manishearth @robertbastian
This seems fine! I think we should be organized but it's flexible and we don't have to get it perfect right now. Something vaguely sensible is enough for me!
Type naming discussion:
Points brought up:
Formatter
feels too generic and we have man yformattersFieldSetFormatter
and SkeletonFormatter
are inside baseballyDateTimeFormatter
is fine as long as we never add a set of aliases like DateTimeFormatter, TimeFormatter, DateFormatter, etc for runtime fieldsetsDiscussion:
YearMonthFormatter
. I'm okay committing to no convenient aliases for the runtime ones.Conclusion:
Agreed: @sffc @manishearth @robertbastian
Review with Zibi:
fieldset!([year, month, day])::medium()
=> YMD::medium()
DateTimePattern verbiage:
Original: Most clients should use DateTimeFormatter instead of directly formatting with patterns.
[DateTimePattern
] forgoes most internationalization functionality of the datetime crate. It assumes that the pattern is already localized for the customer's locale. Most clients should use [DateTimeFormatter
] instead of directly formatting with patterns.
Type exports:
icu::datetime::pattern::DateTimeNames
icu::datetime::pattern::DateTimePattern
icu::datetime::pattern::DateTimePatternFormatter
On the filesystem:
names.rs
?pattern/mod.rs
?icu::datetime::private::pattern::Pattern
?icu::datetime::private::pattern::ReferencePattern
?Errors:
fieldset!([year, month, day])::medium()
=>YMD::medium()
Let's not use macros in type position, they don't work in every possible type position and this gets annoying quickly. I think it's fine to provide such a macro but having a fallback is good.
Notes from brief discussion with @Manishearth:
The user-facing field set related types can be put into 4 buckets
struct YMD { options }
enum DateFieldSet { YMD, ... }
struct DateSkeleton { field_set: DateFieldSet, options }
Alignment
or FractionalSecondDigits
Where should these all go?
My original idea (Option 1):
icu::datetime::fieldset::YMD
icu::datetime::skeleton::DateFieldSet
icu::datetime::skeleton::DateSkeleton
icu::datetime::skeleton::Alignment
One that @Manishearth suggested (Option 2):
icu::datetime::fieldset::YMD
icu::datetime::fieldset::runtime::DateFieldSet
icu::datetime::skeleton::DateSkeleton
icu::datetime::options::Alignment
Here's another one that might be good (Option 3):
icu::datetime::fieldset::YMD
icu::datetime::fieldset::DateFieldSet
icu::datetime::options::DateFieldSetWithOptions
icu::datetime::options::Alignment
I'm not sure about options::DateFieldSetWithOptions
. I could still put it at skeleton::DateSkeleton
. But skeleton
sounds more important, but in the ICU4X world, it's this thing that most people shouldn't generally always be using.
It also occurs to me that skeleton
and scaffold
are kind-of similar words, but they mean different things.
To add some points, personally I think the ideal situation is that the fieldsets, runtime fieldsets, skeleta are all in their own modules, containing nothing but those types, combiner types (Combo
), and potentially other modules. Exactly how that is achieved can be done in multiple ways, with fieldset
and fieldset::runtime
or fieldset
and fieldset_runtime
, and with skeletons being a submodule of fieldsets or options or something. No strong opinion there.
My vision is that each of these (except for options
) can have strong documentation about the usage of these things that the rest of the crate can link to.
FooFieldset
, it's clear that "I just should look at TimeFieldset and once I understand that then I will understand that DateFieldset". However if a module has DateFieldset, TimeFieldset, ..., and DateSkeleton, TimeSkeleton, .... then it becomes unclear how to slice things: should you look at one of the Date*
types and then one of the Time*
types (and so on) or should you look at one of the *Skeleton
types and one of the *Fieldset
types.RuntimeDateFieldset
, etc.neo_skeleton
module should come down to 2 types: skeleton structs, and the enums to represent the fieldsets.@sffc's current thinking:
datetime::fieldset::YMD
datetime::fieldset_dynamic::DateFieldSet
datetime::fieldset_dynamic::DateSkeleton
datetime::options::Alignment
@Manishearth But then we have two dimensions in the same module:
(Date, Time, DateTime, ZonedDateTime)(Skeleton, Fieldset)
@sffc We could reduce it to one type, like this:
// mod fieldset_dynamic
enum DateFieldSet {
YMD(fieldset::YMD),
MD(fieldset::MD),
}
@Manishearth But currently it's nice to write:
let skeleton = NeoDateSkeleton {YearMonthDay, YearStyle::whatever};
and this gets a bit more complicated on construction? depends on what the user patterns would be like.
Proposal:
datetime::fieldset::YMD
datetime::fieldset::dynamic::DateFieldSet
YMD(fieldset::YMD), etc
datetime::options::Alignment
datetime::fieldset::dynamic::builder::DateFieldSetBuilder
LGTM: @sffc @Manishearth
I made a proposal to switch around how time fields are handled:
https://docs.google.com/document/d/1SkxoitlCFiQ_KGW3dmRk7lbumGd_N1rfYJLMlkiXvh4/edit?tab=t.0
I started implementing this in ICU4X. My idea:
Hour
, Hour+Minute
, and Hour+Minute+Second
using the same mechanism that we have for distinguishing the three year stylesThen, I will update #5761 to switch around the dynamic field sets as discussed previously.
One small caveat: I realized that overlap patterns already use the variants for year style. However, none of our overlap patterns currently contain the year field, so I'm currently adding a debug assertion and changing the variants there to be for time precision instead.
Does this sound okay @Manishearth?
:+1:
This is a meta issue to track implementing the "ideal components bag" as laid out in the DateTimeFormat Deep Dive 2021-10-01 design document. Originally there was some discussion to have this replace the current components bag, but it is to be implemented alongside the existing components bag. A better name can be bikeshed if needed.
The following need to be completed.