unicode-org / icu4x

Solving i18n for client-side and resource-constrained environments.
https://icu4x.unicode.org

Finalize input data model for DateTimeFormat #355

Closed sffc closed 3 years ago

sffc commented 3 years ago

In datetime-input.md, I put forth a proposal for the trait used as input into DateTimeFormat. This issue is to track follow-up discussions and check in the result.

Task list:

sffc commented 3 years ago

CC @mihnita @macchiati @markusicu @younies

sffc commented 3 years ago

Notes from design discussion on this subject with @mihnita @macchiati @markusicu @younies and others:


Agreement on high-level concept (separation of concerns)

Mark: Did you cross-check these against the LDML spec? And against the ICU calendar?

Shane: I referenced LDML.

Mark: We always assume hour/minute/second for all countries. We could stick with milliseconds_in_day.

Mark: I would go with one field per format symbol.

Mihai: On the time, the milliseconds_in_day field is a misnomer.

Shane: Yeah; I got that name from LDML. But "millisecond_of_day" would be better.

Mihai: I would comment that, in general, these fields are all convertible with one another.

Decide on approach for era name variants (AD vs CE)

Markus: It makes more sense to me for variant selection to go elsewhere. Don't overload the structure.

Mark: We were thinking of putting this in the display context. I think that works better than having the calendar system pick it. It's a formatting issue, not a calendaring issue.

Markus: In Hebrew, we have a choice between printing digits, or having Hebrew do a spellout. That is a display option. I would expect to have a day of the month as a number in the struct, and have the display option live elsewhere.

Mihai: I think AD versus CE belongs in the same bucket as numbering system choice.

Finalize data model for months: merge month ID with month number or keep them separate

Markus: How does this work in ICU?

Mark: CLDR mixes the numbers and the identifiers and does resolution separately.

Markus: Shane is proposing to put the decision in the calendar layer, and keep the display layer as a lookup.

Mark: Yeah, I would go with a month number and an identifier.

Shane: It sounds like we have agreement on two separate fields for month number and month name.

String identifier vs. numeric identifier for month display names

Mark: This seems excessive. For weekdays, we just look it up according to the language. For month names, we're expanding it to cover every calendar system, when you could cover it with 13 digits plus a separate field for the calendar system.

Markus: It could work. It doesn't feel quite right. When we build up our structures, we have 12 or 13 month names. For month names, we always have a complete array.

Mihai: You could have an array with indices. I feel uneasy seeing calendar-specific keys in the data structure.

Mark: It seems simpler to have number indices.

Shane: You say 12 or 13 months, but Hebrew has 14 months, even though only 13 of them can occur in each year. Chinese has 24 months, 12 normal and 12 leap, even though a year has no more than 13 of them at a time. My proposal is to put the display names of months into a global namespace.

Markus: It feels weird, but I think it could work. I can't put my thumb on why it wouldn't work. Maybe we can try it and see how it works.

Markus: For eras, in the Japanese calendar, going back in time is murky. You should pick an era as era 0 and count forward and backward from there.

Mihai: For Japanese, if you use numbers, for new eras, you just add another number, which is easier than inventing a new string. Especially since you don't know the name of the era until a few weeks in advance.

Mark: The advantage of an index is that for most calendar systems, the name and the month number correspond to each other. The problem is that in certain calendar systems, the two are not correlated.

Mark: You could do "m01", "m02", … having a string "foobar" doesn't help me know that it's the 7th month in the calendar system.

Investigate day period and decide whether it should be computed or inputted

Mark: I think the day period should be decided by the calendar system. But, it tends to be a language/region computation rather than a calendar system calculation. So you could put it either place.

Mihai: To me it feels like a locale-specific thing.

Mark: The eras are closely linked to the calendar system. But the day periods are dissociated. If I were to do anything on the display side, it would be the day period stuff.

Mihai: When you go from the calendar to month names, you go from something locale-independent and then you make it locale-dependent. Going from month 6 to a string is a matter of translation. But deciding what is "afternoon" is locale-specific.

Shane: What data is required to figure out the day period?

Mark: We make an approximation. Seconds in day is sufficient for the calculation. In theory, we should go beyond that, and look not only at your locale, but also your location, because in a lot of places, evening starts at sunset. In China, sunset is at a very different time depending on your latitude and longitude.

Mihai: There are two layers: the calendar layer, and the locale layer. Some decisions are made on the calendar layer, and other decisions are made on the locale layer.

Shane: I would go further and say there are 3 layers: the calendar layer, the localization layer, and the rendering layer. The localization layer could include both day period resolution and era name selection. It could be swapped out for a more sophisticated day period selector that takes lat/lon into account.
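To make the "swappable localization layer" idea concrete, here is a minimal Rust sketch. All names (`DayPeriodSelector`, `FixedBoundaries`, `SunsetAware`) and the boundary hours are hypothetical and are not part of any ICU4X API; the set of periods is deliberately smaller than the real CLDR list.

```rust
/// Coarse day periods; an illustrative set, not the full CLDR list.
#[derive(Debug, Clone, Copy, PartialEq)]
enum DayPeriod {
    Morning,
    Afternoon,
    Evening,
    Night,
}

/// Localization-layer hook (hypothetical): anything that can pick a day period.
trait DayPeriodSelector {
    fn select(&self, seconds_in_day: u32) -> DayPeriod;
}

/// Default selector: fixed hour boundaries, as might be loaded from locale data.
struct FixedBoundaries {
    morning: u32,
    afternoon: u32,
    evening: u32,
    night: u32,
}

impl DayPeriodSelector for FixedBoundaries {
    fn select(&self, seconds_in_day: u32) -> DayPeriod {
        let hour = seconds_in_day / 3600;
        if (self.morning..self.afternoon).contains(&hour) {
            DayPeriod::Morning
        } else if (self.afternoon..self.evening).contains(&hour) {
            DayPeriod::Afternoon
        } else if (self.evening..self.night).contains(&hour) {
            DayPeriod::Evening
        } else {
            DayPeriod::Night
        }
    }
}

/// Drop-in replacement: evening begins at a caller-supplied sunset time.
struct SunsetAware {
    base: FixedBoundaries,
    sunset_seconds: u32,
}

impl DayPeriodSelector for SunsetAware {
    fn select(&self, seconds_in_day: u32) -> DayPeriod {
        match self.base.select(seconds_in_day) {
            DayPeriod::Afternoon if seconds_in_day >= self.sunset_seconds => DayPeriod::Evening,
            DayPeriod::Evening if seconds_in_day < self.sunset_seconds => DayPeriod::Afternoon,
            other => other,
        }
    }
}

/// Made-up boundaries for the demo: 06:00, 12:00, 18:00, 21:00.
const DEMO: FixedBoundaries = FixedBoundaries { morning: 6, afternoon: 12, evening: 18, night: 21 };

fn main() {
    let by_sun = SunsetAware { base: DEMO, sunset_seconds: 17 * 3600 };
    let selectors: [&dyn DayPeriodSelector; 2] = [&DEMO, &by_sun];
    // Both selectors see the same input (17:30) but may disagree on the period.
    for selector in selectors.iter() {
        println!("{:?}", selector.select(17 * 3600 + 1800));
    }
}
```

The rendering layer only ever sees a `DayPeriod`, so a lat/lon-aware selector could replace the default without touching the calendar or rendering code.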

sffc commented 3 years ago

Another advantage of a global namespace for month names: several calendar systems like Japanese and Buddhist use the Gregorian month names, but they have their own system for years and eras. With a global namespace, the Buddhist calendar could request month name "jan" and pull from the same data as Gregorian. With a nested namespace, we'd need to either duplicate the month names, or implement a fallback mechanism.

sffc commented 3 years ago

A pseudo-global namespace would be something like,

month-names:
  gregory-m001: January
  gregory-m002: February
  # ...
  hebrew-m011: Shevat
  hebrew-m012: Adar
  hebrew-m013: Adar I
  hebrew-m014: Adar II
  # ...
  indian-m001: Chaitra
  indian-m002: Vaisākha
  # ...

The Buddhist calendar could request month name "gregory-m001".
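As a rough Rust sketch of that lookup (the function names `month_names` and `month_key` are hypothetical, and the table holds only a few of the sample entries above): the calendar layer decides which global key to ask for, and the data side is a flat map.

```rust
use std::collections::HashMap;

/// A small slice of the flat, global month-name table shown above.
fn month_names() -> HashMap<&'static str, &'static str> {
    HashMap::from([
        ("gregory-m001", "January"),
        ("gregory-m002", "February"),
        ("hebrew-m012", "Adar"),
        ("hebrew-m013", "Adar I"),
    ])
}

/// Calendar layer (hypothetical): picks the global key a calendar requests.
fn month_key(calendar: &str, code: &str) -> String {
    // Assumption from this proposal: Buddhist and Japanese reuse Gregorian names.
    let namespace = match calendar {
        "buddhist" | "japanese" => "gregory",
        other => other,
    };
    format!("{}-{}", namespace, code)
}

fn main() {
    let names = month_names();
    let key = month_key("buddhist", "m001");
    // The Buddhist calendar resolves to the Gregorian entry; no duplicated data.
    println!("{} -> {:?}", key, names.get(key.as_str()));
}
```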

And for eras,

era-names:
  # calendar-era-variant
  gregory-e00: Before Christ
  gregory-e00-v00: Before Common Era
  gregory-e01: Anno Domini
  gregory-e01-v00: Common Era
  # Modern Japan: start from era ID 1000
  japanese-e1000: Meiji
  japanese-e1001: Taishō
  japanese-e1002: Shōwa
  japanese-e1003: Heisei
  japanese-e1004: Reiwa
  # Pre-1868: count down from 1000 with space to add missing eras
  japanese-e0990: Keiō
  japanese-e0980: Genji
  # ...

macchiati commented 3 years ago

Another advantage of a global namespace for month names: several calendar systems like Japanese and Buddhist use the Gregorian month names, but they have their own system for years and eras. With a global namespace, the Buddhist calendar could request month name "jan" and pull from the same data as Gregorian. With a nested namespace, we'd need to either duplicate the month names, or implement a fallback mechanism.

As per discussion, it requires one more piece of information to be communicated, which is the calendar system.

A pseudo-global namespace would be something like,

I think it would be simpler to have:

calendar-system: japanese // tiny string
month-names-index: 12 // short int
era-names-index: 42 // short int

rather than:

month-names-index: japanese-m012 // string
era-names-index: japanese-e0042 // string

sffc commented 3 years ago

As per discussion, it requires one more piece of information to be communicated, which is the calendar system.

I'm sorry, I don't understand. I am proposing a model specifically designed to avoid the need to give the calendar system to the data bundle as a separate argument. Since Gregorian, Japanese, and Buddhist all share the same month names, my proposal is to have a single global namespace for month display names, and all three of those calendars can access the exact same resources.

I think it would be simpler to have:

calendar-system: japanese // tiny string
month-names-index: 12 // short int
era-names-index: 42 // short int

rather than:

month-names-index: japanese-m012 // string
era-names-index: japanese-e0042 // string

Your example still duplicates data in a nested calendar system structure. This is not what I'm proposing. My proposal is more along the lines of

month-names-index: gregory-m012 // tinystr pair
era-names-index: japanese-e0042 // tinystr pair

sffc commented 3 years ago

Separately, I'm not a very big fan of indexing the month names with numbers, because it is misleading when dealing with calendars with leap months. In the Hebrew calendar, one does not simply request "display name for the 12th month", because there are 2 possibilities for that display name. I think it is more clear to index month names by a string identifier to clearly indicate that there is not necessarily a correlation between the month number and the month name index.
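A small Rust sketch of why the month number alone is ambiguous. It assumes the illustrative keys from the sample data earlier in this thread (`hebrew-m012` = Adar, `hebrew-m013` = Adar I, `hebrew-m014` = Adar II) and a numbering in which Shevat is the 11th month; the function name is hypothetical.

```rust
/// Name identifier for the Nth month of a Hebrew year.
/// Assumption: keys and numbering follow the sample data in this thread.
fn hebrew_month_code(ordinal: u8, leap_year: bool) -> Option<String> {
    match (ordinal, leap_year) {
        // The first eleven months have the same name in every year.
        (1..=11, _) => Some(format!("hebrew-m{:03}", ordinal)),
        // The 12th month is "Adar" in a common year...
        (12, false) => Some("hebrew-m012".to_string()),
        // ...but "Adar I" in a leap year, followed by "Adar II".
        (12, true) => Some("hebrew-m013".to_string()),
        (13, true) => Some("hebrew-m014".to_string()),
        // A common year has no 13th month.
        _ => None,
    }
}

fn main() {
    // Same month number, two different display-name identifiers.
    println!("{:?}", hebrew_month_code(12, false));
    println!("{:?}", hebrew_month_code(12, true));
}
```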

sffc commented 3 years ago

@pedberg-icu pointed out that the data model for months needs to account for the CLDR month name patterns in the Chinese calendar:

https://unicode.org/reports/tr35/tr35-dates.html#monthPatterns_cyclicNameSets

In particular, the numeric form "M" in the Chinese calendar is not simply a number; it needs to have the month pattern applied to it.

For day periods, there seems to be agreement that a and b can be deterministically derived from the time of day. Day period B is the one that needs the more sophisticated algorithm.
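A minimal sketch of the deterministic part, assuming `a` means AM/PM and `b` adds noon and midnight as in UTS 35. Treating noon and midnight as exact instants (12:00:00 and 00:00:00) is a simplifying assumption here, and the names are hypothetical.

```rust
#[derive(Debug, PartialEq)]
enum Period {
    Am,
    Pm,
    Noon,
    Midnight,
}

/// Pattern symbol `a`: depends on the hour only.
fn period_a(hour: u8) -> Period {
    if hour < 12 {
        Period::Am
    } else {
        Period::Pm
    }
}

/// Pattern symbol `b`: like `a`, plus special cases for noon and midnight.
/// Assumption: the special cases apply only to the exact instants.
fn period_b(hour: u8, minute: u8, second: u8) -> Period {
    match (hour, minute, second) {
        (0, 0, 0) => Period::Midnight,
        (12, 0, 0) => Period::Noon,
        _ => period_a(hour),
    }
}

fn main() {
    println!("{:?}", period_a(12));
    println!("{:?}", period_b(12, 0, 0));
    println!("{:?}", period_b(12, 30, 0));
}
```

Neither function needs locale data or a location, which is what separates `a` and `b` from `B`.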

sffc commented 3 years ago

@pedberg-icu also noted that the Hebrew calendar no longer has any numeric patterns for the month. The month in that calendar is currently always represented by MMM.

I will follow up with an updated proposal.

pedberg-icu commented 3 years ago

On Oct 28, 2020, at 1:26 PM, Shane F. Carr notifications@github.com wrote:

@pedberg-icu https://github.com/pedberg-icu also noted that the Hebrew calendar no longer has any numeric patterns for the month. The month in that calendar is currently always represented by MMM.

Actually MMMM

Peter


macchiati commented 3 years ago

I put together a spreadsheet of calendar systems and items (eras & months)

https://docs.google.com/spreadsheets/d/1iIaJ-j-EQRyo0jPLp6rdStTwwBQYawoeRoRVJaR_Ytc/edit#gid=735277865

It coalesces items where CLDR aliases them in root. So, for example, because the buddhist calendar months alias to gregorian, the buddhist months don't need a separate enum. We would want to review those aliases to make sure they are correct and complete: that they are intentional, and there are no others that can be coalesced (eg maybe generic and gregorian).

Mark


sffc commented 3 years ago

The spreadsheet is very helpful; thanks!

My main takeaway is that we need only 133 strings (which need wide/short/narrow) to support all month names and all eras in all calendars*. That might be small enough that we can ship it in the ICU4X data provider by default, such that we support formatting in all calendar systems out of the box. If it's not small enough, we can give users options to remove data for calendars they don't use.

* As @yumaoka suggested, I omitted the pre-modern Japanese eras.

sffc commented 3 years ago

Okay, here's a proposal for how we can encode the data for month names, including leap months:

month_names:
  gregory-m001:
    long: January
    short: Jan
    narrow: J
    numeric: {0}
  # ...
  chinese-m001:
    long: First Month
    short: M01
    narrow: {0}
    numeric: {0}
  chinese-m001-leap:
    long: First Monthbis
    short: M01bis
    narrow: {0}b
    numeric: {0}bis

This trades a little extra data for less complicated code. The specification of the data would be:

Note: "Monthbis" is the language from the current CLDR specification. That's probably not right.
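For illustration, here is how a formatter might consume one such entry in Rust. The `MonthNames` and `Width` types and `format_month` are hypothetical; the only behavior assumed is that `{0}` is replaced by the month number and that entries without a placeholder are used verbatim.

```rust
/// One entry of the proposed month_names data (hypothetical Rust shape).
struct MonthNames {
    long: &'static str,
    short: &'static str,
    narrow: &'static str,
    numeric: &'static str,
}

enum Width {
    Long,
    Short,
    Narrow,
    Numeric,
}

/// Picks the pattern for the requested width and substitutes `{0}`.
fn format_month(names: &MonthNames, width: Width, month_number: u32) -> String {
    let pattern = match width {
        Width::Long => names.long,
        Width::Short => names.short,
        Width::Narrow => names.narrow,
        Width::Numeric => names.numeric,
    };
    // Patterns without a placeholder pass through unchanged.
    pattern.replace("{0}", &month_number.to_string())
}

/// The chinese-m001-leap sample entry from the data above.
const CHINESE_M001_LEAP: MonthNames = MonthNames {
    long: "First Monthbis",
    short: "M01bis",
    narrow: "{0}b",
    numeric: "{0}bis",
};

fn main() {
    println!("{}", format_month(&CHINESE_M001_LEAP, Width::Long, 1));
    println!("{}", format_month(&CHINESE_M001_LEAP, Width::Short, 1));
    println!("{}", format_month(&CHINESE_M001_LEAP, Width::Narrow, 1));
    println!("{}", format_month(&CHINESE_M001_LEAP, Width::Numeric, 1));
}
```

The formatting code has no leap-month branch; the leap handling lives entirely in which entry the calendar layer selects.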

new Date(2001, 5, 1).toLocaleDateString("en-u-ca-chinese", { dateStyle: "long" })
// "Fourth Monthbis 10, 2001(xin-si)"

mihnita commented 3 years ago

I think that there is a "hidden assumption" here that is not necessarily true.

We have calendars that are "Gregorian-like" (12 months, maybe even extending Gregorian in implementation, the way BuddhistCalendar, Japanese, Taiwan are). The calculations work, it's all good...

But it does not mean at all that the month names will be translated the same in all the languages. Just because in English we use "January" to name the first month of the Japanese Calendar, it does not necessarily mean that this is the case for all the languages, or that it will always be true in English.

In other words, something like "The Buddhist calendar could request month name 'gregory-m001'" unnecessarily ties together the MONTH NAMES of the Buddhist & Gregorian calendars, only because the two calendars are very close in behavior and linguistically close (for now) in English.

mihnita commented 3 years ago

I did a quick check. It looks like currently all "Gregorian months" are translated the same in most languages, except some Chinese locales:

zh-Hans-HK: gregory:二月 japanese:二月 buddhist:二月 roc:2月
zh-Hans-MO: gregory:二月 japanese:二月 buddhist:二月 roc:2月
zh-Hans-SG: gregory:2月 japanese:二月 buddhist:二月 roc:2月

So for zh-SG "gregory-m002" != "buddhist-m002", right now.

sffc commented 3 years ago

Okay, I'm convinced.

I think we can change the data model like this:

month_names:
  gregory:
    101:
      long: January
      short: Jan
      narrow: J
      numeric: {0}
    102:
      long: February
      short: Feb
      narrow: F
      numeric: {0}
    # ...
  chinese:
    101:
      long: First Month
      short: M01
      narrow: {0}
      numeric: {0}
    102:
      long: Second Month
      short: M02
      narrow: {0}
      numeric: {0}
    # ...
    201:
      long: First Monthbis
      short: M01bis
      narrow: {0}b
      numeric: {0}bis
    202:
      long: Second Monthbis
      short: M02bis
      narrow: {0}b
      numeric: {0}bis

If a language-calendar pair wants to fall back to a different calendar, we can use #259 to perform that fallback. However, if it wants to override the data, it can add an additional entry in the data structure above.

Q: Shane, why did you start month numbering at 101 instead of 1?

A: Because I really, really don't want people to get used to the idea of a month number being equivalent to a month name identifier. We already know this isn't the case in multiple calendar systems, like Chinese.
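One way to enforce that separation in Rust is a newtype, so a plain month number cannot be passed where a name identifier is expected. This is only a sketch: `MonthNameId`, `lookup`, and `demo_data` are hypothetical, and the "100 + n for normal, 200 + n for leap" scheme is inferred from the sample data above rather than stated anywhere.

```rust
use std::collections::HashMap;

/// Month *name* identifier; deliberately not interchangeable with a month number.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct MonthNameId(u16);

impl MonthNameId {
    /// Assumption inferred from the sample data: normal months start at 101.
    fn normal(n: u16) -> Self {
        MonthNameId(100 + n)
    }
    /// Assumption inferred from the sample data: leap months start at 201.
    fn leap(n: u16) -> Self {
        MonthNameId(200 + n)
    }
}

type CalendarMonths = HashMap<MonthNameId, &'static str>;

/// Nested lookup: calendar first, then name identifier.
fn lookup(
    data: &HashMap<&'static str, CalendarMonths>,
    calendar: &str,
    id: MonthNameId,
) -> Option<&'static str> {
    data.get(calendar)?.get(&id).copied()
}

/// A few of the "long" names from the sample data above.
fn demo_data() -> HashMap<&'static str, CalendarMonths> {
    HashMap::from([
        ("gregory", HashMap::from([(MonthNameId::normal(1), "January")])),
        (
            "chinese",
            HashMap::from([
                (MonthNameId::normal(1), "First Month"),
                (MonthNameId::leap(1), "First Monthbis"),
            ]),
        ),
    ])
}

fn main() {
    let data = demo_data();
    println!("{:?}", lookup(&data, "chinese", MonthNameId::leap(1)));
    // No Buddhist entry here: a real implementation would fall back or override.
    println!("{:?}", lookup(&data, "buddhist", MonthNameId::normal(1)));
}
```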

sffc commented 3 years ago

Shane to follow up with a concrete PR.

sffc commented 3 years ago

Depends on #409

sffc commented 3 years ago

Okay, I started something in #445.

I'm trying something a bit different than what I proposed above. Here's my trait:

pub trait NewDateTimeType {
    fn julian_day(&self) -> JulianDay;
    fn year(&self) -> Year;
    fn year_week(&self) -> Year;
    fn quarter(&self) -> Quarter;
    fn month(&self) -> Month;
    fn time(&self) -> Time;
}

Note: the Julian day is the number of days since the Julian epoch (Wikipedia).

Subtypes:

pub struct Era(pub TinyStr8);

pub struct CyclicYear(pub TinyStr8);

pub struct Quarter(pub u8);

pub struct MonthCode(pub TinyStr8);

pub struct JulianDay(pub i64);

pub struct Year {
    pub start: JulianDay,

    pub era: Era,
    pub number: usize,   // FIXME: i64
    pub extended: usize, // FIXME: i64
    pub cyclic: CyclicYear,
}

pub struct Month {
    pub start: JulianDay,

    pub number: usize, // FIXME: i64
    pub code: MonthCode,
}

pub enum FractionalSecond {
    Whole,
    Millisecond(u16),
    Microsecond(u32),
    Nanosecond(u32),
}

pub struct Time {
    pub hour: u8,
    pub minute: u8,
    pub second: u8,
    pub fractional: FractionalSecond,
}

I think that we can compute all of the UTS 35 fields from this information, except for B, under the following assumptions, which are expected to hold across calendar systems:
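As one reading of how the `start` fields could be used (this is my interpretation, not a statement of the PR's algorithm): day-of-month, day-of-year, and weekday fall out of Julian day arithmetic. The sketch below assumes `JulianDay` holds an integer Julian Day Number and uses hypothetical helper names.

```rust
/// Assumption: an integer Julian Day Number, as in `JulianDay(pub i64)`.
#[derive(Debug, Clone, Copy, PartialEq)]
struct JulianDay(i64);

/// 1-based count of days from `start` (first day of a month or year) to `day`.
fn day_within(day: JulianDay, start: JulianDay) -> i64 {
    day.0 - start.0 + 1
}

/// Weekday from the Julian Day Number: 0 = Monday ... 6 = Sunday.
fn weekday(day: JulianDay) -> u8 {
    day.0.rem_euclid(7) as u8
}

fn main() {
    // Gregorian 2000-02-15; JDN 2451545 corresponds to 2000-01-01.
    let year_start = JulianDay(2_451_545);
    let month_start = JulianDay(2_451_576);
    let today = JulianDay(2_451_590);

    println!("day of month: {}", day_within(today, month_start));
    println!("day of year:  {}", day_within(today, year_start));
    println!("weekday:      {}", weekday(today));
}
```

Nothing in these helpers is Gregorian-specific; a calendar only has to supply the Julian day of the date and of its month and year starts.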

I was considering two levels of traits: this shortcut trait, and another trait with full field coverage. However, I would like to keep as much of this part of the algorithm as possible inside the library.

sffc commented 3 years ago

New discovery regarding weeks: the choice of when to set the year cutoff on the week-of-year calendars (capital Y in patterns) appears to be locale-dependent, at least based on the current ICU4J implementation. For example, December 27, 1970 was a Sunday; it is the first day of 1971 in the American week calendar, but the last day of 1970 in the British week calendar.

To test: go to the icu4j MessageFormat demo and enter the following:

Switch the locale between en-US and en-GB to observe the difference.

Is the ICU4J behavior correct? (If so, it will affect the design of when we need to ingest locale information.)
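To show how the cutoff can depend on locale, here is a self-contained Rust sketch of a week-year calculation driven by two locale parameters: the first day of the week and the minimum number of days a week needs in the new year to count as week 1. The `WeekRules` type and function names are hypothetical, and the en-US (Sunday, 1) and en-GB (Monday, 4) values are my understanding of the CLDR week data, so treat them as assumptions.

```rust
/// Locale-dependent week rules (hypothetical type).
#[derive(Clone, Copy)]
struct WeekRules {
    first_weekday: i32, // 0 = Sunday ... 6 = Saturday
    min_days: i32,      // minimum days of week 1 that must fall in the new year
}

/// Assumed values for the two locales in the example above.
const US: WeekRules = WeekRules { first_weekday: 0, min_days: 1 };
const GB: WeekRules = WeekRules { first_weekday: 1, min_days: 4 };

/// Offset (in days, relative to January 1) at which week 1 of a year starts.
fn week_one_start(jan1_weekday: i32, rules: WeekRules) -> i32 {
    // Days of the week containing January 1 that precede January 1.
    let days_before = (jan1_weekday - rules.first_weekday).rem_euclid(7);
    let days_in_new_year = 7 - days_before;
    if days_in_new_year >= rules.min_days {
        -days_before // that week already counts as week 1
    } else {
        days_in_new_year // week 1 starts with the following week
    }
}

/// Week-of-year calendar year (pattern symbol `Y`) for a date.
/// `day_of_year` is 1-based; `weekday` uses 0 = Sunday ... 6 = Saturday.
fn week_year(year: i32, day_of_year: i32, weekday: i32, days_in_year: i32, rules: WeekRules) -> i32 {
    let offset = day_of_year - 1;
    let jan1 = (weekday - offset).rem_euclid(7);
    let next_jan1 = (jan1 + days_in_year).rem_euclid(7);
    if offset < week_one_start(jan1, rules) {
        year - 1
    } else if offset >= days_in_year + week_one_start(next_jan1, rules) {
        year + 1
    } else {
        year
    }
}

fn main() {
    // Sunday, December 27, 1970 is day 361 of a 365-day year.
    println!("en-US: {}", week_year(1970, 361, 0, 365, US));
    println!("en-GB: {}", week_year(1970, 361, 0, 365, GB));
}
```

Note that the function needs the length of the current year and, implicitly, where the neighboring years begin, in addition to the locale rules.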

zbraniecki commented 3 years ago

Is the ICU4J behavior correct? (If so, it will affect the design of when we need to ingest locale information.)

Yes it is. We had to implement minDays in CalendarInfo to support Week Of Year in calendar UI - https://firefox-source-docs.mozilla.org/intl/dataintl.html#mozintl-getcalendarinfo-locale

sffc commented 3 years ago

Is the ICU4J behavior correct? (If so, it will affect the design of when we need to ingest locale information.)

Yes it is. We had to implement minDays in CalendarInfo to support Week Of Year in calendar UI - https://firefox-source-docs.mozilla.org/intl/dataintl.html#mozintl-getcalendarinfo-locale

Cool. Is the algorithm for determining the Week of Year cutoff deterministic across calendar systems? Like, say that the first day of the year is a Wednesday. With a combination of the locale-specific data in mozIntl, you can figure out whether that Wednesday should be considered 2020 or 2021. Does that work in systems other than Gregorian? Or is "week of year" just not used anywhere other than Gregorian?

EDIT: I think we can structure the trait in a way that avoids the need to answer this question.

sffc commented 3 years ago

Here are my latest traits:

pub trait NewDateTimeType {
    fn year(&self) -> Year;
    fn prev_year(&self) -> Year;
    fn next_year(&self) -> Year;
    fn quarter(&self) -> Quarter;
    fn month(&self) -> Month;
    fn day_of_year(&self) -> DayOfYear;
    fn day_of_month(&self) -> DayOfMonth;
    fn weekday(&self) -> Weekday;
    fn time(&self) -> Time;
}

pub trait FullDateTime: NewDateTimeType {
    fn year_week(&self) -> Year;
    fn week_of_month(&self) -> WeekOfMonth;
    fn week_of_year(&self) -> WeekOfYear;
    fn flexible_day_period(&self) -> FlexibleDayPeriod;
}

The first, NewDateTimeType, is the one expected to be implemented by external date libraries. The second, FullDateTime, combines NewDateTimeType with a Locale to fill in additional information.

The .prev_year() and .next_year() functions are present only to support week-of-year calculations.

sffc commented 3 years ago

2021-01-15: What I proposed above looks OK.