Closed sffc closed 3 years ago
CC @mihnita @macchiati @markusicu @younies
Notes from design discussion on this subject with @mihnita @macchiati @markusicu @younies and others:
Mark: Did you cross-check these against the LDML spec? And against the ICU calendar?
Shane: I referenced LDML.
Mark: We always assume hour/minute/second for all countries. We could stick with milliseconds_in_day.
Mark: I would go with one field per format symbol.
Mihai: On the time, the milliseconds_in_day field is a misnomer.
Shane: Yeah; I got that name from LDML. But "millisecond_of_day" would be better.
Mihai: I would comment that in general, these fields are all convertible with one another.
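To illustrate the point about convertibility, here is a minimal sketch of converting between hour/minute/second fields and a single "millisecond_of_day" value. The `ClockTime` type and both function names are hypothetical and not part of any proposal in this thread; the sketch assumes a plain 24-hour day with no leap seconds.

```rust
/// Wall-clock time split into fields (hypothetical type for this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ClockTime {
    pub hour: u8,
    pub minute: u8,
    pub second: u8,
    pub millisecond: u16,
}

const MS_PER_SECOND: u32 = 1_000;
const MS_PER_MINUTE: u32 = 60 * MS_PER_SECOND;
const MS_PER_HOUR: u32 = 60 * MS_PER_MINUTE;

/// Collapse the separate fields into one millisecond-of-day value.
pub fn to_millisecond_of_day(t: ClockTime) -> u32 {
    u32::from(t.hour) * MS_PER_HOUR
        + u32::from(t.minute) * MS_PER_MINUTE
        + u32::from(t.second) * MS_PER_SECOND
        + u32::from(t.millisecond)
}

/// Split a millisecond-of-day value back into fields.
/// Assumes `ms` is less than 86_400_000 (one day).
pub fn from_millisecond_of_day(ms: u32) -> ClockTime {
    ClockTime {
        hour: (ms / MS_PER_HOUR) as u8,
        minute: (ms % MS_PER_HOUR / MS_PER_MINUTE) as u8,
        second: (ms % MS_PER_MINUTE / MS_PER_SECOND) as u8,
        millisecond: (ms % MS_PER_SECOND) as u16,
    }
}
```

Because the two functions are inverses within one day, either representation could serve as the stored form and the other could be derived on demand.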
Markus: It makes more sense to me for variant selection to go elsewhere. Don't overload the structure.
Mark: We were thinking of putting this in the display context. I think that works better than having the calendar system pick it. It's a formatting issue, not a calendaring issue.
Markus: In Hebrew, we have a choice between printing digits, or having Hebrew do a spellout. That is a display option. I would expect to have a day of the month as a number in the struct, and have the display option live elsewhere.
Mihai: I think AD versus CE belongs in the same bucket as numbering system choice.
Markus: How does this work in ICU?
Mark: CLDR mixes the numbers and the identifiers and does resolution separately.
Markus: Shane is proposing to put the decision in the calendar layer, and keep the display layer as a lookup.
Mark: Yeah, I would go with a month number and an identifier.
Shane: It sounds like we have agreement on two separate fields for month number and month name.
Mark: This seems excessive. For weekdays, we just look it up according to the language. For month names, we're expanding it to cover every calendar system, when you could cover it with 13 digits plus a separate field for the calendar system.
Markus: It could work. It doesn't feel quite right. When we build up our structures, we have 12 or 13 month names. For month names, we always have a complete array.
Mihai: You could have an array with indices. I feel uneasy seeing calendar-specific keys in the data structure.
Mark: It seems simpler to have number indices.
Shane: You say 12 or 13 months, but Hebrew has 14 months, even though only 13 of them can occur in each year. Chinese has 24 months, 12 normal and 12 leap, even though a year has no more than 13 of them at a time. My proposal is to put the display names of months into a global namespace.
Markus: It feels weird, but I think it could work. I can't put my thumb on why it wouldn't work. Maybe we can try it and see how it works.
Markus: For eras, in the Japanese calendar, going back in time is murky. You should pick an era as era 0 and count forward and backward from there.
Mihai: For Japanese, if you use numbers, for new eras, you just add another number, which is easier than inventing a new string. Especially since you don't know the name of the era until a few weeks in advance.
Mark: The advantage of an index is that for most calendar systems, the name and the month number correspond to each other. The problem is that in certain calendar systems, the two are not correlated.
Mark: You could do "m01", "m02", …; having a string "foobar" doesn't help me know that it's the 7th month in the calendar system.
Mark: I think the day period should be decided by the calendar system. But, it tends to be a language/region computation rather than a calendar system calculation. So you could put it either place.
Mihai: To me it feels like a locale-specific thing.
Mark: The eras are closely linked to the calendar system. But the day periods are dissociated. If I were to do anything on the display side, it would be the day period stuff.
Mihai: When you go from the calendar to month names, you go from something locale-independent and then you make it locale-dependent. Going from month 6 to a string is a matter of translation. But deciding what is "afternoon" is locale-specific.
Shane: What data is required to figure out the day period?
Mark: We make an approximation. Seconds in day is sufficient for the calculation. In theory, we should go beyond that, and look not only at your locale, but also your location, because in a lot of places, evening starts at sunset. In China, sunset is at a very different time depending on your latitude and longitude.
Mihai: There are two layers: the calendar layer, and the locale layer. Some decisions are made on the calendar layer, and other decisions are made on the locale layer.
Shane: I would go further and say there are 3 layers: the calendar layer, the localization layer, and the rendering layer. The localization layer could include both day period resolution and era name selection. It could be swapped out for a more sophisticated day period selector that takes lat/lon into account.
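A rough sketch of what a swappable day period selector in the localization layer might look like. Everything here is hypothetical: the trait, the type names, and the boundary table are illustrations only, and the hour boundaries are placeholder values rather than data taken from CLDR. A selector that takes lat/lon into account could implement the same trait.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FlexibleDayPeriod {
    Morning,
    Afternoon,
    Evening,
    Night,
}

/// Hypothetical extension point: anything that can map a time of day to a
/// day period. A location-aware implementation could replace the default.
pub trait DayPeriodSelector {
    fn select(&self, second_of_day: u32) -> FlexibleDayPeriod;
}

/// Selector driven by fixed boundaries, as in the "approximation" above.
pub struct FixedBoundaries {
    /// (start second of day, inclusive; period), sorted ascending by start.
    pub rules: Vec<(u32, FlexibleDayPeriod)>,
    /// Period in effect before the first boundary of the day.
    pub before_first: FlexibleDayPeriod,
}

impl DayPeriodSelector for FixedBoundaries {
    fn select(&self, second_of_day: u32) -> FlexibleDayPeriod {
        // Walk the boundaries from latest to earliest and take the first
        // one that has already started.
        self.rules
            .iter()
            .rev()
            .find(|(start, _)| second_of_day >= *start)
            .map(|(_, period)| *period)
            .unwrap_or(self.before_first)
    }
}

/// Placeholder rule set; the boundaries are assumptions for illustration.
pub fn illustrative_rules() -> FixedBoundaries {
    let hour = 3_600;
    FixedBoundaries {
        rules: vec![
            (6 * hour, FlexibleDayPeriod::Morning),
            (12 * hour, FlexibleDayPeriod::Afternoon),
            (18 * hour, FlexibleDayPeriod::Evening),
            (21 * hour, FlexibleDayPeriod::Night),
        ],
        before_first: FlexibleDayPeriod::Night,
    }
}
```

The calendar layer would only need to hand over the second of day; the rule set (or a sunset-aware replacement) stays entirely on the localization side.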
Another advantage of a global namespace for month names: several calendar systems like Japanese and Buddhist use the Gregorian month names, but they have their own system for years and eras. With a global namespace, the Buddhist calendar could request month name "jan" and pull from the same data as Gregorian. With a nested namespace, we'd need to either duplicate the month names, or implement a fallback mechanism.
A pseudo-global namespace would be something like,
month-names:
  gregory-m001: January
  gregory-m002: February
  # ...
  hebrew-m011: Shevat
  hebrew-m012: Adar
  hebrew-m013: Adar I
  hebrew-m014: Adar II
  # ...
  indian-m001: Chaitra
  indian-m002: Vaisākha
  # ...
The Buddhist calendar could request month name "gregory-m001".
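As a sketch of how such a lookup could work, the snippet below resolves a calendar plus month code to a key in the flat namespace. The function names and the hard-coded list of calendars that borrow Gregorian month names are assumptions made for illustration; the sample data reuses the keys from the listing above.

```rust
use std::collections::HashMap;

/// Build the global-namespace key for a month. Calendars that reuse the
/// Gregorian month names are redirected to the "gregory" entries.
/// (The list of redirected calendars is an assumption for this sketch.)
pub fn month_name_key(calendar: &str, month_code: &str) -> String {
    let namespace = match calendar {
        "buddhist" | "japanese" => "gregory",
        other => other,
    };
    format!("{}-{}", namespace, month_code)
}

/// A tiny sample of the flat month-name table.
pub fn sample_month_names() -> HashMap<&'static str, &'static str> {
    HashMap::from([
        ("gregory-m001", "January"),
        ("gregory-m002", "February"),
        ("hebrew-m012", "Adar"),
        ("hebrew-m013", "Adar I"),
        ("hebrew-m014", "Adar II"),
    ])
}
```

With this shape, the data bundle never needs the calendar system as a separate argument: the key alone identifies the string.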
And for eras,
era-names:
  # calendar-era-variant
  gregory-e00: Before Christ
  gregory-e00-v00: Before Common Era
  gregory-e01: Anno Domini
  gregory-e01-v00: Common Era
  # Modern Japan: start from era ID 1000
  japanese-e1000: Meiji
  japanese-e1001: Taishō
  japanese-e1002: Shōwa
  japanese-e1003: Heisei
  japanese-e1004: Reiwa
  # Pre-1868: count down from 1000 with space to add missing eras
  japanese-e0990: Keiō
  japanese-e0980: Genji
  # ...
Another advantage of a global namespace for month names: several calendar systems like Japanese and Buddhist use the Gregorian month names, but they have their own system for years and eras. With a global namespace, the Buddhist calendar could request month name "jan" and pull from the same data as Gregorian. With a nested namespace, we'd need to either duplicate the month names, or implement a fallback mechanism.
As per discussion, it requires one more piece of information to be communicated, which is the calendar system.
A pseudo-global namespace would be something like,
I think it would be simpler to have:
calendar-system: japanese // tiny string
month-names-index: 12 // short int
era-names-index: 42 // short int
rather than:
month-names-index: japanese-m012 // string
era-names-index: japanese-e0042 // string
As per discussion, it requires one more piece of information to be communicated, which is the calendar system.
I'm sorry, I don't understand. I am proposing a model specifically designed to avoid the need to give the calendar system to the data bundle as a separate argument. Since Gregorian, Japanese, and Buddhist all share the same month names, my proposal is to have a single global namespace for month display names, and all three of those calendars can access the exact same resources.
I think it would be simpler to have:
calendar-system: japanese // tiny string
month-names-index: 12 // short int
era-names-index: 42 // short int
rather than:
month-names-index: japanese-m012 // string
era-names-index: japanese-e0042 // string
Your example still duplicates data in a nested calendar system structure. This is not what I'm proposing. My proposal is more along the lines of
month-names-index: gregory-m012 // tinystr pair
era-names-index: japanese-e0042 // tinystr pair
Separately, I'm not a very big fan of indexing the month names with numbers, because it is misleading when dealing with calendars with leap months. In the Hebrew calendar, one does not simply request "display name for the 12th month", because there are 2 possibilities for that display name. I think it is more clear to index month names by a string identifier to clearly indicate that there is not necessarily a correlation between the month number and the month name index.
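To make the Hebrew case concrete, here is a small sketch showing why a month number alone does not determine the name key. It covers only the months listed in the pseudo-namespace earlier in this thread, and it assumes the ordinal numbering implied by those keys (Shevat as the 11th month); the function name is hypothetical.

```rust
/// Map a month's ordinal position in the year to its name key.
/// Only the months around Adar are covered; other positions return None.
pub fn hebrew_month_key(month_number: u8, is_leap_year: bool) -> Option<&'static str> {
    match (month_number, is_leap_year) {
        (11, _) => Some("hebrew-m011"),     // Shevat
        (12, false) => Some("hebrew-m012"), // Adar (common year)
        (12, true) => Some("hebrew-m013"),  // Adar I (leap year)
        (13, true) => Some("hebrew-m014"),  // Adar II (leap year only)
        _ => None,
    }
}
```

The same number 12 yields two different keys depending on the year, which is the ambiguity a string identifier avoids.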
@pedberg-icu pointed out that the data model for months needs to account for the CLDR month name patterns in the Chinese calendar:
https://unicode.org/reports/tr35/tr35-dates.html#monthPatterns_cyclicNameSets
In particular, the numeric form "M" in the Chinese calendar is not simply a number; it needs to have the month pattern applied to it.
For day periods, there seems to be agreement that "a" and "b" can be deterministically derived from the time of day. Day period "B" is the one that needs the more sophisticated algorithm.
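A minimal sketch of the deterministic derivation for the two simple day period fields. The type and function names are hypothetical, and treating noon and midnight as exact instants (minute and second both zero) is an assumption of this sketch rather than something settled in the discussion.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FixedDayPeriod {
    Am,
    Pm,
    Noon,
    Midnight,
}

/// Field "a": AM or PM, decided by the hour alone.
pub fn day_period_a(hour: u8) -> FixedDayPeriod {
    if hour < 12 {
        FixedDayPeriod::Am
    } else {
        FixedDayPeriod::Pm
    }
}

/// Field "b": like "a", but with noon and midnight called out.
/// Assumes noon and midnight apply only at exactly 12:00:00 and 00:00:00.
pub fn day_period_b(hour: u8, minute: u8, second: u8) -> FixedDayPeriod {
    match (hour, minute, second) {
        (0, 0, 0) => FixedDayPeriod::Midnight,
        (12, 0, 0) => FixedDayPeriod::Noon,
        _ => day_period_a(hour),
    }
}
```

Neither function needs locale data, which is why these two fields can be computed without the localization layer.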
@pedberg-icu also noted that the Hebrew calendar no longer has any numeric patterns for the month. The month in that calendar is currently always represented by MMM.
I will follow up with an updated proposal.
On Oct 28, 2020, at 1:26 PM, Shane F. Carr wrote:
@pedberg-icu also noted that the Hebrew calendar no longer has any numeric patterns for the month. The month in that calendar is currently always represented by MMM.
Actually MMMM
- Peter
I put together a spreadsheet of calendar systems and items (eras & months)
It coalesces items where CLDR aliases them in root. So, for example, because the buddhist calendar months alias to gregorian, the buddhist months don't need a separate enum. We would want to review those aliases to make sure they are correct and complete: that they are intentional, and that there are no others that can be coalesced (e.g., maybe generic and gregorian).
Mark
The spreadsheet is very helpful; thanks!
My main takeaway is that we need only 133 strings (which need wide/short/narrow) to support all month names and all eras in all calendars*. That might be small enough that we can ship it in the ICU4X data provider by default, such that we support formatting in all calendar systems out of the box. If it's not small enough, we can give users options to remove data for calendars they don't use.
* As @yumaoka suggested, I omitted the pre-modern Japanese eras.
Okay, here's a proposal for how we can encode the data for month names, including leap months:
month_names:
  gregory-m001:
    long: January
    short: Jan
    narrow: J
    numeric: "{0}"
  # ...
  chinese-m001:
    long: First Month
    short: M01
    narrow: "{0}"
    numeric: "{0}"
  chinese-m001-leap:
    long: First Monthbis
    short: M01bis
    narrow: "{0}b"
    numeric: "{0}bis"
This trades a little extra data for less complicated code. The specification of the data would be: occurrences of "{0}" are substituted with the month number formatted in the local numbering system.
Note: "Monthbis" is the language from the current CLDR specification. That's probably not right.
new Date(2001, 5, 1).toLocaleDateString("en-u-ca-chinese", { dateStyle: "long" })
// "Fourth Monthbis 10, 2001(xin-si)"
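The "{0}" substitution in the data model above can be sketched as follows. The function names are hypothetical, and the two digit formatters stand in for "the local numbering system"; a real implementation would presumably delegate to a number formatter.

```rust
/// Substitute every "{0}" in a month pattern with the formatted month number.
/// Patterns without a placeholder (e.g. "January") pass through unchanged.
pub fn apply_month_pattern(
    pattern: &str,
    month_number: u32,
    format_number: impl Fn(u32) -> String,
) -> String {
    pattern.replace("{0}", &format_number(month_number))
}

/// Stand-in formatter for Latin (ASCII) digits.
pub fn latin_digits(n: u32) -> String {
    n.to_string()
}

/// Stand-in formatter for Arabic-Indic digits (U+0660..U+0669).
pub fn arabic_indic_digits(n: u32) -> String {
    n.to_string()
        .chars()
        .map(|c| {
            let digit = c.to_digit(10).expect("decimal digit");
            char::from_u32(0x0660 + digit).expect("valid code point")
        })
        .collect()
}
```

For example, applying the leap-month numeric pattern "{0}bis" to month 4 with Latin digits gives "4bis", while a pattern such as "First Month" is returned as-is.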
I think that there is a "hidden assumption" here that is not necessarily true.
We have calendars that are "Gregorian-like" (12 months, maybe even extending Gregorian in implementation, the way BuddhistCalendar, Japanese, Taiwan are). The calculations work, it's all good...
But it does not mean at all that the month names will be translated the same in all the languages. Just because in English we use "January" to name the first month of the Japanese Calendar it does not necessarily mean that this is the case for all the languages. Or that will always be true in English.
In other words, something like "The Buddhist calendar could request month name 'gregory-m001'" unnecessarily ties together the MONTH NAMES of the Buddhist & Gregorian calendars, only because the two calendars are very close in behavior, and linguistically close (for now), in English.
I did a quick check. It looks like currently all "Gregorian months" are translated the same in most languages, except some Chinese locales:
zh-Hans-HK: gregory:二月 japanese:二月 buddhist:二月 roc:2月
zh-Hans-MO: gregory:二月 japanese:二月 buddhist:二月 roc:2月
zh-Hans-SG: gregory:2月 japanese:二月 buddhist:二月 roc:2月
So for zh-SG "gregory-m002" != "buddhist-m002", right now.
Okay, I'm convinced.
I think we can change the data model like this:
month_names:
  gregory:
    101:
      long: January
      short: Jan
      narrow: J
      numeric: "{0}"
    102:
      long: February
      short: Feb
      narrow: F
      numeric: "{0}"
    # ...
  chinese:
    101:
      long: First Month
      short: M01
      narrow: "{0}"
      numeric: "{0}"
    102:
      long: Second Month
      short: M02
      narrow: "{0}"
      numeric: "{0}"
    # ...
    201:
      long: First Monthbis
      short: M01bis
      narrow: "{0}b"
      numeric: "{0}bis"
    202:
      long: Second Monthbis
      short: M02bis
      narrow: "{0}b"
      numeric: "{0}bis"
If a language-calendar pair wants to fall back to a different calendar, we can use #259 to perform that fallback. However, if it wants to override the data, it can add an additional entry in the data structure above.
Q: Shane, why did you start month numbering at 101 instead of 1?
A: Because I really, really don't want people to get used to the idea of a month number being equivalent to a month name identifier. We already know this isn't the case in multiple calendar systems, like Chinese.
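A sketch of how the 101/201 identifiers could be kept distinct from plain month numbers in code. The `MonthId` newtype, its constructors, and the sample table are hypothetical; only the numbering convention (100 plus the number for regular months, 200 plus the number for leap months) comes from the data model above.

```rust
use std::collections::HashMap;

/// Month name identifier. Deliberately not a plain month number.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct MonthId(pub u16);

impl MonthId {
    /// Identifier for a regular month, e.g. 1 -> 101.
    pub fn regular(number: u16) -> Self {
        MonthId(100 + number)
    }

    /// Identifier for a leap month, e.g. 1 -> 201.
    pub fn leap(number: u16) -> Self {
        MonthId(200 + number)
    }
}

/// calendar -> month id -> long display name (sample data only).
pub type MonthNames = HashMap<&'static str, HashMap<MonthId, &'static str>>;

pub fn sample_long_names() -> MonthNames {
    HashMap::from([
        (
            "gregory",
            HashMap::from([(MonthId(101), "January"), (MonthId(102), "February")]),
        ),
        (
            "chinese",
            HashMap::from([
                (MonthId(101), "First Month"),
                (MonthId(201), "First Monthbis"),
            ]),
        ),
    ])
}

/// Look up a long month name; None if the calendar or month is absent.
pub fn long_name(data: &MonthNames, calendar: &str, id: MonthId) -> Option<&'static str> {
    data.get(calendar)?.get(&id).copied()
}
```

Wrapping the identifier in its own type means a caller cannot accidentally pass a month number where a name identifier is expected, which is the same goal as starting the numbering at 101.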
Shane to follow up with a concrete PR.
Depends on #409
Okay, I started something in #445.
I'm trying something a bit different than what I proposed above. Here's my trait:
pub trait NewDateTimeType {
    fn julian_day(&self) -> JulianDay;
    fn year(&self) -> Year;
    fn year_week(&self) -> Year;
    fn quarter(&self) -> Quarter;
    fn month(&self) -> Month;
    fn time(&self) -> Time;
}
Note: the Julian day is the number of days since the Julian epoch (Wikipedia).
Subtypes:
pub struct Era(pub TinyStr8);
pub struct CyclicYear(pub TinyStr8);
pub struct Quarter(pub u8);
pub struct MonthCode(pub TinyStr8);
pub struct JulianDay(pub i64);

pub struct Year {
    pub start: JulianDay,
    pub era: Era,
    pub number: usize, // FIXME: i64
    pub extended: usize, // FIXME: i64
    pub cyclic: CyclicYear,
}

pub struct Month {
    pub start: JulianDay,
    pub number: usize, // FIXME: i64
    pub code: MonthCode,
}

pub enum FractionalSecond {
    Whole,
    Millisecond(u16),
    Microsecond(u32),
    Nanosecond(u32),
}

pub struct Time {
    pub hour: u8,
    pub minute: u8,
    pub second: u8,
    pub fractional: FractionalSecond,
}
I think that we can compute all of the UTS 35 fields from this information, except for "B", with the following assumptions, which are expected to hold across calendar systems:
I was considering two levels of traits: this shortcut trait, and another trait with full field coverage. However, I would like to keep as much of this part of the algorithm as possible inside the library.
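As an illustration of deriving fields from the Julian day and the period start days, here is a minimal sketch. It is not the list of assumptions from the proposal; it only assumes that `JulianDay` holds an integer Julian Day Number in the usual astronomical convention (day 0 is a Monday, and 2440588 is 1970-01-01 Gregorian). The `Weekday` enum and function names are hypothetical.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct JulianDay(pub i64);

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Weekday {
    Sunday,
    Monday,
    Tuesday,
    Wednesday,
    Thursday,
    Friday,
    Saturday,
}

/// Weekday from the Julian Day Number: (JDN + 1) mod 7, with 0 = Sunday.
pub fn weekday(day: JulianDay) -> Weekday {
    match (day.0 + 1).rem_euclid(7) {
        0 => Weekday::Sunday,
        1 => Weekday::Monday,
        2 => Weekday::Tuesday,
        3 => Weekday::Wednesday,
        4 => Weekday::Thursday,
        5 => Weekday::Friday,
        _ => Weekday::Saturday,
    }
}

/// 1-based day within a period (month or year), given the Julian day on
/// which that period starts, e.g. `Month::start` or `Year::start`.
pub fn day_of_period(day: JulianDay, period_start: JulianDay) -> i64 {
    day.0 - period_start.0 + 1
}
```

With the year and month start days available, day-of-year and day-of-month are both a single subtraction, so the implementing date library would not need to supply them separately.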
New discovery regarding weeks: the choice of when to set the year cutoff on the week-of-year calendars (capital Y in patterns) appears to be locale-dependent, at least based on the current ICU4J implementation. For example, December 27, 1970 was a Sunday; it is the first day of 1971 in the American week calendar, but the last day of 1970 in the British week calendar.
To test: go to the icu4j MessageFormat demo and enter the following:
{0, date, EEEE, 'week' w 'of' Y (MMM dd)}
1970-12-27 00:00:00
Switch the locale between en-US and en-GB to observe the difference.
Is the ICU4J behavior correct? (If so, it will affect the design of when we need to ingest locale information.)
Is the ICU4J behavior correct? (If so, it will affect the design of when we need to ingest locale information.)
Yes it is. We had to implement minDays in CalendarInfo to support Week Of Year in calendar UI - https://firefox-source-docs.mozilla.org/intl/dataintl.html#mozintl-getcalendarinfo-locale
Is the ICU4J behavior correct? (If so, it will affect the design of when we need to ingest locale information.)
Yes it is. We had to implement minDays in CalendarInfo to support Week Of Year in calendar UI - https://firefox-source-docs.mozilla.org/intl/dataintl.html#mozintl-getcalendarinfo-locale
Cool. Is the algorithm for determining the Week of Year cutoff deterministic across calendar systems? Like, say that the first day of the year is a Wednesday. With a combination of the locale-specific data in mozIntl, you can figure out whether that Wednesday should be considered 2020 or 2021. Does that work in systems other than Gregorian? Or is "week of year" just not used anywhere other than Gregorian?
EDIT: I think we can structure the trait in a way that avoids the need to answer this question.
Here are my latest traits:
pub trait NewDateTimeType {
    fn year(&self) -> Year;
    fn prev_year(&self) -> Year;
    fn next_year(&self) -> Year;
    fn quarter(&self) -> Quarter;
    fn month(&self) -> Month;
    fn day_of_year(&self) -> DayOfYear;
    fn day_of_month(&self) -> DayOfMonth;
    fn weekday(&self) -> Weekday;
    fn time(&self) -> Time;
}

pub trait FullDateTime: NewDateTimeType {
    fn year_week(&self) -> Year;
    fn week_of_month(&self) -> WeekOfMonth;
    fn week_of_year(&self) -> WeekOfYear;
    fn flexible_day_period(&self) -> FlexibleDayPeriod;
}
The first, NewDateTimeType, is the one expected to be implemented by external date libraries. The second, FullDateTime, combines NewDateTimeType with a Locale to fill in additional information.
The .prev_year() and .next_year() functions are present only to support week-of-year calculations.
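A sketch of why the neighbouring years are needed: a date near a year boundary can belong to the last week of the previous year or the first week of the next one, depending on locale rules. The types, names, and inputs below are hypothetical and are not the ICU4J algorithm; the sketch assumes the year lengths would be obtained from the previous and current year values, and that a locale supplies a first weekday and a minDays value. The en-US (Sunday, 1) and en-GB (Monday, 4) settings mentioned afterwards are assumptions used to reproduce the 1970-12-27 example from earlier in the thread.

```rust
/// Which calendar year the week belongs to, relative to the date's own year.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum WeekYear {
    Previous,
    Current,
    Next,
}

/// Locale-dependent inputs. Weekdays are numbered 0 = Sunday .. 6 = Saturday.
#[derive(Debug, Clone, Copy)]
pub struct WeekRules {
    pub first_weekday: u8,
    pub min_days: u8,
}

/// Week number of a 1-based `day_of_year`, counted from the start of its
/// year. Returns 0 when the date falls before the first week of that year.
fn week_number(day_of_year: u16, weekday: u8, rules: WeekRules) -> u16 {
    // Position of the date within its week (0 = first day of the week).
    let rel_weekday = i32::from((7 + weekday - rules.first_weekday) % 7);
    // Same position for day 1 of the year.
    let rel_weekday_day1 = (rel_weekday - (i32::from(day_of_year) - 1)).rem_euclid(7) as u16;
    let mut week = (day_of_year - 1 + rel_weekday_day1) / 7;
    // The partial week containing day 1 counts only if it is long enough.
    if 7 - rel_weekday_day1 >= u16::from(rules.min_days) {
        week += 1;
    }
    week
}

/// Week-of-year plus the year it is counted in.
/// Expects 1 <= day_of_year <= days_in_year and weekdays in 0..=6.
pub fn week_of_year(
    day_of_year: u16,
    weekday: u8,
    days_in_year: u16,
    days_in_prev_year: u16,
    rules: WeekRules,
) -> (WeekYear, u16) {
    let week = week_number(day_of_year, weekday, rules);
    if week == 0 {
        // Count from the start of the previous year instead.
        let week = week_number(day_of_year + days_in_prev_year, weekday, rules);
        return (WeekYear::Previous, week);
    }
    // Does week 1 of the next year begin on or before this date?
    let rel_weekday = u16::from((7 + weekday - rules.first_weekday) % 7);
    let rel_weekday_next_day1 = (rel_weekday + (days_in_year + 1 - day_of_year)) % 7;
    let next_week1_counts = 7 - rel_weekday_next_day1 >= u16::from(rules.min_days);
    let next_week1_start = days_in_year + 1 - rel_weekday_next_day1;
    if next_week1_counts && day_of_year >= next_week1_start {
        return (WeekYear::Next, 1);
    }
    (WeekYear::Current, week)
}
```

With the assumed en-US rules, day 361 of 1970 (a Sunday) comes out as week 1 of the next year; with the assumed en-GB rules it comes out as week 52 of the current year, matching the behaviour observed in the demo.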
2021-01-15: What I proposed above looks OK.
In datetime-input.md, I put forth a proposal for the trait used as input into DateTimeFormat. This issue is to track follow-up discussions and check in the result.
Task list: