unicode-org / message-format-wg

Developing a standard for localizable message strings

Restrict literals for `:date` and `:time` #680

Open eemeli opened 8 months ago

eemeli commented 8 months ago

This is a spin-off from #665, which conflates this issue's nominal topic with a discussion of the timeZone options. The text below is an extract of the original by @aphillips.


@eemeli noted that some literal values don't make sense with some of the date/time functions, notably a date literal with :time or a time literal with :date.

In LDML45, I propose:

Note that not accepting all three types on all three functions makes for potential issues with declarations, since the temporal value would be rejected instead of passed to later calls. Examples:

.local $date = {|2024-02-17| :datetime} // don't blow up!
{{Today is {$date}.}}
eemeli commented 8 months ago

For the tech preview, I'm starting to think we probably ought to restrict the literal values for each of :date, :time, and :datetime to only support a full date+time string. This is the only option that allows us to sidestep the following concerns:

  1. What is the result of formatting a time as a date?

    .local $t = {|12:34| :time}
    {{The date is {$t :date}.}}

    The choices here include at least the following:

    1. Emit an error, and fall back to {|12:34|}.
    2. The current date.
    3. An arbitrary "zero" date, such as 1900-01-01 or 1970-01-01.
  2. What is the result of formatting a date as a time?

    .local $d = {|2024-01-30| :date}
    {{The time is {$d :time}.}}

    The choices here include at least the following:

    1. Emit an error, and fall back to {|2024-01-30|}.
    2. The time 00:00 in the system time.
    3. The time 00:00 with the offset given to the date literal.
  3. Is it reasonable to require introducing new date/time parsing requirements for implementations?

This last one is of particular interest for JavaScript, where the language does not currently provide for a time-only parser. The Temporal proposal does introduce Temporal.PlainTime.from(), but that isn't in the spec yet. We already intend to include a note adding a dependency on Temporal regarding time zone name serialization, so I think extending that to also cover time parsing would be appropriate.

I also have a specific concern about relying on the XML date specification, as it supports including an optional timezone offset on a date without a time, such as 2024-02-22+02:00. This is invalid in ISO 8601, which would in many cases require implementing a custom parser just for :date.

On the other hand, if we only support a full XML dateTime, we don't need to find answers to the above questions in the very short term, and we only inconvenience a very marginal set of use cases.
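To make the JavaScript concern a bit more concrete, here is a minimal sketch of what the built-in `Date` parser does today. The helper `parseTimeOnZeroDate` is hypothetical and only illustrates the "arbitrary zero date" choice listed above; it is not a proposal.

```typescript
// Date-only ISO strings are part of the ECMAScript date-time string format
// and are interpreted as UTC midnight.
const dateOnly = new Date("2024-01-30");
console.log(dateOnly.toISOString()); // 2024-01-30T00:00:00.000Z

// A bare time such as "12:34" is outside that format, so what
// new Date("12:34") returns is implementation-defined.
// Hypothetical workaround: attach an arbitrary "zero" date (and UTC) first.
function parseTimeOnZeroDate(time: string): Date {
  return new Date(`1970-01-01T${time}Z`);
}

console.log(parseTimeOnZeroDate("12:34").toISOString()); // 1970-01-01T12:34:00.000Z
```

Note that the workaround has to pick both a date and an offset, which is exactly the kind of decision the questions above leave open.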

eemeli commented 8 months ago

As one minor relaxation, we could also support plain yyyy-mm-dd dates, as long as they did not include timezones. That's valid ISO 8601, and when used as a literal it's reasonable to fill out with a 00:00:00 time from the system timezone.

That corresponds to the following regexp once its whitespace is removed:

-?([1-9][0-9]{3,}|0[0-9]{3})
-(0[1-9]|1[0-2])
-(0[1-9]|[12][0-9]|3[01])
(T(([01][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9](\.[0-9]+)?|(24:00:00(\.0+)?))
(Z|[+-]((0[0-9]|1[0-3]):[0-5][0-9]|14:00))?)?
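As a rough sketch, the pattern above could be joined and anchored in TypeScript like this. The names `xsdParts` and `xsdLiteral` are illustrative only, and the `^...$` anchoring is an assumption rather than part of the pattern itself.

```typescript
// Illustrative only: join the lines of the pattern above into one anchored RegExp.
const xsdParts: string[] = [
  String.raw`-?([1-9][0-9]{3,}|0[0-9]{3})`,
  String.raw`-(0[1-9]|1[0-2])`,
  String.raw`-(0[1-9]|[12][0-9]|3[01])`,
  String.raw`(T(([01][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9](\.[0-9]+)?|(24:00:00(\.0+)?))`,
  String.raw`(Z|[+-]((0[0-9]|1[0-3]):[0-5][0-9]|14:00))?)?`,
];
const xsdLiteral = new RegExp(`^${xsdParts.join("")}$`);

console.log(xsdLiteral.test("2024-02-17"));           // true: plain date
console.log(xsdLiteral.test("2024-02-17T12:34:56Z")); // true: full dateTime
console.log(xsdLiteral.test("2024-02-22+02:00"));     // false: offset without a time
console.log(xsdLiteral.test("2024-02-17T12:34:60"));  // false: no leap seconds
```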
aphillips commented 8 months ago

The use of XMLSchema was semi-arbitrary on my part. We could use RFC3339 instead. The definitions there might be date-time and full-date. We might make full-time and partial-time optional (as you suggest) or save them for later. If we can take one, I'd take partial-time (i.e. with no offset).

I don't think it makes that much sense to require date-time/dateTime for every date/time literal. Users will want to write just one part or the other for :date or :time.

The offset in both XMLSchema and 3339 is tricky. It is NOT a time zone. It adjusts the value by the offset's number of hours and minutes. I can totally live without the headache on date literals or time literals in the Tech Preview.
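A small sketch of the "offset is not a time zone" point, using the built-in JavaScript `Date` purely for illustration: the offset only shifts the instant, and nothing about a zone (name, DST rules) survives parsing.

```typescript
// An offset shifts the instant; it carries no time zone identity or DST rules.
const withOffset = new Date("2024-02-22T12:00:00+02:00");
const sameInstantUtc = new Date("2024-02-22T10:00:00Z");

console.log(withOffset.toISOString());                          // 2024-02-22T10:00:00.000Z
console.log(withOffset.getTime() === sameInstantUtc.getTime()); // true
```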

eemeli commented 8 months ago

Based on RFC3339, the regexp would be a bit simpler:

[0-9]{4}
-(0[1-9]|1[0-2])
-(0[1-9]|[12][0-9]|3[01])
(T([01][0-9]|2[0-3]):[0-5][0-9]:([0-5][0-9]|60)(\.[0-9]+)?
(Z|[+-]([01][0-9]|2[0-3]):[0-5][0-9])?)?

If we wanted to hedge our bets, this is the subset that's valid for both:

[0-9]{4}
-(0[1-9]|1[0-2])
-(0[1-9]|[12][0-9]|3[01])
(T([01][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9](\.[0-9]+)?
(Z|[+-]((0[0-9]|1[0-3]):[0-5][0-9]|14:00))?)?

> I don't think it makes that much sense to require date-time/dateTime for every date/time literal. Users will want to write just one part or the other for :date or :time.

For :date we can support that pretty easily with either of the above, but I don't think we have the bandwidth to solve it for :time.
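For comparison, a minimal sketch of where the two patterns above diverge. The names `rfc3339Literal` and `commonLiteral` are made up for this example, and anchoring with `^...$` is an assumption.

```typescript
// Illustrative only: the RFC 3339-based pattern and the common subset, anchored.
const rfc3339Literal = new RegExp("^" + [
  String.raw`[0-9]{4}`,
  String.raw`-(0[1-9]|1[0-2])`,
  String.raw`-(0[1-9]|[12][0-9]|3[01])`,
  String.raw`(T([01][0-9]|2[0-3]):[0-5][0-9]:([0-5][0-9]|60)(\.[0-9]+)?`,
  String.raw`(Z|[+-]([01][0-9]|2[0-3]):[0-5][0-9])?)?`,
].join("") + "$");

const commonLiteral = new RegExp("^" + [
  String.raw`[0-9]{4}`,
  String.raw`-(0[1-9]|1[0-2])`,
  String.raw`-(0[1-9]|[12][0-9]|3[01])`,
  String.raw`(T([01][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9](\.[0-9]+)?`,
  String.raw`(Z|[+-]((0[0-9]|1[0-3]):[0-5][0-9]|14:00))?)?`,
].join("") + "$");

// Leap seconds and offsets beyond 14:00 pass only the RFC 3339-based pattern.
for (const input of ["2024-02-17", "2024-12-31T23:59:60Z", "2024-02-17T12:00:00+23:00"]) {
  console.log(input, rfc3339Literal.test(input), commonLiteral.test(input));
}
```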

eemeli commented 8 months ago

As pointed out by @hsivonen during the ECMA-402 call, the HTML spec provides another reasonable ISO 8601 subset for us to depend on. That one supports 4+ digit years > 0, space as well as T as the time separator, does not allow for leap seconds, has seconds optional, max 3 fractional second digits, and less than 24 hours of timezone offset.

Then there's one in the JS spec as well, which extends the HTML definition by allowing [+-]\d{6} years (except for -000000) and allows using , as the seconds decimal separator (but not space instead of T).

I had not really realised before how much variance there was on these, and how they seem to semi-randomly restrict the expression of some datetimes and allow others.

As far as I can tell, the core points of variance are:

Question XSD RFC3339 HTML JS
Allow negative years? yes yes
Allow year 0? yes yes
Allow years > 9999? yes yes yes
Allow the time 24:00:00? yes
Allow leap seconds (60)? yes
Are seconds optional? yes yes
Allow more than 3 fractional second digits? yes yes
Allow offsets more than 14:00? yes yes yes

As we aren't actually defining how a string gets parsed, and we're not really able to represent accurate limits on e.g. max days within a month using a regexp, I think the only sensible thing we can do is call our dates ISO 8601, make the regexp somewhat lax, and explicitly note that implementations may impose further restrictions on validity.

Which I think leaves us with this:

-?[0-9]{4,}
-(0[1-9]|1[0-2])
-(0[1-9]|[12][0-9]|3[01])
(T([01][0-9]|2[0-3]):[0-5][0-9]:([0-5][0-9]|60)(\.[0-9]+)?
(Z|[+-]([01][0-9]|2[0-3]):[0-5][0-9])?)?
aphillips commented 8 months ago

Yes, the variability sucks.

I disagree with just saying "ISO 8601" (which is far more expansive than just date/time values). And I disagree with being "permissive".

We want messages to be portable and the writing of messages to be portable--not just for developers/translators, but also for tools and the larger ecosystem. It would be one thing to require implementations to support a relaxed syntax. It's quite another for us to allow a relaxed syntax for input and then have all of the implementations be stricter (sort of an inverse Postel situation).

I would also be somewhat unhappy as an implementer if I had to support a custom date/time format.

One reason I changed to proposing 3339 above is that SEDATE extends 3339. Other folks will then need to consume the changes downstream. Support for real time zone names in date/time stamps isn't just important for Temporal. It's actually important for every time-related application.


A couple of side notes. The 8601-based syntax depends on a proleptic Gregorian calendar, so dates before 1582 CE (and a lot of dates after that, due to adoption issues) are already on a tenuous footing. The year 0 doesn't actually exist (the year before 1 CE was 1 BCE, although not, of course, to the inhabitants of the era: Romans might have said it was the year 738 AUC). I'm not against being able to represent timestamps that violate these (and other) restrictions. I just note the complexity and diminished utility of some of these features.

eemeli commented 8 months ago

To rephrase my general viewpoint here, I would like us to arrive at some solution where a valid JS implementation can use either new Date(input) or, once Temporal lands, Temporal.Instant.from(input) to parse a literal string input, without requiring all other implementations to re-implement the exact details of JS parsing. This matters especially because the two JS approaches have different limits.

To achieve that, the solution I propose is to explicitly define a regexp for some subset of ISO 8601 that is valid ~everywhere, and to allow implementations to accept inputs beyond that.
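A minimal sketch of how that could look in a JS implementation. Everything named here (`portableLiteral`, `ParseResult`, `parseLiteral`) is hypothetical, and the subset pattern that is valid for both XSD and RFC 3339 is used only as a stand-in for whatever regexp ends up being specified.

```typescript
// Assumption: the XSD/RFC 3339 common subset stands in for the "valid
// everywhere" regexp; the real pattern would come from the spec.
const portableLiteral = new RegExp(
  "^[0-9]{4}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])" +
    "(T([01][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9](\\.[0-9]+)?" +
    "(Z|[+-]((0[0-9]|1[0-3]):[0-5][0-9]|14:00))?)?$"
);

type ParseResult =
  | { ok: true; portable: boolean; value: Date }
  | { ok: false };

function parseLiteral(input: string): ParseResult {
  // Implementation-specific parsing; may accept more than the portable subset.
  const value = new Date(input);
  if (isNaN(value.getTime())) return { ok: false };
  // Flag whether other implementations are also expected to accept the input.
  return { ok: true, portable: portableLiteral.test(input), value };
}

console.log(parseLiteral("2024-02-17T12:34:56Z"));    // accepted, portable
console.log(parseLiteral("+002024-02-17T12:34:56Z")); // accepted by JS only
```

The `portable` flag is one way a tool might warn about the "worked in X but not in Y" case without rejecting the input outright.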

aphillips commented 8 months ago

Fair enough. That works for me, although I'm leery of the "my message worked in X but not on Y" problem.

eemeli commented 8 months ago

That's the reason for the custom regexp in #687: anything matching it should work everywhere, as long as it doesn't use a non-existent day like 2024-02-31.
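Since a regexp can't express days-per-month limits, an implementation would need a separate calendar check. A hedged sketch of one way to do that in JS follows; `isRealDate` and `hasRealDate` are hypothetical helpers, and the simple `yyyy-mm-dd` capture pattern is an assumption for illustration, not the regexp from #687.

```typescript
// Returns true only if year-month-day names a real proleptic Gregorian date.
function isRealDate(year: number, month: number, day: number): boolean {
  const d = new Date(0);
  d.setUTCFullYear(year, month - 1, day); // out-of-range days roll into the next month
  return (
    d.getUTCFullYear() === year &&
    d.getUTCMonth() === month - 1 &&
    d.getUTCDate() === day
  );
}

// Hypothetical helper: check the leading yyyy-mm-dd fields of a literal.
function hasRealDate(literal: string): boolean {
  const m = /^(\d{4})-(\d{2})-(\d{2})/.exec(literal);
  if (m === null) return false;
  return isRealDate(Number(m[1]), Number(m[2]), Number(m[3]));
}

console.log(hasRealDate("2024-02-29")); // true: 2024 is a leap year
console.log(hasRealDate("2024-02-31")); // false: no such day
```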

aphillips commented 8 months ago

Keeping this issue alive for further discussion of date/time/datetime literals.