unicode-org / icu4x

Solving i18n for client-side and resource-constrained environments.
https://icu4x.unicode.org
Other
1.3k stars 167 forks source link

Are we okay with producing era aliases in ICU4X? #3869

Open Manishearth opened 11 months ago

Manishearth commented 11 months ago

Temporal defines era codes, with the canonical era code for each calendar being very predictable (it is the name of the calendar, or the name of the calendar with -inverse in the inverse case), and then aliases that are more user-friendly (like "bc" etc).

We should consume all of these (date construction should not care which one you provide). Should we produce the canonical ones, or are we okay with producing aliases? AIUI Temporal plans to always produce canonical ones.

cc @sffc @zbraniecki

sffc commented 11 months ago

FormattableYear could have non-canonical era codes (e.g. maybe all the Islamic cals should return "ah") but when we have a real year API like TemporalYear we should return canonical codes.

Manishearth commented 11 months ago

I'm not sure if we should have Temporal in the type name. Year might be fine though.

Manishearth commented 11 months ago

I'm imagining

enum Year {
   EraYear {
     era: Era,
     year: i32
   },
   CyclicYear {
    cycle: u32,
    related_iso: i32,
    // ?
    extended_year: i32
   }
}
Manishearth commented 11 months ago

We should also decide what we are okay consuming. Some questions:

sffc commented 9 months ago

Discuss with:

Optional:

Manishearth commented 4 months ago

Given https://github.com/unicode-org/icu4x/issues/3821#issuecomment-1962141180, I think we are going to split Month / FormattableMonth in 2.0

Given this, my proposal is:

sffc commented 4 months ago

In a greater level of detail:

Open question: does the trait function icu_datetime::input::DateInput::year change to ::formattable_year?

Manishearth commented 4 months ago

I think that would be good, yes. I don't feel super strongly.

I do think that DateInput may benefit from having its functions renamed to be more obscure so they don't cause conflicts. input_year or something

justingrant commented 4 months ago
  • Do we consume the "fake" formatting month code M06L for hebrew

The Hebrew leap month Adar I is M05L, not M06L. Details: https://github.com/unicode-org/icu4x/issues/3821#issuecomment-1978120761

Or do you mean M06L to imply "Adar II in a leap year" for formatting? This seems admittedly like a slippery slope of complexity. Should Temporal treat M06L as valid, i.e. an alias for M06? Will Temporal.PlainYearMonth be expected to play back M06L from its monthCode getter? How will users know which APIs accept M06L and which don't?

Sorry for not keeping up with the latest ICU4X work, so these might be obvious questions!

Manishearth commented 4 months ago

The Hebrew leap month Adar I is M05L, not M06L. Details: #3821 (comment)

Yes, I'm talking about the formatting code.

Or do you mean M06L to imply "Adar II in a leap year" for formatting?

Yes.

We need to have some identifier for this in our data model. That is an immovable constraint of needing to be able to index this data somehow.

M06L is an obvious choice on the table, I'm happy to pick something else like M06A or something, though I'm not sure what we'll do about stability here (cc @sffc? Stability not a problem with the neo data, anyway.).

Should Temporal treat M06L as valid, i.e. an alias for M06?

No.

This is an identifier used for ICU4X APIs dealing with ICU4X data. The FormattableMonth/FormattableYear APIs are for instructing ICU4X formatters what to do. Temporal implementations shouldn't use these. We can document FormattableMonth as having these quirks.

Now, one open question on the table as stated before is if M06L should be accepted by ICU4X in other contexts. I could go either way but at this point I lean towards "no". I think we should always accept era aliases, but month code aliases are weird and mostly only a formatting thing.

sffc commented 4 months ago

The reason we opened this issue was that we don't want to return M06L in the main APIs that Temporal uses. We will end up with something like .month() that returns the "correct" code and .formattable_month() or .input_month() that returns the "display name lookup" code.

As far as we're concerned, Hebrew could decide to index the display names differently: for example, 0-11 for the regular months, 12 for Adar I, and 13 for Adar II. This would probably be a more efficient representation. Separating .month() from .formattable_month() allows us to make these types of choices.

justingrant commented 4 months ago

Thanks for the context. Are there other known calendars besides Hebrew where the formatting string for a particular month code will vary based on the year?

What about locales? Are there locales that use a particular calendar where the localized name of the month varies based on the year?

The case I had in mind would be calendars where there are "holy months" in some years, or otherwise have months with variations where you need to know more than just the month code to know what string to use for formatting.

Is this a public API exposed by ICU4X to Rust developers? Or is this formatting month code purely an internal implementation detail inside ICU4X?

Separating .month() from .formattable_month() allows us to make these types of choices.

A separate API makes a lot of sense to me.

if M06L should be accepted by ICU4X in other contexts. I could go either way but at this point I lean towards "no". I think we should always accept era aliases, but month code aliases are weird and mostly only a formatting thing.

This sounds right. If you do end up supporting M06L in other public APIs, then Rust developers will have two almost-identical naming schemes that only break for one month in the Hebrew calendar. The likely result of that will be that lots of Rust apps won't work in Hebrew because unless developers think to test in Hebrew, they'll just ship their apps without knowing about the M06L trick and not even knowing that the trick is needed.

If ICU4X does indeed use M06L, then it does make me wonder if Temporal-on-ICU4X should just accept that code, just like aliases for eras. I don't see much harm in it, and it'd prevent crashes if Rust developers write Temporal code later.

sffc commented 4 months ago

Are there other known calendars besides Hebrew where the formatting string for a particular month code will vary based on the year? What about locales? Are there locales that use a particular calendar where the localized name of the month varies based on the year?

Good point; we should probably move this function to LocalizedDateTimeInput so that the locale can customize the display name index.

https://unicode-org.github.io/icu4x/rustdoc/icu_datetime/input/trait.LocalizedDateTimeInput.html

Even in this case, though, the locale needs a mechanism for unambiguously referring to a specific display name.

justingrant commented 4 months ago

the locale needs a mechanism for unambiguously referring to a specific display name.

Yep.

How much do you care about the this unambiguous identifier being fixed-length?

Because if it's really only that one Hebrew month, then you could always support an arbitrary suffix syntax defined by the calendar, e.g. M06-adar or M06-adar2 to handle that weird case, and then come up with other syntaxes for locale overrides or other calendars' weird month localizations.

sffc commented 4 months ago

I'm actually warming up to the idea that we refer to display names by an integer index 0-13 instead of using month codes for that purpose.

Manishearth commented 4 months ago

Are there other known calendars besides Hebrew where the formatting string for a particular month code will vary based on the year?

Debatably, the Hindu calendar with merged/deleted months but that's incredibly weird anyway.

Is this a public API exposed by ICU4X to Rust developers? Or is this formatting month code purely an internal implementation detail inside ICU4X?

It has to be: the formatting and calendar implementations are separate crates, so if everything used by formatting must be publicly accessible.

then Rust developers will have two almost-identical naming schemes that only break for one month in the Hebrew calendar.

I don't really see this as as big a problem provided we document formattable_month correctly as "this is primarily useful for working with icu_datetime data". I don't see many ICU4X users invoking that method.

Because if it's really only that one Hebrew month, then you could always support an arbitrary suffix syntax defined by the calendar, e.g. M06-adar or M06-adar2 to handle that weird case, and then come up with other syntaxes for locale overrides or other calendars' weird month localizations.

Implementation constraints mean that month codes must be up to four ASCII characters. We could do M06A for alternate.


Good point; we should probably move this function to LocalizedDateTimeInput so that the locale can customize the display name index.

We can't, since LocalizedDateTimeInput is from the other crate. I would rather not introduce a doc(hidden) for this.

I'm actually warming up to the idea that we refer to display names by an integer index 0-13 instead of using month codes for that purpose.

I think this might make our data more inscrutable, which isn't great. It does simplify indexing though. Hebrew is the only calendar using LeapLinear.

justingrant commented 4 months ago

Implementation constraints mean that month codes must be up to four ASCII characters. We could do M06A for alternate.

If you're limited to 4 chars, then I'd be OK with M06L, especially if Temporal-on-ICU4X could silently accept M06L as an alias for M06 instead of throwing.

Anyway, now that I think I understand the context and the constraints, here's my thoughts:

Manishearth commented 4 months ago

but there could be for the Hindu calendar

I don't think there would be.

Can we re-use that same logic to resolve formatting code => month code aliasing?

Yeah, probably.

If formatting month codes don't look like Temporal month codes, then no worries from me. Any syntax is OK. Numbers seem fine too. (and only 2 bytes long!)

Seems good. We'll stick to M06L but if Temporal wants to spec this I'm fine with picking anything Temporal picks.

sffc commented 4 months ago

If formatting month codes don't look like Temporal month codes, then no worries from me. Any syntax is OK. Numbers seem fine too. (and only 2 bytes long!)

Seems good. We'll stick to M06L but if Temporal wants to spec this I'm fine with picking anything Temporal picks.

I think this is what @justingrant wanted to avoid: M06L "looks like" a valid Temporal month code for Hebrew, but it's not. By using some other syntax, or integers, for the display names, we avoid the ambiguity.

Manishearth commented 4 months ago

Ah, "doesn't look like a month code"

justingrant commented 4 months ago

Yeah, my point is that IMO it's OK if the formatting code looks like a regular month code, but if it does then we should make sure that Temporal will accept it as an alias without crashing.

Manishearth commented 4 months ago

I'm still not convinced that this ambiguity will actually be a problem for users given that it'll show up for an API external users will basically never use. But if Temporal is willing to do this I have no problem with that