Separation of language and formatting locale

MickMonaghan commented 4 years ago

Is your feature request related to a problem? Please describe.

Some message formatting APIs only allow you to make a single choice with respect to language/locale. This single choice is used both for string retrieval and for formatting any locale-sensitive placeholders within those strings. I'd like to see the ability to provide two independent language/locale choices to the API: one used for string retrieval, the other for locale-sensitive formatting.

Describe the solution you'd like

The API accepts UI language and formatting locale as independent variables.

Describe why your solution should shape the standard

Architecturally, UI string retrieval and placeholder formatting are completely separate functions. We should implement it that way. If API consumers do not want to expose this flexibility to their users, that's fine, they don't have to. Where a locale formatting choice has not been provided, we can fall back to the language choice.

Additional context or examples

This may result in mixed-language UIs, where a string is translated per the language choice and, say, a date placeholder is formatted according to the locale choice. That's fine, so long as the user's expectations are properly managed and they are not surprised.

Most OSs allow for this type of separation.
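
As a rough illustration of the request, here is a minimal sketch. The `catalog` object and the `formatToday` helper are hypothetical names invented for this example, not part of any existing API; only `Intl.DateTimeFormat` is a standard built-in. The string is chosen by UI language, the placeholder is formatted with a separate formatting locale, and the formatting locale falls back to the UI language when it is not provided.

```javascript
// Hypothetical message catalog keyed by UI language (illustration only).
const catalog = {
  en: "Today is {date}",
  de: "Heute ist der {date}",
};

// Hypothetical helper: `uiLanguage` selects the string, `formattingLocale`
// formats the placeholder. When no formatting locale is given, it falls
// back to the UI language.
function formatToday(date, { uiLanguage, formattingLocale = uiLanguage }) {
  const template = catalog[uiLanguage];
  const formatted = new Intl.DateTimeFormat(formattingLocale, {
    dateStyle: "short",
    timeZone: "UTC", // fixed zone so the example is reproducible
  }).format(date);
  return template.replace("{date}", formatted);
}

const someDate = new Date(Date.UTC(2020, 2, 18));

// Single choice: English string, English date format.
console.log(formatToday(someDate, { uiLanguage: "en" }));

// Two independent choices: English string, German date format.
console.log(formatToday(someDate, { uiLanguage: "en", formattingLocale: "de" }));
```

Consumers who do not want to expose the second choice simply never pass `formattingLocale`.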

zbraniecki commented 4 years ago

If we want to respect POSIX, we may want to separate different fallback locale chains for different formatters (LC_TIME vs LC_MESSAGES vs LC_NUMERIC vs LC_COLLATE etc.)

I'd also like us to carry chains, never single locale, to allow for better fallbacking.

nbouvrette commented 4 years ago

Could this be solved by a very flexible fallback as well? Referring back to @mihnita's comment:

Basically once you understand that there is no fallback across scripts (so no zh-TW / zh-CN, or sr-Cyrl / sr-Latn), things "just work", no need to think about it.

mihnita commented 4 years ago

I've always thought that the POSIX model "slices" things the wrong way. Why would I want time and numeric to use different locales? Or if the numeric is Arabic (with "real" Arabic digits) and time is French, what kind of digits would I use? On the other hand, it is missing more important distinctions (speech vs. written text, for example).

In general I consider the POSIX model outdated, and I would not look at it for a model.

zbraniecki commented 4 years ago

Why would I want time and numeric to use different locales?

People have weird preferences esp. around date/time. They want German translation but with en-US date/time or reverse.

Or if the numeric is Arabic (with "real" Arabic digits) and time is French, what kind of digits would I use?

Great question. My assumption is that you format using eastern arabic numerals and French date/time pattern.

In general I consider the POSIX model outdated, and I would not look at it for a model.

I tend to agree. At the same time, I don't think it would be terrible for us to consider allowing people to specify some locale fallback chains that are intended for particular formatters.

mihnita commented 4 years ago

They want German translation but with en-US date/time or reverse.

I've never seen that. Who would want to see "Remember that you have an appointment on Samstag, 15. Februar 2020"? What I've seen are "mixed" preferences for 12h / 24h, the day-month-year order, first day of the week, what counts as the weekend, etc. Or (something that I don't think even ICU can do) forcing the order of the date fields (en-US, but with d MMMM, y dates).

Stuff that is really covered under the -u- extension, not the "locale proper" Sure, it is still technically locale... we can put it that way :-)

Great question. My assumption is that you format using eastern arabic numerals and French date/time pattern.

And what about the "am/pm"? Would that be French (am/pm), or Arabic (ص / م)? :-)

Mixing things can be pretty dangerous. Imagine I see a Danish date (dd.mm.(yy)yy) in the middle of an Italian message (Italian use . as a time separator, so 11.12.20 is 11:12:20 am)

What I've seen working relatively well is a "separation" that keeps the lang+script the same between messages and formatters. So you can have es-CL-u-hc-h24 (Spanish-Chile, 24h format) for formatting and es-419 (Latin-American Spanish) for messages.

Or ar-MA-u-ca-islamic-fw-sun-nu-latn-tz-uslax formats with ar-001 messages (for a Moroccan Arab living in California (uslax time zone) who wants the Islamic calendar, with Latin digits and Sunday as first day of week)
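
To make the "-u- extension vs. locale proper" distinction concrete, here is a small sketch using the standard `Intl.Locale` class. It only reads `baseName`, `language`, `calendar`, `numberingSystem`, and `hourCycle`; the `fw` and `tz` keywords stay in the tag but are not inspected here. The variable names are just for illustration.

```javascript
// Formatting locale with preferences carried in the -u- extension.
const formattingTag = new Intl.Locale("ar-MA-u-ca-islamic-fw-sun-nu-latn-tz-uslax");
// Message locale: same language, no formatting preferences.
const messageTag = new Intl.Locale("ar-001");

console.log(formattingTag.baseName);        // the "locale proper", without the extension
console.log(formattingTag.calendar);        // value of the "ca" keyword
console.log(formattingTag.numberingSystem); // value of the "nu" keyword

// Messages and formatters keep the same language even though the tags differ.
console.log(formattingTag.language === messageTag.language);

// Same idea for the Spanish example: the 24h preference is an extension keyword.
const chileTag = new Intl.Locale("es-CL-u-hc-h24");
console.log(chileTag.baseName, chileTag.hourCycle);
```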

This is why Android splits things in two steps: language negotiation followed by fallback.

Negotiation happens once, when the application starts, going through the full list of user languages (an intersection between the locales I say I understand and the ones that have localized resources).

Language fallback happens for every single attempt to load resources. Resource loading fallback stays "in the same language", for instance es-CL-u-hc-h24 => es-CL => es-419 => es-MX => es-US => es => es-ES => es-* => root (es-MX and es-US are legacy, from a time when Android did not support 3-digit region codes). You might end up with an es-CL license agreement, es-419 messages, some es images, and root for other images and styles. But you will never end up with Portuguese strings, even if Portuguese was specified in the list of locales that the user understands.

Maybe not ideal, but it reduces (eliminates?) confusion.
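
The per-resource fallback step can be sketched roughly as follows. This is a simplified illustration, not Android's actual algorithm: `PARENTS`, `fallbackChain`, `loadResource`, and the `resources` table are hypothetical, the parent table is hard-coded instead of coming from CLDR data, and the legacy and sibling steps (es-MX, es-US, es-ES, es-*) are left out.

```javascript
// Hypothetical parent overrides; real systems take these from locale data.
const PARENTS = { "es-CL": "es-419", "es-419": "es" };

// Build a fallback chain that never leaves the original language.
function fallbackChain(tag) {
  const chain = [tag];
  let current = new Intl.Locale(tag).baseName; // drops the -u- extension
  if (current !== tag) chain.push(current);
  while (PARENTS[current] || current.includes("-")) {
    current = PARENTS[current] || current.slice(0, current.lastIndexOf("-"));
    chain.push(current);
  }
  chain.push("root");
  return chain;
}

// Look up one resource by walking the chain; runs on every load attempt.
function loadResource(key, tag, table) {
  for (const locale of fallbackChain(tag)) {
    const bundle = table[locale];
    if (bundle && key in bundle) return bundle[key];
  }
  return undefined;
}

// Hypothetical resources: Portuguese exists but is never reached from es-CL.
const resources = {
  "es-CL": { license: "es-CL license" },
  "es-419": { greeting: "es-419 greeting" },
  pt: { greeting: "pt greeting", onlyPortuguese: "pt string" },
  root: { logo: "root logo" },
};

console.log(fallbackChain("es-CL-u-hc-h24").join(" => "));
console.log(loadResource("license", "es-CL-u-hc-h24", resources));
console.log(loadResource("greeting", "es-CL-u-hc-h24", resources));
console.log(loadResource("logo", "es-CL-u-hc-h24", resources));
```

Because the chain is derived from a single negotiated locale, a string that exists only in Portuguese is simply not found, which is the "no surprise" property described above.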

asmusf commented 4 years ago

On 2/14/2020 5:15 PM, Mihai Nita wrote:

I've always thought that the POSIX model "slices" things the wrong way. Why would I want time and numeric to use different locales?

The question to ask is: why would you make that impossible? I mean, perhaps the answer is that this is rare enough that you can require the definition of a one-off "locale" that combines the two formatting options.

That would have the advantage that you could explicitly resolve issues like:

Or if the numeric is Arabic (with "real" Arabic digits) and time is French, what kind of digits would I use?

As to who would want to do something like that, I can't speculate, but the kinds of scenarios that I could imagine would not be localization of consumer products, but perhaps something that's an in-house app for some large organization? Who knows. But the minute you rule something out altogether, you have to prove a negative, I think.

Doesn't have to mean that your model should assume such mixtures as the standard scenario.

mihnita commented 4 years ago

why would you make that impossible?

Because it adds complexity to the spec & implementation with no good benefit. Worse, it makes things more error-prone.

Basic API design: "Easy to use correctly, hard to use incorrectly" See cases 2 and 3 here: https://github.com/unicode-org/message-format-wg/issues/43#issuecomment-586635081

For extreme cases you can create your own formatter outside the MessageFormat with whatever locale you want.

But MessageFormat only needs 1 or 2 locales:

1) to create the formatters as needed
2) to load strings referred to by a message (I would argue that this does not belong in MessageFormatter itself, but in something like a ResourceManager that can be "bound" to the MessageFormatter)

mihnita commented 4 years ago

Another argument against the POSIX style: where do we stop?

Why LC_TIME, but not LC_DATE?

And what about LC_LIST_FORMAT, LC_DURATION, LC_INTERVAL, LC_MEASUREMENT, LC_RANGE, LC_PHONE_NUMBER? What about my own data types, where I can provide a formatter (for instance FormatPersonName that would return "Mihai" for me, but "Mr. Tanaka" for a Japanese)?

nbouvrette commented 4 years ago

Another argument against the POSIX style: where do we stop?

+100

I think it's very important to set boundaries as there are cultural conventions that simply do not apply in other languages. I think that for these cases it might be simpler to have language-specific strings without trying to solve this with the syntax itself.

zbraniecki commented 4 years ago

I've never seen that.

You may want to skim through, list of bugs like:

Generally speaking there are three groups of our users who are seeking some customization:

1) Group that wants translation in locale X and date/time in locale Y (German Fx with en-US date/times)
2) Group that wants translation in locale X and date/time in a different regional preferences of the same locale (en-US Firefox with en-AU date/time)
3) Group that wants some form of manual date/time patterns that they manually selected in their OS (dateStyle long to be of pattern X)

The last group was my reason for introducing the dateStyle/timeStyle proposal, because it allows me to override the pattern with customization from the OS. Group (2) vs (1) is the reason that in Gecko I use the regional preferences locale only if the language of that locale matches the language of the translation, see https://firefox-source-docs.mozilla.org/intl/locale.html#regional-preferences
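
The Gecko rule described above can be sketched in a few lines. This is an illustration of the rule as stated in this comment, not Gecko's actual code; `pickFormattingLocale` is a hypothetical name.

```javascript
// Use the OS regional-preferences locale for formatting only when its
// language matches the language of the translation; otherwise keep the
// UI locale so strings and dates stay in one language.
function pickFormattingLocale(uiLocale, regionalPrefsLocale) {
  const ui = new Intl.Locale(uiLocale);
  const regional = new Intl.Locale(regionalPrefsLocale);
  return ui.language === regional.language ? regionalPrefsLocale : uiLocale;
}

console.log(pickFormattingLocale("en-US", "en-AU")); // group (2): "en-AU"
console.log(pickFormattingLocale("de", "en-US"));    // group (1): "de"
```

Under this rule group (2) is served by default, while group (1) needs an explicit override.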

Group (1) is the biggest outlier, and they're a pretty vocal minority who likes the inconsistency that I described in the documentation (Today is 24 października).

If there's a way to make our API accept some override for locale of date/time only, I think we'd allow for the flexibility needed for all those groups.

Another argument against the POSIX style: where do we stop?

In my experience - date/time. I haven't seen any requests for anything else.

mihnita commented 4 years ago

I think that group 1 can be accommodated with something like creating a formatter and setting it for a certain placeholder:

mf = new MessageFormat("Today is {theDate}", "en");
df = new DateFormat(..., "pl");
mf.setFormatter("theDate", df);

The idea would be to make the most common cases very easy to support, without preventing one from supporting the "weird cases" if they want. But I would not bend over to make it as friendly as the defaults. If you make all things equally easy, then the developer has no good guidance on what is "the right thing".
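
For readers who want to try the idea, here is a runnable sketch of the per-placeholder override. `SimpleMessage` and `DATE_OPTIONS` are hypothetical stand-ins for the `MessageFormat` pseudocode above; only `Intl.DateTimeFormat` is a real built-in.

```javascript
// Date options shared by the default path and the override (an assumption
// made for this example so both paths are comparable).
const DATE_OPTIONS = { day: "numeric", month: "long", timeZone: "UTC" };

// Hypothetical minimal message formatter, for illustration only.
class SimpleMessage {
  constructor(pattern, locale) {
    this.pattern = pattern;
    this.locale = locale;
    this.overrides = new Map();
  }

  // Bind a specific formatter object to one placeholder name.
  setFormatter(name, formatter) {
    this.overrides.set(name, formatter);
  }

  format(args) {
    return this.pattern.replace(/\{(\w+)\}/g, (match, name) => {
      const value = args[name];
      const override = this.overrides.get(name);
      if (override) return override.format(value); // the "weird case"
      if (value instanceof Date) {
        // Default: dates follow the message locale.
        return new Intl.DateTimeFormat(this.locale, DATE_OPTIONS).format(value);
      }
      return String(value);
    });
  }
}

const mf = new SimpleMessage("Today is {theDate}", "en");
const df = new Intl.DateTimeFormat("pl", DATE_OPTIONS);
mf.setFormatter("theDate", df);

// English string with a Polish-formatted date.
console.log(mf.format({ theDate: new Date(Date.UTC(2020, 9, 24)) }));
```

The default path needs no extra calls; the mixed-locale result takes one explicit `setFormatter` call per placeholder, which is the intended friction.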

asmusf commented 4 years ago

On 2/17/2020 12:29 PM, Mihai Nita wrote:

The idea would be to make the most common cases very easy to support, without preventing one from supporting the "weird cases" if they want.

That sounds like the approach I was advocating, although the devil is in the details.

If some (vocal) minority of users wants some capability, is that going to impact coding for every single message, or is this something that can be managed more globally for an app, for example. I wasn't quite sure from your example.

But I would not bend over to make it as friendly as the defaults. If you make all things equally easy, then the developer has no good guidance on what is "the right thing".

Having an obvious "right thing" is useful. But from some of the examples given so far, it looked like there were competing ways of getting the correct/same result. You may still need to publish good guidance.

zbraniecki commented 4 years ago

I agree with @mihnita - good defaults should be rewarded with easy API use. Outliers may be enabled and their developer UX doesn't have to be pretty. For the case of group (1) we could for example do sth like:

let bundle = new Bundle(["de", "en-US"]);

bundle.setFormatterLocaleChain("DATETIME", ["pl", "ru", "en-US"]);

bundle.formatPattern("Today is { $date }", {
  date: new Date()
});
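
A runnable approximation of this shape, under stated assumptions: the `Bundle` class below is a hypothetical toy, not Fluent's or any proposed API, and it only handles `Date` arguments. It relies on the fact that `Intl.DateTimeFormat` accepts an array of locales and negotiates among them itself.

```javascript
// Hypothetical toy Bundle, loosely following the API shape sketched above.
class Bundle {
  constructor(localeChain) {
    this.localeChain = localeChain;
    this.formatterChains = new Map();
  }

  // Register a locale chain that applies to one kind of formatter only.
  setFormatterLocaleChain(kind, chain) {
    this.formatterChains.set(kind, chain);
  }

  // Resolve the chain for a formatter kind, defaulting to the message chain.
  chainFor(kind) {
    return this.formatterChains.get(kind) || this.localeChain;
  }

  formatPattern(pattern, args) {
    return pattern.replace(/\{\s*\$(\w+)\s*\}/g, (match, name) => {
      const value = args[name];
      if (value instanceof Date) {
        return new Intl.DateTimeFormat(this.chainFor("DATETIME"), {
          dateStyle: "long",
          timeZone: "UTC", // fixed zone so the example is reproducible
        }).format(value);
      }
      return String(value);
    });
  }
}

const bundle = new Bundle(["de", "en-US"]);
bundle.setFormatterLocaleChain("DATETIME", ["pl", "ru", "en-US"]);

console.log(bundle.formatPattern("Today is { $date }", {
  date: new Date(Date.UTC(2020, 9, 24)),
}));
```

Without the `setFormatterLocaleChain` call, dates follow the bundle's own chain.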

mihnita commented 4 years ago

although the devil is in the details

Isn't that always the case :-)

Having an obvious "right thing" is useful. But from some of the examples given so far, it looked like there were competing ways of getting the correct/same result. You may still need to publish good guidance.

True, probably guidance is a must no matter what. But the APIs can also encourage one to do the right thing.

Taking the 3 ways of doing the same thing here: https://github.com/unicode-org/message-format-wg/issues/43#issuecomment-586635081

I think that case 1 is the easier one to read and write. It is also the "best recommended practice", and not "it happens to", but by design.

I think (but this is just my opinion) that Zibi's example above (with setFormatterLocaleChain) makes it too easy to do the wrong thing.

If you want to do something weird it should not be easy, but discouraged by guidelines. Should be hard(er), you should jump through (some) hoops :-)

dchiba commented 4 years ago

It is essential for an internationalized application to respect the local format that the user is accustomed to. This is why all OSs have a locale setting that is used in formatting a date/time/number value by default. Separation of language and formatting locale is required just to meet this basic requirement.

Let’s consider an international shipment tracking by a German user, whose package was shipped from a US based company. The German user comes to the US company’s website to find out when his package was shipped. The message template could be:

Ship date: {shipDate,date,short}

If the website supported German, this message should be presented fully in the German convention using a German template and the German formatting locale:

Versanddatum: 18.3.2020

If German was not supported, the German user must use an alternate UI language. Let's say it's English and the German user may prefer:

Ship date: 18.3.2020

to:

Ship date: 3/18/2020

Notice that the message string is English, while the date format matches the German locale preference. This behavior is consistent with the way all OSs behave by default for localizing date, time, number and any other locale sensitive conventions.

Some applications fail to respect the user’s formatting locale because they negotiate the UI locale once and then apply it to all locale sensitive operations. That is a problematic practice. Say, an application shows news articles from international sources in various languages. Then the articles should be filtered based on the user's acceptable languages. If the same article was available in different languages, then the best available language should be used, regardless of the application's UI language. If an application took the negotiate-once-use-it-everywhere approach, the user experience can be significantly degraded due to the failure to respect the user's locale preference.

It is also noteworthy that many applications don’t support as many languages as they wish, so the UI language could be chosen from the user’s second languages or simply the default, which is often English, and the English locale is based on (fairly unique) American conventions. This is another reason formatting locale needs to be identified separately from language.

The message template could include the number of packages, in which case it would be essential to handle pluralization based on language, while formatting the number or datetime in formatting locale.

Ship date: {shipDate,date,short} ({count,plural,one {# package} other {# packages}})

In US locale, "1,000 packages" is expected, but German users would prefer "1.000 Pakete" (Notice the comma has changed to a dot.) or "1.000 packages" if German translation was unavailable.
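
The split between "pluralize by language" and "format by formatting locale" can be sketched with the standard `Intl.PluralRules` and `Intl.NumberFormat` built-ins. The `packageStrings` table and the `formatPackages` helper are hypothetical names for this example.

```javascript
// Hypothetical English strings, keyed by plural category.
const packageStrings = { one: "{n} package", other: "{n} packages" };

function formatPackages(count, { messageLanguage, formattingLocale }) {
  // The plural category comes from the language of the message...
  const category = new Intl.PluralRules(messageLanguage).select(count);
  const template = packageStrings[category] || packageStrings.other;
  // ...while digits and separators come from the formatting locale.
  const n = new Intl.NumberFormat(formattingLocale).format(count);
  return template.replace("{n}", n);
}

// English string with US number formatting.
console.log(formatPackages(1000, { messageLanguage: "en", formattingLocale: "en-US" }));

// English string with German number formatting (German translation unavailable).
console.log(formatPackages(1000, { messageLanguage: "en", formattingLocale: "de" }));
```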

Similarly, there are other reasons to separate language and formatting locale. The Indian numbering system is preferred by Indian users and the Thai calendar year is expected by Thai users, to name a few.

In summary, local conventions should be honored, regardless of the application's UI language. (I don't mean to uphold the POSIX model. I advocate separating language and formatting locales and negotiating as many times as needed.)

zbraniecki commented 4 years ago

If German was not supported, the German user must use an alternate UI language. Let's say it's English and the best output for this user would be:

Ship date: 18.3.2020

The position that this (English string, German date) is the best output is not universally agreed upon among this group. Please, don't present it as if it was.

In particular, our experience at Mozilla (see my comment above with the list of bugs) is that this is a very fragile and subjective area of UX where we likely cannot design a "perfect solution for everyone", and multilingual users (not even in fallback scenarios) will strongly differ in their preferred outcome.

To illustrate it using your example, the difference between March 6th 2020 and June 3rd 2020 can be impossible to deduce from the formatted string ("03/06/2020" vs "06/03/2020") and depends on the language/locale. In such a case, users may attempt to deduce the element order from the surrounding string and misread the date. It happened to us in the context of error certificates UX.
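
The ambiguity is easy to reproduce with the standard `Intl.DateTimeFormat`. The sketch below picks en-US and en-GB as example locales (my choice for illustration; the thread does not name them) and formats two different dates.

```javascript
// All-numeric date options; UTC keeps the example reproducible.
const numericDate = { year: "numeric", month: "2-digit", day: "2-digit", timeZone: "UTC" };

const march6 = new Date(Date.UTC(2020, 2, 6));
const june3 = new Date(Date.UTC(2020, 5, 3));

const usText = new Intl.DateTimeFormat("en-US", numericDate).format(march6); // month first
const gbText = new Intl.DateTimeFormat("en-GB", numericDate).format(june3);  // day first

// Two different dates in two different locales; compare the rendered text.
console.log(usText, gbText, usText === gbText);
```

When the two strings are equal, a reader has only the surrounding text to guess the field order from, which is the misreading risk described above.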

For that reason, I'd argue that at the edge there are two types of users - users who want {english string, german date} and users who want {english string, english date}. Without some flexibility of the API, we can't cater to both.

Another area of possible confusion is unit names. I don't remember it offhand, but there are some cases where a unit symbol in one locale is the same as another unit's symbol in another. In such a case, presenting {english string, french unit} may be indistinguishable from {english string, english different_unit}.

And finally, short abbreviations of weekday names overlap very often, so Your alarm is on for: mo, tu, fr can be very confusing if weekday names are formatted in a different locale than the preceding message.

Some applications fail to respect the user’s formatting locale because they negotiate the UI locale once and then apply it to all locale sensitive operations. That is a problematic practice.

Agree. Hence my suggestion to always carry fallback chains so that on each level negotiation can be performed.

The message template could include the number of packages, in which case it would be essential to handle pluralization based on language, while formatting the number or datetime in formatting locale.

Also agree. An issue we currently face at Mozilla is that our JS engine has one "default locale" which has to be used for date format and for pluralization, and we are working our way toward separating those for that reason.

In summary, local conventions should be honored, regardless of the application's UI language.

Agree. I think this is an area of customization that we should aim to provide good-enough defaults, but recognize our inability to provide perfect defaults, and allow for alternative models via options.

asmusf commented 4 years ago

There's also the issue of websites presenting units based on local custom, determined by the location of the accessing device.

Weather forecast may be shown in Fahrenheit before going on a trip in Europe, and in Celsius while there (same European website).

Can't remember whether that went with change in UI language for the website or not.

dchiba commented 4 years ago

The position that this (english string, german date) is not universally agreed upon among this group. Please, don't present it as if it was.

I agree it's misleading. It's revised to "the German user may prefer: Ship date: 18.3.2020 to: Ship date: 3/18/2020".

In some cases, it is desirable to use the same locale for both language and formatting, as Mihai raised in his example. In other cases it is desirable to separate them, and I would like this standard to support all the common cases, whether they call for the same locale or for different locales. I agree to provide good-enough defaults with flexibility to use alternates via options.

In my understanding, desired units are generally deducible from the user's home locale. The application's context could call for special handling, in which case the default convention should be overridden. For instance, it may be appropriate for a mobile weather forecast app to show both Fahrenheit and Celsius if the current location's convention is different from the one the user is accustomed to. As Zibi noted, the desired behavior may depend on the surrounding elements.

I think the API should allow the application to optionally specify fine elements of the user locale to cater for the personal preferences that could be different from the locale defaults. For good user experience it is very important to use the local conventions that the user is accustomed to. Respecting user's home timezone is particularly important (#25) because it is impossible to deduce it from the locale and it is often unacceptable to present a date or time in a wrong timezone.
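
A small sketch of why the time zone has to be an explicit input: the same instant and the same formatting locale produce different text depending on the zone. `formatForUser` is a hypothetical helper; `timeZone`, `dateStyle`, and `timeStyle` are standard `Intl.DateTimeFormat` options.

```javascript
// The time zone is passed explicitly; it is not derived from the locale tag.
function formatForUser(instant, { formattingLocale, timeZone }) {
  return new Intl.DateTimeFormat(formattingLocale, {
    dateStyle: "short",
    timeStyle: "short",
    timeZone,
  }).format(instant);
}

const instant = new Date(Date.UTC(2020, 2, 18, 23, 30));

// Same locale, same instant, two home time zones.
console.log(formatForUser(instant, { formattingLocale: "de", timeZone: "Europe/Berlin" }));
console.log(formatForUser(instant, { formattingLocale: "de", timeZone: "America/Los_Angeles" }));
```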

dchiba commented 4 years ago

Apache MyFaces is an excellent example of a Web application framework that meets this requirement. Its LocaleContext class provides getFormattingLocale() as well as getTranslationLocale().

aphillips commented 1 year ago

This looks like it might be duplicated by #426?

aphillips commented 1 year ago

Closing in favor of #426

unicode-org / message-format-wg

Separation of language and formatting locale #29