Semantics of locales argument to MessageFormat

jkrems commented 2 years ago

It seems that MessageFormat is fundamentally different from things like NumberFormat because the message text argument is very much locale-specific but doesn't participate in option resolution. Would it be more meaningful to either only accept a single locale in the options or to allow specifying different message templates based on which locale is supported by the host environment?

zbraniecki commented 1 year ago

I don't think I understand your question. You pass a locale list that is used to resolve the formatters used by the message. If you want it to be just one locale, you can pass one locale. (And realize that if it doesn't resolve, the Intl API will fall back on the default, so it is still a two-element list; the second locale is just implicit.)

eemeli commented 1 year ago

MessageFormat is indeed different from the other Intl formatters in that it's not itself a consumer of the locale information passed to it. However, this in fact makes it even more important for its locales argument to accept multiple locales for fallback.

Consider for instance some point in your code where you're currently calling new Intl.NumberFormat().format(), and then including the result in the UI presented to the user. Why should the locale information passed through here be different depending on whether the call is directly from JS code, or via a MessageFormat number formatter? Exactly the same concerns are shared in both cases.

jkrems commented 1 year ago

What I meant specifically are cases where the selected locale directly influences the interpretation of the message itself. For example:

match {$count}
when one {You have one item in your cart for a total of {$totalAmount}}
when * {You have {$count} items in your cart for a total of {$totalAmount}}

const source = ... // string source of the message as above
const mf = new Intl.MessageFormat(source, ['en-CA', 'fr-CA']);
const notifications = mf.resolveMessage({ count: 1, totalAmount: 20 });
notifications.toString(); // 'You have one item in your cart for a total of 20'

// BUT (a separate name avoids redeclaring the `notifications` const)
const emptyCartNotifications = mf.resolveMessage({ count: 0, totalAmount: 0 });
// If locale ended up resolved to en-CA: "You have 0 items in your cart for a total of 0"
// If locale ended up resolved to fr-CA: "You have one [sic!] item in your cart for a total of 0"
emptyCartNotifications.toString();

In other words: Formatting a message written for one locale with a different resolved locale (especially for plural rules) can create very bad artifacts, up to rendering wrong values. I've seen these kinds of bugs in real code. Most developers and translators I've worked with aren't familiar with the finer details of cross-locale formatting issues (understandably).
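The divergence can be reproduced today with Intl.PluralRules alone, since Intl.MessageFormat is only a proposal. This is a minimal sketch, assuming the runtime ships plural data for both English and French; renderCartMessage is a hypothetical helper that mimics the match/when selection of the message above, not part of any API.

```javascript
// Hypothetical stand-in for the message above: picks a variant by the
// plural category that the given locale assigns to `count`.
function renderCartMessage(locale, { count, totalAmount }) {
  const category = new Intl.PluralRules(locale).select(count);
  return category === 'one'
    ? `You have one item in your cart for a total of ${totalAmount}`
    : `You have ${count} items in your cart for a total of ${totalAmount}`;
}

// English plural rules put 0 in the "other" category.
console.log(renderCartMessage('en-CA', { count: 0, totalAmount: 0 }));
// French plural rules put 0 in the "one" category, so the hard-coded
// "one item" text is selected for an empty cart.
console.log(renderCartMessage('fr-CA', { count: 0, totalAmount: 0 }));
```

The same message source produces a wrong statement as soon as the selector runs under French rules, which is the artifact described above.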

zbraniecki commented 1 year ago

> In other words: Formatting a message written for one locale with a different resolved locale (especially for plural rules) can create very bad artifacts, up to rendering wrong values. I've seen these kinds of bugs in real code.

That's correct, but that is the nature of language fallback. What your second snippet shows is analogous to the developer passing:

const source = ... // string source of the message as above
const mf = new Intl.MessageFormat(source, ['fr-CA']);
const notifications = mf.resolveMessage({ count: 0, totalAmount: 0 });
// If locale ended up resolved to en-CA: "You have 0 items in your cart for a total of 0"
// If locale ended up resolved to fr-CA: "You have one [sic!] item in your cart for a total of 0"
notifications.toString();

Your suggestion to only allow a single locale feels to me like a common fallacy that I observe within the localization systems community: the idea that a single locale is better because it doesn't allow for fallback. My claim is that this is true only if you also say that when that locale is not available for any of the subsystems, the API will crash/error out:

const source = ... // string source of the message as above
const mf = new Intl.MessageFormat(source, 'en-CA'); // <-- notice single locale
const notifications = mf.resolveMessage({ count: 0, totalAmount: 0 }); // <-- here PluralRules didn't have data for `en-CA`.
assert(notifications, undefined); // or throw exception

And I have never seen such a system. API designers therefore first limit fallback to avoid mistakes, and then relax the error scenario to avoid exceptions.

If you, therefore, are not ready to error out on misalignment, then, IMO, you are in fact always operating on fallback lists. Your single locale case becomes a list of [locale, DEFAULT_FALLBACK].

The only difference between that case and what we're proposing is that our proposal lets you build a [locale1, locale2, locale3, DEFAULT_FALLBACK] model.
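The implicit [locale, DEFAULT_FALLBACK] behavior can be observed with the existing Intl constructors. The sketch below uses Intl.PluralRules; describeResolution is a hypothetical helper name, and the tag 'tlh' is only assumed to be unsupported by the host.

```javascript
// Reports which locale Intl.PluralRules actually resolved for a single
// requested tag, next to the host's default locale.
function describeResolution(requested) {
  const supported = Intl.PluralRules.supportedLocalesOf([requested]).length > 0;
  const resolved = new Intl.PluralRules(requested).resolvedOptions().locale;
  const hostDefault = new Intl.PluralRules().resolvedOptions().locale;
  return { requested, supported, resolved, hostDefault };
}

// Assumption: the host has no data for 'tlh'. If that holds, the constructor
// does not throw; it silently resolves to the host default instead.
console.log(describeResolution('tlh'));
console.log(describeResolution('en'));
```

Passing one locale therefore never guarantees that this locale is the one used; it only hides the second element of the list.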

Assuming you agree with my position, my recommendation is to start using chained language negotiation. You should never pass fr-CA as an available locale to the snippet you provided, since fr-CA is not a good match for the locale of the message.

What should happen depends on how we design internal language negotiation of MessageFormat.

There are generally two options (with some nuance).

Option 1 is that internal matching is "simple". In such a model, if I pass en-CA and we have PluralRules in [en-US, en-GB, en], we will not match, since we just check if en-CA exists.

If that is the case, then as a consumer of our API you need to get the right fallback:

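const messageLocale = `en-CA`;
// Hypothetical API: Intl.PluralRules has no availableLocales() method today.
const availablePluralLocales = Intl.PluralRules.availableLocales();
// negotiateLanguages stands for a language-negotiation helper supplied by the consumer.
const supportedLocales = negotiateLanguages(messageLocale, availablePluralLocales);
if (supportedLocales.length === 0) {
  // handle lack of any match
}
assert(supportedLocales, ['en-US', 'en-GB', 'en']);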

const source = ... // string source of the message as above
const mf = new Intl.MessageFormat(source, supportedLocales);
const notifications = mf.resolveMessage({ count: 0, totalAmount: 0 });

In this case we "moved" the negotiation completely to the customer side.
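Consumer-side negotiation of this kind can be approximated with APIs that exist today. The following is a rough sketch, not the negotiation algorithm discussed here: selectorSafeLocales is a hypothetical helper, while Intl.Locale and Intl.PluralRules.supportedLocalesOf are standard. Note that supportedLocalesOf applies its own fallback matching (en-CA counts as supported when only en data exists), so it is looser than the "simple" model of Option 1.

```javascript
// Keeps only the fallback candidates that share the message's language and
// that the host reports as supported for plural rules.
function selectorSafeLocales(messageLocale, fallbackChain) {
  const messageLanguage = new Intl.Locale(messageLocale).language;
  const sameLanguage = fallbackChain.filter(
    (tag) => new Intl.Locale(tag).language === messageLanguage
  );
  return Intl.PluralRules.supportedLocalesOf(sameLanguage);
}

const candidates = ['en-CA', 'fr-CA', 'en-GB', 'en'];
const safeLocales = selectorSafeLocales('en-CA', candidates);
if (safeLocales.length === 0) {
  throw new RangeError('No plural rules match the language of the message');
}
// fr-CA is dropped because its language differs from the message's language.
console.log(safeLocales);
```

The resulting list is what would be passed as the locales argument, so a mismatched language such as fr-CA never reaches the selector.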

There's also a "mixed" model, where some level of negotiation happens internally. Where and how is debatable, but we'll need to make such a decision for all transitive Intl APIs. MessageFormat is just the first and likely most complex chained Intl API.

In the mixed model, for example, we could say that lang-script pair has to be provided, but everything else is matched in a relaxed mode.

In such a scenario, it's enough that you pass en-CA: since PluralRules has [en-US, en-GB, en], any of the three can match, so internally the first of the three will be used (and if the negotiation returns en first, then that one will be used).

To visualize the problem, I think it's better to not use en in the case. Because we all implicitly assume that lastFallback is en-?.

So, what if the message is in de-AT and PluralRules does not have de-AT? It may have other de-* locales, and it may match one of them internally, but what if it doesn't have any de-*?

If we do what you suggest, we end up either erroring out, or falling back on some internal lastFallback - probably English, right?

In the model we propose, we allow the customer to inject additional, better fallbacks between the preferred locale and this implicit last fallback (or erroring out).

For example, if the message has a date, we could say "Customer speaks French and German, the message is in French, so please, try to use French to format the date, but if you don't have any French, please, format the date in German, and embed it in this French message. If you don't have either French or German, then do X" where "X" means to either use the last fallback, or error out.
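The date scenario maps directly onto how the existing formatters already treat a locale list. Here is a small sketch using Intl.DateTimeFormat; formatFrenchDateMessage and the French sentence are illustrative assumptions, and in today's Intl the "do X" branch is always the silent default-locale fallback rather than an error.

```javascript
// Formats a date with the first supported locale from `formatterLocales`
// and embeds it in a French message.
function formatFrenchDateMessage(date, formatterLocales) {
  const dateFormat = new Intl.DateTimeFormat(formatterLocales, {
    dateStyle: 'long',
    timeZone: 'UTC',
  });
  return {
    // Which locale from the chain (or the host default) was actually used.
    resolvedLocale: dateFormat.resolvedOptions().locale,
    text: `Votre commande arrivera le ${dateFormat.format(date)}.`,
  };
}

// Prefer French, then German, before the implicit last fallback.
const delivery = formatFrenchDateMessage(new Date(Date.UTC(2024, 0, 15)), ['fr', 'de']);
console.log(delivery.resolvedLocale, delivery.text);
```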

Does that make sense?

In other words, instead of asking whether fr-CA plural rules in an en-CA message lead to a proper message, ask whether an explicitly provided locale2 is a better fallback than the implicit last fallback, or better than erroring out. My argument is that it is, and we should allow developers to provide one.

And if your solution is to use the implicit last fallback, then such a locale2 may be better than the last fallback, which is static, language-independent, and uncontrollable.

eemeli commented 1 year ago

Besides the language fallbacking mentioned by @zbraniecki above, the original message you propose is also a bit problematic:

match {$count}
when one {You have one item in your cart for a total of {$totalAmount}}
when * {You have {$count} items in your cart for a total of {$totalAmount}}

Specifically, the one case here should be a 1 case, as it explicitly refers to "one item" rather than e.g. "{$count} item". Using an exact numeric match would provide a more appropriate result when fallbacking to a different locale.
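The effect of matching on an exact value can be sketched with a small selection helper. This is an illustration only: selectVariant and the variant table are hypothetical, and the lookup order (exact numeric key first, then plural category, then the catch-all) is an assumption about how such matching would work, not the proposal's specified algorithm.

```javascript
const hasOwn = (object, key) => Object.prototype.hasOwnProperty.call(object, key);

// Exact numeric keys win over plural categories; '*' is the catch-all.
function selectVariant(count, locale, variants) {
  const exactKey = String(count);
  if (hasOwn(variants, exactKey)) return variants[exactKey];
  const category = new Intl.PluralRules(locale).select(count);
  return hasOwn(variants, category) ? variants[category] : variants['*'];
}

const cartVariants = {
  1: ({ totalAmount }) => `You have one item in your cart for a total of ${totalAmount}`,
  '*': ({ count, totalAmount }) =>
    `You have ${count} items in your cart for a total of ${totalAmount}`,
};

function formatCart(locale, values) {
  return selectVariant(values.count, locale, cartVariants)(values);
}

// Even under French plural rules, 0 no longer selects the "one item" text.
console.log(formatCart('fr-CA', { count: 0, totalAmount: 0 }));
console.log(formatCart('fr-CA', { count: 1, totalAmount: 20 }));
```

Because the literal "one item" text is keyed on the value 1 rather than on a category, the result stays correct no matter which locale's plural rules end up being used.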

jkrems commented 1 year ago

> Specifically, the one case here should be a 1 case, as it explicitly refers to "one item" rather than e.g. "{$count} item". Using an exact numeric match would provide a more appropriate result when fallbacking to a different locale.

Definitely! In the past, I literally added hard validations to some i18n tooling I maintained to force {$count} and reject translations that omitted it. But unless MessageFormat enforces that invariant, people will do this "incorrectly". In fact: I copied the pattern from this proposal's own README (!), which should demonstrate that even people with lots of knowledge around i18n/l10n get this wrong.

My primary argument here is that creating the appearance that MessageFormat and NumberFormat locale fallbacks are equivalent is dangerous because they are not. One can lead to actively wrong information, the other "just" to slightly misleading formatting of correct information. There's definitely a variety of solutions to this, starting with adding big warning signs that the behavior of things like match {$count} is effectively undefined when the preferred locale(s) aren't available. Requiring that {$count} is present is another solution that at least catches the worst failure modes. Then there's having separate fallbacks for "message-affecting" and "message-inlined" locales or allowing the user to provide different message templates, depending on which fallback locale is picked.
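A validation of the "require {$count}" kind could look roughly like the following. This is a sketch under stated assumptions: variantsMissingPlaceholder is a hypothetical lint helper, variants are modeled as plain strings keyed by their when-clause, and the check is a simple substring test rather than a real parse of the message syntax.

```javascript
// Returns the keys of all variants whose text does not contain `{$name}`.
function variantsMissingPlaceholder(variants, name) {
  const placeholder = '{$' + name + '}';
  return Object.entries(variants)
    .filter(([, text]) => !text.includes(placeholder))
    .map(([key]) => key);
}

const cartMessage = {
  one: 'You have one item in your cart for a total of {$totalAmount}',
  '*': 'You have {$count} items in your cart for a total of {$totalAmount}',
};

// Flags the `one` variant, which hard-codes "one" instead of using {$count}.
console.log(variantsMissingPlaceholder(cartMessage, 'count'));
```

A check this strict would also flag a variant keyed on an exact value such as 1, so real tooling would likely need to exempt exact matches.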

I'm sure there are others (and yes, erroring if none of the explicitly requested locales provides message-affecting data like plural rules is one of them). But as the README of this proposal proves (to me at least), assuming that nobody makes the mistake of baking assumptions about exact plural rules into the message template is a questionable bet.

zbraniecki commented 1 year ago

> In fact: I copied the pattern from this proposal's own README (!) which should demonstrate that even people with lots of knowledge around i18n/l10n get this wrong.

That's a great point. I also want to suggest that we should fix it. PRs welcome :)

> Then there's having separate fallbacks for "message-affecting" and "message-inlined" locales or allowing the user to provide different message templates, depending on which fallback locale is picked.

I think I'm leaning toward such a solution. In our lingo we would call it "selector" vs "formatter". Formatters should operate the way other formatters do, but we should take your feedback and evaluate how selector locale fallback should work.
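One way to picture the selector/formatter split is to give each its own locale list. The sketch below is purely illustrative and not a proposed API: formatCartMessage, selectorLocales, and formatterLocales are hypothetical names, and erroring out when no selector locale is supported is just one of the options discussed in this thread.

```javascript
// Selectors and formatters resolve against separate locale lists.
function formatCartMessage(values, { selectorLocales, formatterLocales }) {
  // Selector: refuse to guess if none of the message-compatible locales is
  // supported, instead of silently using the host default.
  const usableSelectors = Intl.PluralRules.supportedLocalesOf(selectorLocales);
  if (usableSelectors.length === 0) {
    throw new RangeError('No plural rules available for the message language');
  }
  const category = new Intl.PluralRules(usableSelectors).select(values.count);

  // Formatter: ordinary Intl fallback, as with any other formatter.
  const numberFormat = new Intl.NumberFormat(formatterLocales);
  const count = numberFormat.format(values.count);
  const total = numberFormat.format(values.totalAmount);

  return category === 'one'
    ? `You have ${count} item in your cart for a total of ${total}`
    : `You have ${count} items in your cart for a total of ${total}`;
}

console.log(
  formatCartMessage(
    { count: 0, totalAmount: 0 },
    { selectorLocales: ['en-CA', 'en'], formatterLocales: ['en-CA', 'fr-CA'] }
  )
);
```

Here fr-CA can still serve as a formatting fallback, but it can never change which variant of the English message is selected.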

> assuming that nobody makes the mistake of baking assumptions about exact plural rules into the message template is a questionable bet.

Please, do not imply what has not been stated. Nobody claims that.

> My primary argument here is that creating the appearance that MessageFormat and NumberFormat locale fallbacks are equivalent is dangerous because they are not.

I agree with your problem statement. I disagree with your proposed solution. My argument against the latter is that using a single locale is an illusion: you implicitly fall back on another one anyway, which may be in a different language from the one you requested and leads to the same problem you described. Hence, we either need to resolve selector locale fallback, or error out on mismatch.

eemeli commented 10 months ago

Revisiting and re-reading this, I'm tempted to conclude that we should continue to support an array locales value, in order to support locale fallback chains like en-CA → en-GB → en, i.e. where an automated fallback en-CA → en could be suboptimal. I agree that there are differences in how fallback could cause greater problems in selectors compared to formatters, but there too being able to customize the fallback can be at times really useful.

Also, as relevant prior art, Intl.PluralRules accepts an array of locale codes, and its primary use is to empower selectors.

tc39 / proposal-intl-messageformat

Semantics of locales argument to MessageFormat #18