Support variable info not in message patterns

unicode-org / message-format-wg

Developing a standard for localizable message strings

Support variable info not in message patterns #98

Closed echeran closed 1 year ago

echeran commented 4 years ago

Sometimes, a variable piece of information that affects the translation or formatting of a message pattern may not be naturally represented as a regular "printable" placeholder -- a placeholder that occupies a position within the message pattern. This information may be known at the time the message pattern is created, so we should represent it as part of the message somehow. I think that issue #33 about denoting whether a message is for print or for speech might be a good example of this.

I think we can store this type of information at the level of the message, but outside of the message pattern in which the "printable placeholders" occur. More specifically, I think we can re-use the concept of placeholders and treat these placeholders as "non-printing". Doing so could help during the selection phase of a multi-select message, esp if the message already has printing placeholders (ex: when {MEDIUM=SPEECH, COUNT=MANY}, the multi-select might return ["Hey y'all, " {COUNT} " is a boatload."] ).

This decision would help inform the shape of the data model.

More context on the text/speech problem is in issue #33 filed by @grhoten. I think @mihnita may have had ideas of other examples.

echeran commented 4 years ago

This issue came up as a question from @mihnita during today's meeting. Feel free to correct anything that I didn't capture quite right.

aphillips commented 4 years ago

A common enough example of this in the current MessageFormat is the SelectFormat: an enumerated data value is used to select between messages (gender is a common example, but it can be anything really, which is its own problem). In our (Amazon's) proprietary format we force these to use "complete thought" strings (which can be nested, including interleaved with plural). Something like:

      "someRandomMessageId": {
        "param": "messageType",
        "selectItems": {
          "email": "Hello {name}, you have new emails in your inbox.",  
          "notification": "Hello {name}, you have new notifications in your inbox.",
          "other": "Hello {name}, you have new items in your inbox."
        }
      }

mihnita commented 4 years ago

Yes, my example was for gender, but can be anything.

To me this can be modeled best using a construct similar to the switch in programming languages:

switch([array_of_conditions]) {
   case [array_of_values]: message
   case [array_of_values]: message
   case [array_of_values]: message
   default: message // this is for [other, other, ..., other]
}

For example:

"someRandomMessageId": { // Message
    switch: [GENDER(HOST_GENDER), PLURAL(GUEST_COUNT)]
    cases_map: {
        [female,     1] : "{host_name} invited only one guest to her party" // SimpleMessage
        [female, other] : "{host_name} invited {guest_count} guests to her party" // SimpleMessage
        [male,       1] : "{host_name} invited only one guest to his party"
        [male,   other] : "{host_name} invited {guest_count} guests to his party"
        [other,      1] : "{host_name} invited only one guest to their party"
        [other,  other] : "{host_name} invited {guest_count} guests to their party"
    }
}

I think this is instantly familiar to any programmer.
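
A minimal sketch of how such a switch could be resolved at runtime. The names `selectCase`, `acceptableKeys`, and the `selectors`/`cases` shape are hypothetical and not part of any proposal in this thread; the sketch assumes that an exact key such as `1` or the plural category from `Intl.PluralRules` may match a PLURAL position, and that `other` matches anything.

```javascript
// Hypothetical message shape: selectors name the parameters, cases map tuples to patterns.
const message = {
  selectors: [
    { type: "GENDER", param: "host_gender" },
    { type: "PLURAL", param: "guest_count" },
  ],
  cases: [
    [["female", "1"], "{host_name} invited only one guest to her party"],
    [["female", "other"], "{host_name} invited {guest_count} guests to her party"],
    [["male", "1"], "{host_name} invited only one guest to his party"],
    [["male", "other"], "{host_name} invited {guest_count} guests to his party"],
    [["other", "1"], "{host_name} invited only one guest to their party"],
    [["other", "other"], "{host_name} invited {guest_count} guests to their party"],
  ],
};

// For one selector, list the keys that the parameter value is allowed to match.
function acceptableKeys(selector, value, locale) {
  if (selector.type === "PLURAL") {
    const category = new Intl.PluralRules(locale).select(Number(value));
    return [String(value), category, "other"];
  }
  return [String(value), "other"];
}

// Return the pattern of the first case whose tuple matches position by position.
function selectCase(msg, params, locale = "en") {
  const accepted = msg.selectors.map((s) => acceptableKeys(s, params[s.param], locale));
  const hit = msg.cases.find(([keys]) => keys.every((k, i) => accepted[i].includes(k)));
  return hit ? hit[1] : undefined;
}

// Replace {name} placeholders with parameter values.
function format(msg, params, locale = "en") {
  const pattern = selectCase(msg, params, locale);
  return pattern.replace(/\{(\w+)\}/g, (_, name) => String(params[name]));
}
```

Because this sketch takes the first matching case, the more specific tuples have to be listed before the `other` ones; a real implementation might sort or rank the cases instead.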

If we adopt it because of this, then it is clear what goes where:

- The parameter names (HOST_GENDER, GUEST_COUNT) and the selection types (GENDER, PLURAL) go at the Message level, not in the keys.
- The default is a complete tuple of [other...], and it is a fixed convention. In programming we don't somehow tag a case as default; we have a default "branch".

It is also easy to check algorithmically what the missing cases are and to add them in languages that need them (by copying from the [other...] case).

Taking a language with no gender and no numbers as source (let's say Chinese) one can create this minimal message:

"someRandomMessageId": {
    switch: [GENDER(HOST_GENDER), PLURAL(GUEST_COUNT)]
    cases_map: {
        [other,  other] : "{host_name} invited {guest_count} guests to their party"
    }
}

Because we know the selection types (GENDER, PLURAL), it is easy to determine that the cases to add will be all the combinations of plural cases (language dependent) and gender cases (also language dependent). For plurals, English will require [one, other], Russian [one, few, many, other], etc.
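
The expansion described above can be sketched as follows. The `CATEGORIES` table and the function names are hypothetical; the plural lists are the ones quoted in this comment, while the gender list is only an assumption made for illustration.

```javascript
// Category lists per selector type. Plural values come from the comment above;
// the gender list is an assumption for illustration.
const CATEGORIES = {
  en: { GENDER: ["female", "male", "other"], PLURAL: ["one", "other"] },
  ru: { GENDER: ["female", "male", "other"], PLURAL: ["one", "few", "many", "other"] },
};

// Build every combination of values, one value per selector.
function combinations(lists) {
  return lists.reduce(
    (acc, list) => acc.flatMap((prefix) => list.map((v) => [...prefix, v])),
    [[]]
  );
}

// Add each missing tuple by copying the pattern of the all-"other" case.
function fillMissingCases(selectorTypes, cases, locale) {
  const fallbackKey = selectorTypes.map(() => "other").join(",");
  const fallback = cases.get(fallbackKey);
  if (fallback === undefined) throw new Error("the [other, ...] case is required");
  const filled = new Map(cases); // leave the source message untouched
  const lists = selectorTypes.map((type) => CATEGORIES[locale][type]);
  for (const tuple of combinations(lists)) {
    const key = tuple.join(",");
    if (!filled.has(key)) filled.set(key, fallback);
  }
  return filled;
}

// The minimal source message with only the [other, other] case.
const minimal = new Map([
  ["other,other", "{host_name} invited {guest_count} guests to their party"],
]);
const english = fillMissingCases(["GENDER", "PLURAL"], minimal, "en");
const russian = fillMissingCases(["GENDER", "PLURAL"], minimal, "ru");
```

Here `english` ends up with 3 x 2 = 6 cases and `russian` with 3 x 4 = 12, all initially copied from the `[other, other]` pattern and ready for a translator to adapt.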

Using the parameter + value as keys ({MEDIUM=SPEECH, COUNT=MANY}) means we can mix and match, and the selectors are not "consistent":

{MEDIUM=SPEECH, COUNT=MANY}: msg1
{COUNT=ONE}: msg2
{MEDIUM=TEXT}: msg3
{HOST_GEN=FEMALE}: msg4

What are the missing cases in the message above?

In "traditional programming languages" this is a collection of if ... else if ... else:

if (MEDIUM == SPEECH && COUNT == MANY) {
    return msg1;
} else if (COUNT == ONE) {
    return msg2;
} else if (MEDIUM == TEXT) {
    return msg3;
} else if (HOST_GEN == FEMALE) {
    return msg4;
}

Most linters can check switch constructs and report missing cases (for enums) or a missing default. None can detect and report cases not properly covered by a chain of if ... else if ... else.

asmusf commented 4 years ago

In the example of the "switch" the array of conditions is in the opposite order of gender/count as the "array of values". Oversight? Intentional?

eemeli commented 4 years ago

I think the default-case handling that @mihnita mentions is a different discussion from the rest of this thread?

As I understand it, what we're talking about here is being able to determine a message based not only on its own input parameters, but also on other values. From a developer's point of view, they have some process by which they acquire a function that returns a message:

message({ name: 'Mikko' }) // 'Hello Mikko'

With MF1, all such parameters need to be directly given to the function. But could we add a stage at which common parameters may be defined, that are also available as parameters? Using the example of @aphillips:

const messages = getMessages({ name: 'Mikko' })
const msg = messages['someRandomMessageId']
msg({ messageType: 'email' }) // 'Hello Mikko, you have new emails in your inbox.'

I think this would be a good idea. There are of course questions about scope that need to be addressed; does the identifier used in the message need to make it clear whether the variable is coming from the immediate parameters, or a wider scope? Can a message function be called with a parameter that masks a scope parameter?

I also think that this provides a decent argument for the AST's root not to be a single message, but some form of resource object that can contain not only one or more messages, but also identifiers for expected scope parameters.
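
A minimal sketch of the two-stage idea, assuming a hypothetical `catalog` of message functions and a `getMessages` that binds scope parameters once. Letting per-call parameters mask scope parameters is just one possible answer to the scoping question raised above, not a decision made in this thread.

```javascript
// Hypothetical catalog: each message is a function of its resolved parameters.
const catalog = {
  someRandomMessageId: ({ name, messageType }) => {
    const items =
      { email: "emails", notification: "notifications" }[messageType] ?? "items";
    return `Hello ${name}, you have new ${items} in your inbox.`;
  },
};

// Bind scope parameters once; per-call parameters mask them on name clashes.
function getMessages(scope) {
  const bound = {};
  for (const [id, fn] of Object.entries(catalog)) {
    bound[id] = (params = {}) => fn({ ...scope, ...params });
  }
  return bound;
}

const messages = getMessages({ name: "Mikko" });
const msg = messages["someRandomMessageId"];
```

With this shape, `msg({ messageType: "email" })` picks up `name` from the scope, while `msg({ messageType: "email", name: "Anna" })` masks it for that one call.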

zbraniecki commented 4 years ago

In Fluent we had a concept of context data for quite a while, it was meant to work very similarly to what Eemeli is describing here:

let bundle = new FluentBundle("en", {
  ctxData: { name: "Mikko" }
});
bundle.addResource(res1);
bundle.addResource(res2);

let msg = bundle.getMessage("key1");
bundle.formatPattern(msg.value, {
  type: "email"
}); // "Hello Mikko, you have new emails in your inbox."

It never got traction, and during one of the API remodels we removed it with the intention of adding it back if users asked for it. The users never explicitly requested it, because building the second argument to formatPattern out of some ctxDataObject and per-call arguments was easy enough. I like the feature and I'd be in support of having it because it helps bring consistency and exposes per-context data to users.

aphillips commented 4 years ago

I like the idea of contextual parameters. In practice, your code wouldn't look like the above examples. The customer's name ("Mikko") would be in a context variable or injected. Otherwise you'd just pass it explicitly in the format call (to ensure it is present). Maybe one would need some sort of guardrail to ensure that all of the contextual parameters get loaded with something.

To @eemeli's comment, I think it is hard to separate the resource format from the formatter. ICU's current message format produces an untranslatable mess when you use select or plural (and heaven help those who nest them!). Replacing this with something means making decisions about the priority of features and the structure of the resources, because sticking strictly to APIs seems quite difficult.

mihnita commented 4 years ago

In the example of the "switch" the array of conditions is in the opposite order of gender/count as the "array of values". Oversight? Intentional?

Sorry, oversight. The items in the "tuples" should match count and type.

mihnita commented 4 years ago

The trouble with context / binding is scope (in the programming language), and in general the fact that there is no easy access to those variables. In Java you can use reflection, but it is clunky. In C/C++ it is even worse: the variable names are lost. So you need some special mechanisms just for this...

Many systems (Windows (Win32), Java, Android, macOS & iOS, Qt, others) store all strings in one single resource "bundle" (OK, in Java you have control, you can do what you want, but very few people create one .properties file / class). So all strings are in one single bucket, but there is no reliable way to make sure that variables referenced in the message are in scope. Or even if they are, there is no reliable way to "see" them inside the ResourceManager (or whatever mechanism there is in the platform doing the work). So in most cases all you can do is put everything you need in parameters.

And you can of course put in parameters everything that is useful for rendering the message, not only the visible part. So if the message looks something like this:

{host_gender, select,
   female {{host_name} ... her party.}
   male   {{host_name} ... his party.}
   other  {{host_name} ... their party.}
}

It is the developer's job to store in parameters everything needed (host_gender, host_name), visible or not.

In fact, there is a benefit in that. If you do this:

param.put("host_gender", host.gender);
param.put("host_name", host.name);

you can refactor (rename) the host in your Java code using the IDE tools and the string in resources does not have to be updated.

If you have some "magic binding" or access to the variables of the programming language then this ${host.name} ... her party. breaks when you rename the variable host.

If there is some "magic environment bucket of variables" then you can have it in a Map<...> context; And in parameters you can do param.addAll(context).

Or you can have all kinds of helper methods (have a Context class with newParams that gives you a map where you add some extras):

class Context extends HashMap<String, Object> {
    // Returns a mutable copy of the current context map.
    Map<String, Object> newParams() {
        return new HashMap<>(this);
    }
}

You put all the "global" stuff that you might use in messages in the context, and then you do:

Map<String, Object> params = context.newParams();
params.put("host_name", host.name);
params.put("host_gender", host.gender);
message.format(params);

TLDR: I can't see a mechanism that works across languages to access variables. So you (the dev) have to put what you need in a "bag". Is there any value for the data model to have different "bags" for context and for parameters? Enough to affect the data model? In my opinion it is not.

From the Fluent example I don't think there is conflict, and it is very similar to what I described. The bundle is what I called ResourceManager, and has a context "bag" attached to it. So bundle.formatPattern has access to 2 "bags" of values: the one in the bundle.context and the one passed explicitly (type: "email").

But I don't think that changes the data model in any way. The data model is the way we represent the "parsed" message pattern in memory. Then you take the data model + bag(s) of variables (parameters or parameters+context) and resolve the placeholders. It does not matter if the format gets info from one, two, or 10 bags, or can access the OS environment or the file system. It all happens in "format", when the data model is already resolved (and immutable).

- loadString => loads the string and parses it into some kind of (immutable) data model
- format(data_model, bag(s) of values from all kinds of places) => returns a string (or something with range info, like formatToCharacterIterator in ICU)
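
A minimal sketch of that split, assuming a toy pattern syntax with only `{name}` placeholders. The function names mirror the ones used above, but the part shapes and the "earlier bag wins" lookup rule are assumptions for illustration only.

```javascript
// Phase 1: parse a pattern such as "Hello {name}" into an immutable list of parts.
function loadString(source) {
  const parts = [];
  const re = /\{(\w+)\}/g;
  let last = 0;
  let m;
  while ((m = re.exec(source)) !== null) {
    if (m.index > last) {
      parts.push(Object.freeze({ type: "text", value: source.slice(last, m.index) }));
    }
    parts.push(Object.freeze({ type: "placeholder", name: m[1] }));
    last = re.lastIndex;
  }
  if (last < source.length) {
    parts.push(Object.freeze({ type: "text", value: source.slice(last) }));
  }
  return Object.freeze(parts);
}

// Phase 2: resolve placeholders against any number of bags; earlier bags win.
// Unresolved placeholders are left visible as {name}.
function format(dataModel, ...bags) {
  const has = (bag, key) => Object.prototype.hasOwnProperty.call(bag, key);
  return dataModel
    .map((part) => {
      if (part.type === "text") return part.value;
      const bag = bags.find((b) => has(b, part.name));
      return bag ? String(bag[part.name]) : `{${part.name}}`;
    })
    .join("");
}

const model = loadString("Hello {name}, you have new {type} in your inbox.");
const params = { type: "emails" };
const context = { name: "Mikko", type: "items" };
```

Calling `format(model, params, context)` reads `type` from the explicit parameters and `name` from the context, and the parsed `model` itself never records which bag a value came from.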

zbraniecki commented 4 years ago

I like the idea of contextual parameters. In practice, your code wouldn't look like the above examples. The customer's name ("Mikko") would be in a context variable or injected. Otherwise you'd just pass it explicitly in the format call (to ensure it is present). Maybe one would need some sort of guardrail to ensure that all of the contextual parameters get loaded with something.

I'm not sure if I agree.

I can imagine a fairly complex UI (say, Facebook, Gmail, Firefox UI) that could have contextual information about user's gender and all l10n contexts could use that information to select the variant of any message to work with the information about user (name, gender, age, etc.)

aphillips commented 4 years ago

@zbraniecki Probably I didn't express it that well. We actually have a LocalizationContext object that we use to provide certain kinds of, well, localization-related context to the resource manager. What I was trying to say is that the context will tend to get populated and passed in as an object or reference rather than as discrete values. The values available then are guaranteed to be present, so developers can use them to code messages against without having to ensure that they provision the value. If they can't rely on the value to be present, then they tend to fall back on retrieving and passing the value in directly (indeed, the developer might not even need access to e.g. the customer identity store in their application).

eemeli commented 4 years ago

Is there any value for the data model to have different "bags" for context and for parameters? Enough to affect the data model? In my opinion it is not.

I think @mihnita is right here. This is going back a bit on my earlier comment, but from the point of view of a single message, how does a reference to a variable passed in directly differ from a reference to a context variable? Not necessarily at all.

@echeran's original example was a selector choosing a case if {MEDIUM=SPEECH, COUNT=MANY} matched. Why would it matter in the AST where the values of MEDIUM and COUNT are defined? They may well matter for the API, and the translator/localizer may well have access to a bag of variables that they know are "always" defined, but I don't think this needs to show up in the AST.

stasm commented 4 years ago

In Fluent we had a concept of context data for quite a while, it was meant to work very similarly to what Eemeli is describing here […] It never got traction and during one of the API remodels we removed it with an intention to add back if users ask for it.

I can provide some context about this. We removed it (back when Fluent was L20n) because the implementation we had was opinionated wrt. the reactivity to the mutations of the context data. It would set two-way bindings between the data and the callsites, and then re-translate the callsites when the context data was mutated. The way it worked meant it was challenging to integrate it into codebases which already had their system for managing variable bindings (e.g. MVC frameworks).

As long as the context data is immutable, the examples tend to look great :) It's important to consider the entire lifecycle, however, in particular what changes when the data is mutated.

I agree with @mihnita and @eemeli that this probably shouldn't impact the data model. My own preference would be to only allow variables passed directly to format calls in the MF2.0 API, and let higher-level userland abstractions provide some variable merging capabilities. In other words, MyLocalizationAbstraction.format("hello", {userName: "Mikko"}) would in fact call MF2's format with {...ctxData, userName: "Mikko"} (or {userName: "Mikko", ...ctxData}). MyLocalizationAbstraction can then also manage the reactivity of re-translation.
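
The merge order matters, which a small sketch makes visible. `mf2Format` below is only a stand-in for a low-level formatter that sees one flat bag of variables (it is not the MF2 API), and the `contextWins` option is a hypothetical switch between the two spread orders mentioned above.

```javascript
// Stand-in for a low-level formatter that only sees one flat bag of variables.
function mf2Format(pattern, vars) {
  return pattern.replace(/\{(\w+)\}/g, (_, name) => String(vars[name]));
}

// Hypothetical userland wrapper that owns the context data.
class MyLocalizationAbstraction {
  constructor(patterns, ctxData, { contextWins = false } = {}) {
    this.patterns = patterns;
    this.ctxData = ctxData;
    this.contextWins = contextWins;
  }

  format(id, args = {}) {
    // Later spreads overwrite earlier ones, so the order decides precedence.
    const vars = this.contextWins
      ? { ...args, ...this.ctxData }
      : { ...this.ctxData, ...args };
    return mf2Format(this.patterns[id], vars);
  }
}

const patterns = { hello: "Hello {userName} on {platform}" };
const ctxData = { userName: "Guest", platform: "web" };
const callWins = new MyLocalizationAbstraction(patterns, ctxData);
const ctxWins = new MyLocalizationAbstraction(patterns, ctxData, { contextWins: true });
```

With `{...ctxData, ...args}` the per-call `userName` replaces the context value; with the reverse order the context value silently wins, so a wrapper would need to pick one rule and document it.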

Another avenue is the one we went in modern Fluent, which supports an open list of selectors. Implementations can then define their custom selectors returning the context variables as needed. E.g. in Firefox, the PLATFORM selector returns the name of the user's operating system. This approach doesn't solve the reactivity problem, but arguably that's OK in the case of the name of the platform.
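
A rough sketch of an open selector registry, under the assumption that a selector is simply a named function returning a key. The `registerSelector`/`select` names, the `env` object, and the variant keys are hypothetical and do not reproduce Fluent's actual API.

```javascript
// Registry of selector functions; implementations may add their own.
const selectors = new Map();

function registerSelector(name, fn) {
  selectors.set(name, fn);
}

// Pick the variant named by the selector result, or fall back to "other".
function select(selectorName, variants, args = {}) {
  const fn = selectors.get(selectorName);
  if (!fn) throw new Error(`unknown selector: ${selectorName}`);
  const key = String(fn(args));
  return Object.prototype.hasOwnProperty.call(variants, key)
    ? variants[key]
    : variants.other;
}

// Hypothetical environment owned by the implementation, not by the caller.
const env = { platform: "macos" };

// A PLATFORM-like selector reads from the environment and ignores call arguments.
registerSelector("PLATFORM", () => env.platform);

const settingsLabel = {
  macos: "Preferences",
  windows: "Options",
  other: "Settings",
};
```

The message only names the selector; where the value comes from stays an implementation detail, which is why nothing about context shows up in the message data itself.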

grhoten commented 4 years ago

A lot has been discussed here. So I'll add a few thoughts.

In Siri's usage, we normally don't want to do something like medium=speech. The vast majority of the time, the print and speak lines are the same. If you look at SSML, you're typically just annotating the text to guide the right pronunciation, tempo speed, pauses, prosody and so forth. If you're using CSS, you're just annotating the way it's supposed to be formatted. Annotations or text attributes should be a separate concept from selectors.
In Siri's usage, we have the following types of selectors. I'm going to describe them without influencing you with actual syntax.
- Conditional span: Given the condition of a parameter show or not show a span of text.
- First span: Given a series of spans of text with conditions, execute the first one that matches. There is also a variation of this where multiple adjacent spans with the same condition are picked at random.
- Switch span: This works like a first span, but the comparison is against a single variable. It has to have a default span.
- Random span: This is like the first span, but there is no order of the comparisons.
- Dependent span: This is like a conditional span, but it depends on whether a previous span marked with an identifier was formatted. This is helpful for split verbs, like in German or Cantonese. There are times when the first part of the sentence needs to agree with the end of the sentence, and this helps. Sometimes copying the same text over and over is not feasible for maintenance.
- Phrase span: There are times that you need to repeat a phrase in several possible responses. The decision tree in the response may make copying the segment of the sentence hard to maintain in several branches. It's like a variable that doesn't represent a whole sentence. This may be similar to what Fluent does too. If I were to use a list of emojis with their spoken form, and use them in the singular, plural, definite or indefinite states in the same message, then I might consider using this phrase span.
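
To make three of these span types concrete, here is a loose sketch. It is one reading of the descriptions above, not Siri's actual design or syntax; the function names and the `{ when, text }` shape are invented for illustration.

```javascript
// Conditional span: show the text only when the condition holds.
function conditionalSpan(span, state) {
  return span.when(state) ? span.text : "";
}

// First span: evaluate conditions in order and return the first matching text.
// A span without a condition always matches, so it can serve as a catch-all.
function firstSpan(spans, state) {
  const hit = spans.find((s) => s.when === undefined || s.when(state));
  return hit ? hit.text : "";
}

// Switch span: compare a single variable against keys; a default is mandatory.
function switchSpan(variable, cases, state) {
  if (!("default" in cases)) throw new Error("switch span requires a default");
  const value = String(state[variable]);
  return Object.prototype.hasOwnProperty.call(cases, value)
    ? cases[value]
    : cases.default;
}

// Hypothetical state used only to exercise the helpers.
const state = { muted: true, count: 2 };
const greeting = firstSpan(
  [
    { when: (s) => s.count === 0, text: "Nothing new." },
    { when: (s) => s.muted, text: "You have new items (sound is off)." },
    { text: "You have new items." },
  ],
  state
);
```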

There are times that we want to change the words chosen given the state of a device. If you're looking away from a device, we may want to be more descriptive with the choice of words. If the screen is showing, we may want to be a little more terse. If the voice is muted, we may want to be verbose in the print form. If we have really small screen space, we may want to print a little text and speak a little more. These states are not a part of how we annotate the message. We will use selectors based on the device state to choose the response. I consider this to be a design choice that does not have to be a part of the framework. The application and the message author can choose what states are appropriate in a message.

There is a question about localizability. If you allow complex conditions that involve AND, OR, NOT and parentheses, that can make it hard for translators to adapt. Some conditions need to be localized. So it's helpful to expose them. Though if you give too much flexibility, developers will put too much selector logic into the message, requiring the same logic to be copied over and over into many languages. Some of that logic should have been left out of the message in the first place. So it's hard to find the right balance between flexibility and excessive complexity.

aphillips commented 4 years ago

@grhoten Thanks for that summary.

There is a question about localizability. If you allow complex conditions that involves AND, OR, NOT and parentheses, that can make it hard for translators to adapt.

I think successful designs do not expose the translators to the selection logic. The selection logic is forced to be outside the messages (it may be communicated as context to the translator, but the translators don't have to interact with or manage it). There is the possibility that this produces a large number of nearly identical strings for translation.

Phrase span: There are times that you need to repeat a phrase in several possible responses. The decision tree in the response may make copying the segment of the sentence hard to maintain in several branches. It's like a variable that doesn't represent a whole sentence. This may be similar to what Fluent does too. If I were to use a list of emojis with their spoken form, and use them in the singular, plural, definite or indefinite states in the same message, then I might consider using this phrase span.

Can you give a concrete example of this one? I think I understand, but I'm not sure that I do...

grhoten commented 4 years ago

@grhoten Thanks for that summary.

There is a question about localizability. If you allow complex conditions that involves AND, OR, NOT and parentheses, that can make it hard for translators to adapt.

I think successful designs do not expose the translators to the selection logic. The selection logic is forced to be outside the messages (it may be communicated as context to the translator, but the translators don't have to interact with or manage it). There is the possibility that this produces a large number of nearly identical strings for translation.

I think it may be possible to provide at least a subset of this functionality without using programming syntax. Nesting like you see in the ICU MessageFormat example is like an "and". If you allow selection of multiple values at once for the same span of text, that can be an "or" operation. For example, it could be a comma separated list of possible values.

I'm not sure how a "not" operation would be done. I guess a "not" would be the default in a switch statement or the last span without conditions of a first span.
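
A small sketch of that subset, assuming a hypothetical nested message shape: nesting plays the role of "and", a comma-separated key plays the role of "or", and the `other` branch covers "not any of the above". The `medium` values other than speech and text are invented for the example, and every level is assumed to have an `other` case.

```javascript
// A node is either a string (leaf pattern) or { param, cases }.
// Each case key may list several values separated by commas ("or").
// Nesting a node inside a case acts as "and".
function resolve(node, params) {
  if (typeof node === "string") return node;
  const value = String(params[node.param]);
  for (const [key, child] of Object.entries(node.cases)) {
    if (key === "other") continue; // the default is handled last
    const alternatives = key.split(",").map((k) => k.trim());
    if (alternatives.includes(value)) return resolve(child, params);
  }
  return resolve(node.cases.other, params); // "not" any of the listed values
}

const message = {
  param: "medium",
  cases: {
    "speech, braille": {
      param: "count",
      cases: {
        one: "You have one new item.",
        other: "You have several new items.",
      },
    },
    other: "New items",
  },
};
```

The translator sees only value lists and complete sentences, with no boolean operators or parentheses to maintain.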

Phrase span: There are times that you need to repeat a phrase in several possible responses. The decision tree in the response may make copying the segment of the sentence hard to maintain in several branches. It's like a variable that doesn't represent a whole sentence. This may be similar to what Fluent does too. If I were to use a list of emojis with their spoken form, and use them in the singular, plural, definite or indefinite states in the same message, then I might consider using this phrase span.

Can you give a concrete example of this one? I think I understand, but I'm not sure that I do...

I think Fluent's web site has a similar example. The examples using -sync-brand-name are similar in concept.

mihnita commented 4 years ago

Can you give a concrete example of this one? I think I understand, but I'm not sure that I do...

+100 to that :-) The syntax does not matter. Can use something close to what you have, or change it to be unrecognizable, but keeping the concepts.

And maybe the "span" lingo is throwing me off. Is this something that can tag just part of the message? Like "<span>.........</span>" in HTML?

One step up, I think the decision here is about:

1. Changing "part(s) of a message", potentially with selection on some condition. Same as inline elements in HTML (think b, i, a, span), and placeholders in XLIFF.
2. Selecting a full message based on some condition. Same as block elements in HTML (think p, div, li), and trans-unit in XLIFF 1.2, unit in XLIFF 2.x.

It does not really matter how complex the condition is.

In general, option 1 (part of a message) is really bad for i18n. There is no good way to tell what will change in translation even if only a small part changes in the source. Things like "show or not show a span of text" can change the whole sentence in some languages.

aphillips commented 1 year ago

The above discussion is rich with useful examples. I suspect that it can be closed because we have adopted a selection model that can consume any external values to do message selection and we don't specify whether the values are "contextual" or explicitly passed.

A reason to keep this issue open might be if we need to define standardized contextual variables that all messages are guaranteed access to. But we might be better off with a specific issue about that rather than reusing this issue.

aphillips commented 1 year ago

Closing per 2023-06-19 telecon discussion. Foregoing comment still applies.