unicode-org / message-format-wg

Developing a standard for localizable message strings

Localization Units Formatting #118

Closed zbraniecki closed 1 year ago

zbraniecki commented 4 years ago

This is a complete braindump of my late night revelation that may be genius, crazy, foolish or any combination of those.

Background

It started with the realization that the irk I have with the name of our group overlaps with the irk that Mihai expressed, but for different reasons. Mihai said "I think we may come up with something very different than MF 1.0, so naming it 2.0 is misleading and may implicitly steer us toward trying to salvage MF similarity for compatibility reasons which may be a sunk cost fallacy" (paraphrase mine).

I reacted positively to that, because I recognize that there is a natural drift to "add to MF 1.0" just like I may have a drift to "bring Fluent to MF 2.0", and I think it may be limiting us in designing the optimal solution.

But as I dug deeper I realized that the concern I have is with the word "Message". The fact that we talk about formatting messages is already misaligned with how I think modern UI localization mental models should work.

For a simple textual app, you can have something like:

printf("You have 5 new messages.\n");

and MessageFormat 1.0 contains data model, syntax, logic and API to internationalize this line of code.

But UI paradigms are fundamentally different.

Let me give you an example:

Example

[Image: dialog-boxes-messagebox-default-button]

What does it mean to localize it? What is the "message" and what do we mean by "formatting" it in such context?

There's definitely going to be some formatting going on, there are 4 strings in this widget, and an icon, but what is the "message"?

Well, you can decompose this widget into four separate widgets (title, label, button-ok, button-cancel) and try to say "each one of those has a value and that value is a message!", and I believe that's the most common model of approaching it.

But it doesn't scale in so many ways:

1) If there's a relation between the message and the buttons (see Welsh, where there is no generic yes/no and the label of a button has to depend on the question it answers), we lose it.
2) If there's any meta information about the widget, or its localization, it is now decomposed into four independent messages.
3) If there is any behavior between localization and the widget, we need to perform it four times, once per message.
4) If there are any arguments that are required to localize this widget, we need to send them to four messages.
5) If we want the UI toolkit to plug a "localize" step in before layout/paint, we need to write some code that formats those 4 messages and applies them onto that widget.
6) Is the icon a fifth message? It may flip in RTL contexts, and icons may contain text or culturally specific graphics that may have to be part of the localization of this widget.
7) What if the button-ok, button-cancel, icon, label or the whole modal window have tooltips?
8) What if they have accesskeys?
9) What happens when there's any error in applying localization onto this widget? Are we falling back onto another locale? For one of two buttons? For the label but not for the buttons? How do we reconcile?
10) Is localization of the button synchronous or asynchronous? If there's fallback, which may require I/O for resources, is it synchronous or asynchronous? What does the binding function that applies those 4-5-10 messages onto the widget look like?
10) Can you retranslate this widget to a different locale during the UI lifetime, or do you have to recreate it in a different locale, remove the old one and add the new one? If so, are you losing event bindings and state?
11) Can you cache the state of this element pre-localization and post-localization? Can you invalidate the cache of this widget if, while loading, you realize that the translation is obsolete?
12) What if the widget text is more complicated: a paragraph of text with images or stylistic annotations, a smart sentence like "Refresh the page every 5 minutes" where 5 is actually a numerical text input or a select dropdown, or a list of items where the structure and number of items should be controlled by the localizer? How do you handle that when you are merely formatting a single string and you don't have a notion that it is part of a UI that is a nested tree structure with attributes, events, text, icons and data?

Two topics, that are intertwined but separate

I recognize that there are two topics here, my last question is from a bit different category.

1) Do we want to support localization of UI elements/widgets, which are usually much more compound than a single string?
2) Do we want to support localization of messages that have semantic fragments inside them?

I believe that the questions are related, because they relate to breaking with the idea that a message is a string and a UI is a list of messages. In this model, UI is a tree (not list!) of compound widgets, each having multiple strings inside it, and each string may have its own UI fragment inside it.

Both of those issues are rooted in how UI is different from plain text, but we should imho treat those two questions separately and be open to having different solutions, or even considering one in scope, and another out of scope.

I'm bringing them up here because I want to challenge us with thinking about end-to-end localization of UI, and then you need to consider both.

How to design it?

Designing that system is actually very tricky if you stick to thinking of localization step of the UI toolkit as taking messages (strings), formatting them, and then applying in correct positions in the UI widget. You need a lot of boilerplate code that has to either be controlled by the developer writing the code, or by the widget code, or by the toolkit and in each case is non trivial, hard to handle sync/async, limits fallbacks and, I will argue, ...

misses the point.

Localization Unit

Because you cannot localize compound, nested, rich User Interface widgets by formatting "messages". You need a concept that is broader than a single string, something I started calling in my mind a "Localization Unit". Think of all the data needed to localize the above example:

hello-prompt  = {
    "meta": {
      "role": "modal window",
      "description": "..."
    },
    "elements": {
         "label": ["Hello, ", Element("strong", [Argument("userName")]), "!"],
         "button-ok": {
           "label": "Ok", //  In Welsh `[Reference("self", "label"), "lorem ipsum"]`
           "accesskey": "O",
           "tooltip": "Click to accept"
         },
         "button-cancel": {
           "label": "Cancel",
           "accesskey": "C",
           "tooltip": "Click to reject"
         },
         "close-icon": {
           "tooltip": "Close the prompt"
         },
         "main-icon": {
           "url": "@icon-path",
           "aria-label": "Question mark icon"
         }
    }
}

And once you have it, you can do the most natural thing: you can bind such UI element to a corresponding localization unit.

<prompt
  l10n-id="hello-prompt"
  l10n-args="{userName: 'John'}"
>

or:

prompt.l10n.id = "hello-prompt";
prompt.l10n.args.set("userName", "John");

Such binding is declarative, just like applying a CSS class onto an element is, and it allows the engine to understand that before layout and painting steps for this element some resources need to be retrieved, their Localization Units must be resolved and the combination of the element and its localization unit is what gets laid out and painted.
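A minimal sketch of that binding step, in plain JavaScript. The `units` table, the widget shape and the `localize` function are assumptions made up for illustration, not part of any toolkit; the point is only that the whole unit is resolved once, with one set of arguments, before the widget is laid out.

```javascript
// Hypothetical sketch: `units` and the widget shape are assumptions for illustration.
const units = {
  "hello-prompt": {
    label: ({ userName }) => `Hello, ${userName}!`,
    "button-ok": { label: () => "Ok", accesskey: () => "O" },
    "button-cancel": { label: () => "Cancel", accesskey: () => "C" },
  },
};

// Resolve every string of a unit in one pass, with one set of arguments.
function resolveUnit(node, args) {
  if (typeof node === "function") return node(args);
  return Object.fromEntries(
    Object.entries(node).map(([key, child]) => [key, resolveUnit(child, args)])
  );
}

// The "engine" step that would run before layout and paint.
function localize(widget) {
  const unit = units[widget.l10n.id];
  if (!unit) throw new Error(`No localization unit for ${widget.l10n.id}`);
  widget.localized = resolveUnit(unit, widget.l10n.args);
  return widget;
}

const promptWidget = { l10n: { id: "hello-prompt", args: { userName: "John" } } };
localize(promptWidget);
console.log(promptWidget.localized.label);              // "Hello, John!"
console.log(promptWidget.localized["button-ok"].label); // "Ok"
```

Because the widget only declares its `l10n.id` and `l10n.args`, a toolkit built this way could in principle re-run `localize` on a locale change without recreating the widget.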

This model has a huge number of benefits:

LocalizationUnitFormatter

In ICU we actually already have a notion of such intermediate representation of data - FormattedX. For example, DateTimeFormatter produces FormattedDateTime which has a lot of information allowing users to introspect, operate and maybe even manipulate formatted data. The user can also just toString() it to get the result.

What if we had LocalizationFormatter which has a format method that returns FormattedLocalizationUnit which has all the information needed for a UI toolkit to combine it with Label, MenuItem or Button or any other widget and produce a LocalizedElement or LocalizedWidget that will be then laid out and painted?

And for the imperative case, we could still have toString which would take the value of the LocalizationUnit if it has one, and just print it as a string for the familiar printf experience.
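To make the shape of that idea a bit more concrete, here is a rough sketch. `LocalizationFormatter` and `FormattedLocalizationUnit` are hypothetical names taken from the paragraphs above; neither exists in ICU or in any library, and the unit layout (an optional `value` plus a map of `attributes`) is an assumption.

```javascript
// Hypothetical sketch: none of these classes exist in ICU or any library.
class FormattedLocalizationUnit {
  constructor(value, attributes) {
    this.value = value;           // array of parts, or null if the unit has no value
    this.attributes = attributes; // e.g. { tooltip: [...] }
  }

  // Flatten the value for the familiar printf-style use.
  toString() {
    return (this.value ?? [])
      .map((part) => (typeof part === "string" ? part : String(part.value)))
      .join("");
  }
}

class LocalizationFormatter {
  constructor(units) {
    this.units = units;
  }

  format(id, args = {}) {
    const unit = this.units[id];
    if (!unit) throw new Error(`Unknown localization unit: ${id}`);
    // Keep arguments as structured parts instead of concatenating them.
    const resolve = (pattern) =>
      pattern.map((p) => (typeof p === "string" ? p : { arg: p.arg, value: args[p.arg] }));
    const attributes = {};
    for (const [name, pattern] of Object.entries(unit.attributes ?? {})) {
      attributes[name] = resolve(pattern);
    }
    return new FormattedLocalizationUnit(unit.value ? resolve(unit.value) : null, attributes);
  }
}

const formatter = new LocalizationFormatter({
  "hello-label": {
    value: ["Hello, ", { arg: "userName" }, "!"],
    attributes: { tooltip: ["Greeting for ", { arg: "userName" }] },
  },
});
const formatted = formatter.format("hello-label", { userName: "John" });
console.log(formatted.toString()); // "Hello, John!"
```

A UI toolkit would read `formatted.value` and `formatted.attributes` directly, while plain-text callers would only ever call `toString()`.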

What's in scope?

I don't know yet. It's kind of a fresh realization and I'm not sure if my recommendation for the group is to:

a) Consider Localization Unit in scope as a level above MessageFormatter.
b) Consider Localization Unit out of scope, but the right paradigm for UI localization and therefore work on having MessageFormat 2.0 be a good lower level API for it
c) Consider Localization Unit one of many paradigms for UI localization and not tie our work to it
d) Consider Localization Unit a bad paradigm and design a better one

Why am I raising it?

The reason I think it is important is that we need to decide early on whether our target is:

printf("Hello, { $user }");
Label.textContent = format("Hello, { $user } ");

and we are ok thinking of the receiving end as flat textual strings, or do we want to embrace the fact that this is not how UI localization is today. That Label may have multiple attributes, and icons, and other values, and each one may be a nested structure of data, and localization may bring its own UI fragments that need to be overlaid on source fragments. That the function in which you call printf is not the right place to synchronously annotate the UI with a string, because then the toolkit doesn't know that the UI is localized, cannot retranslate, cannot cache, cannot invalidate that cache, and cannot have responsive localization.

I think that decisions around it will have deep consequences for our thinking about many items on our wishlist (#3)

I wrote a separate comment for Raph's new UI toolkit paradigm over the last day of wrangling with this concept. If you're interested in a more tangible application of how it may look, consider reading https://github.com/raphlinus/crochet/issues/7

romulocintra commented 4 years ago

Fun fact that I saw this morning: [image]

Should i pay or not... i will pay or not! Final message is missing and translation unit unclear 😔

eemeli commented 4 years ago

The reason I think it is important is that we need to decide early on whether our target is:

printf("Hello, { $user }");
Label.textContent = format("Hello, { $user } ");

and we are ok thinking of the receiving end as flat textual strings, or do we want to embrace the fact that this is not how UI localization is today.

For a minimal example of how a flat string would be the wrong answer, consider the work required to make $user a clickable link in a React context. Either the message itself or the $user value would need to include some start+end tagging that would need to be parsed in a separate parser, or the initial formatted value of $user would need to be replaced by some opaque identifiable string that's then re-replaced by an appropriate link. And then consider what it would take to allow for any lexical transformations of the $user string.

On the other hand, if we do provide a formatToParts style API, we can get something like this out of our formatter:

[ 'Hello, ', { var: '$user', value: $user } ]

and have a much easier time all around.

So while that doesn't exactly speak to localization units, it is a significantly smaller step to go from array output to object output, i.e. translating a whole unit at once, or at least giving it a scope only once. On the other hand, sometimes you do want to format just one message, so the API should support that too.
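As a small illustration of why the parts array is easier to work with, here is a sketch that turns it into markup with `$user` wrapped in a link. The `formatToParts` function below is a stand-in that just returns the array shown above; the `renderParts` helper, the renderer map and the `/users/` URL are assumptions for the example.

```javascript
// Hypothetical formatter output, shaped like the parts array above.
function formatToParts(args) {
  return ["Hello, ", { var: "$user", value: args.user }];
}

function escapeHtml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Render each part; variables can be wrapped by caller-provided renderers.
function renderParts(parts, renderers = {}) {
  return parts
    .map((part) => {
      if (typeof part === "string") return escapeHtml(part);
      const render = renderers[part.var] ?? ((value) => escapeHtml(value));
      return render(part.value);
    })
    .join("");
}

const html = renderParts(formatToParts({ user: "Bob" }), {
  $user: (name) => `<a href="/users/${encodeURIComponent(name)}">${escapeHtml(name)}</a>`,
});
console.log(html); // Hello, <a href="/users/Bob">Bob</a>
```

No second parser or opaque marker string is needed: the caller decides how each variable part is rendered, and a React integration could return elements instead of strings in the same way.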

Continuing with the hello-prompt example, my (JS, React) approach would be to first transpile it to something like this:

// hello-prompt.en.js
import { createElement } from 'react'

export default {
  meta: { role: () => ["modal window"], ... },
  elements: {
    label: ({ userName }) => ["Hello, ", createElement('strong', null, [userName]), "!"],
    "button-ok": { label: () => ["Ok"], ... },
    ...
  }
}

And then provide an API on top of that, maybe something like this (ignoring e.g. error checking):

import { getMessage, getUnit } from 'some/where'

// called e.g. formatMessage('hello-prompt', 'en', ['elements', 'label'], { userName: 'Bob' })
// would resolve to ['Hello, ', <strong>Bob</strong>, '!']
export async function formatMessage(unitId, locale, path, scope) {
  const unit = await getUnit(unitId, locale)
  const message = getMessage(unit, path)
  return message(scope)
}

// called e.g. formatUnit('hello-prompt', 'en', { userName: 'Bob' })
export async function formatUnit(unitId, locale, scope) {
  const unit = await getUnit(unitId, locale)
  return (path) => {
    const message = getMessage(unit, path)
    return message(scope)
  }
}

And walking through that, I think our focus should be on the first part, on enabling the work that goes into parsing and transforming messages to an executable form, while making sure that unnecessary limitations are not imposed on other API layers that may work with the data in all sorts of ways.
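For readers who want to run the API sketch above, here is one possible self-contained version. The in-memory `registry` and the bodies of `getUnit` and `getMessage` are assumptions standing in for the unspecified 'some/where' module, and a plain object replaces the React element so that no dependency is needed.

```javascript
// Hypothetical in-memory stand-ins for the `getUnit` and `getMessage` helpers.
const registry = {
  "hello-prompt": {
    en: {
      elements: {
        label: ({ userName }) => ["Hello, ", { element: "strong", children: [userName] }, "!"],
        "button-ok": { label: () => ["Ok"] },
      },
    },
  },
};

// Async because a real implementation might load resources over I/O.
async function getUnit(unitId, locale) {
  const unit = registry[unitId]?.[locale];
  if (!unit) throw new Error(`Missing unit ${unitId} for locale ${locale}`);
  return unit;
}

// Walk the path into the unit and return the message function found there.
function getMessage(unit, path) {
  const message = path.reduce((node, key) => node?.[key], unit);
  if (typeof message !== "function") throw new Error(`No message at ${path.join(".")}`);
  return message;
}

// Bind the scope once, then format any message of the unit by path.
async function formatUnit(unitId, locale, scope) {
  const unit = await getUnit(unitId, locale);
  return (path) => getMessage(unit, path)(scope);
}

formatUnit("hello-prompt", "en", { userName: "Bob" }).then((format) => {
  console.log(format(["elements", "label"]));
  console.log(format(["elements", "button-ok", "label"])); // [ 'Ok' ]
});
```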

One point where these layers interact is the Reference("self", "label") of the Welsh message in the example, and how that might interact with unit composability. By that I mean e.g. the component that itself includes the hello-prompt. If we presume that it makes sense to localise or assign scope to the whole prompt at once, by extension it may make sense to apply the same operation to a larger component that itself includes other localisable components.

In other words, if we can have a guest-view and a staff-view that both include a greeter that includes a hello-prompt, the localization units of both of those views ought to be able to refer to their specific hello-prompt, rather than just any unscoped hello-prompt. And I think that's fine, if either:

  1. The reference may be a path like ['greeter', 'hello-prompt'], or
  2. A unit is not allowed to set default values for any of its own or child scope parameters, which would allow for a direct reference to 'hello-prompt', skipping the greeter scope.

This is again one of those things that should probably be warned about by the linter, but which the language should allow.
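Option 1 could look roughly like the following. The nested `units` object, the message texts and the `resolveReference` helper are all made up for illustration; the only point is that a path-based reference picks out the hello-prompt of one specific view.

```javascript
// Hypothetical nested units: both views embed a greeter that embeds a hello-prompt.
const units = {
  "guest-view": { greeter: { "hello-prompt": { label: "Welcome, guest!" } } },
  "staff-view": { greeter: { "hello-prompt": { label: "Welcome back!" } } },
};

// Resolve a reference given as a path relative to the unit that contains it.
function resolveReference(rootId, path) {
  let node = units[rootId];
  for (const key of path) {
    if (node == null || typeof node !== "object") return undefined;
    node = node[key];
  }
  return node;
}

console.log(resolveReference("guest-view", ["greeter", "hello-prompt", "label"])); // "Welcome, guest!"
console.log(resolveReference("staff-view", ["greeter", "hello-prompt", "label"])); // "Welcome back!"
```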

stasm commented 4 years ago

I'm going to continue to beat the building-blocks-only drum, same as in #65. I think this issue is a great example of innovation that should be made possible by the low level MessageFormat 2.0 API. I'd like to see it implemented as a userland solution (which might get standardized in the future independently of MF).

There's many things to get wrong if we set out to design a holistic solution. Which is why I'm a big believer in low-level agnostic API design which is only concerned about returning a valid sentence in the target language.

Similar to @eemeli, I think that a formatToParts API can give us a lot of flexibility while still allowing the standard to be agnostic wrt. to how exactly it plugs into the UI. This is important to me because I think we'll see a lot more research and evolution in how UIs are built in the future. I'd like MF2.0 to be future-proof, and my proposal of how to achieve it is to not do the job of UI frameworks, but instead allow integrating with them in a way that fits their design best.

@zbraniecki's list of 13 "doesn't scale" questions is full of questions with no obvious good answers. The questions about sync/async, before-paint localization, fallback, retranslation -- all of these involve tradeoffs which I'd prefer be made by the API consumers, not the standard itself. They're great questions, btw, but I think our job is to let other people answer them to suit their business and non-business needs :)

I'd also like to caution against nested units, in particular arbitrarily nested. At the extreme side of this idea the whole app is a single unit, with multiple descendants. This might even be theoretically correct, but practically it's rarely desired. From the tooling perspective, it's convenient to have a definitive "leaf" type which cannot have any nesting inside; nested units mean there's no such type. Furthermore, in the nested model, the hierarchy of localization units within other units can become tightly coupled with the layout of the source code; this spells trouble for any sort of refactoring. Even CSS needs patterns like BEM to reduce the tight coupling. IMO the best way to solve this problem is to avoid it altogether by storing translations in a flat list of non-nestable units.

To conclude, I think my views are best summarized by the following bullet point from your what's in scope list:

c) Consider Localization Unit one of many paradigms for UI localization and not tie our work to it

zbraniecki commented 4 years ago

@romulocintra hah! Great example of the bad UX that results from per-string-level localization!

@eemeli: I think what you're prototyping is going in the direction of solving the issue, but I don't see how you intend to execute the "at runtime I bind the element to its localization unit" part yet. (not critique, just observation that this is I think important piece of the puzzle)

@stasm: No need to be defensive about sticking to your position. I think it's a very valuable one and since we are brainstorming many areas and angles, it is natural that we bring our perspective to each angle. I see it as a good thing :)

Saying that, I'm confused about your response because your first and last sentence are incompatible for me:

I think this issue is a great example of innovation that should be made possible by the low level MessageFormat 2.0 API. (...) To conclude, I think my views are best summarized by the following bullet point from your what's in scope list:

c) Consider Localization Unit one of many paradigms for UI localization and not tie our work to it

If we go with (c) then we will not tie our work to it (compare that to (b)) and therefore not verify that our decisions lead to enabling such approach to work.

To make the distinction between (b) and (c) tangible - when we discuss if we should allow messages to reference one another, we may be at the place where this paradigm is the strongest justification for such feature. And in such place it may simplify our design to not allow for it. If we would have a vote on whether we should simplify our design/datamodel and therefore prevent message references from being possible, and the strongest reason to not simplify it would be UI localization needs, how would you vote? If we are in (c) position I feel comfortable saying "let's not do this, enabling Localization Unit paradigm is not in scope". If we go with (b), which is in my mind much closer to your opening position then we should be able to say "since we believe message references are important for Localization Units, that alone adds a lot to rationale of investing in message references".

There are more pieces like that. Similarly with other areas, Localization is not easy to layer into separate independent layers where you can design a layer in vacuum and expect another layer to just hook-in. With Fluent, a lot of our decisions were made because another layer needs something, so in this work, I'm trying to understand what approach to other layers we want to include in the checklist of things that our layer handles and enables.

stasm commented 4 years ago

If we go with (c) then we will not tie our work to it (compare that to (b)) and therefore not verify that our decisions lead to enabling such approach to work.

I understood (b) as saying that this was the right paradigm for the UI localization, and that MF2 should focus on enabling just it. I'm closer to the opinion that it's one of many paradigms, hence I picked (c).

I'm also not sure what tying our work to it implies. I think it's helpful to keep this approach as one of the many use-cases that we want to make possible. Is this what you're proposing?

To make the distinction between (b) and (c) tangible - when we discuss if we should allow messages to reference one another, we may be at the place where this paradigm is the strongest justification for such feature.

I don't really see how they're related to a point of one being a blocker for another. It might be a good idea to discuss this in a separate issue.

nbouvrette commented 4 years ago

@zbraniecki I realized that I presumed that the main benefit of Localization Units was around "Context" yesterday on our call. Can you confirm what you consider the main benefit to be? Catching up on the thread, I'm unsure if it's "Context" or "Ease of integration".

zbraniecki commented 4 years ago

@stasm - perhaps I'm not very good at explaining the avenues I see forward! Let me try again:

I don't really see how they're related to a point of one being a blocker for another. It might be a good idea to discuss this in a separate issue.

I don't think this should be treated as a particular issue, it's just an example. To make it more generic - if we will find a feature that would be primarily necessary for such Localization Unit model, how would it affect our work here? If we think of "separate layers" and "LU is out of scope and we should not tie our work to it" then I can see a rational behavior of not including such feature and thus making LU not be able to be based on our work.

@nbouvrette

I realized that I presumed that the main benefit of Localization Units was around "Context" yesterday on our call. Can you confirm what you consider the main benefit to be? Catching up on the thread, I'm unsure if it's "Context" or "Ease of integration".

Great question! I don't think I have a clear answer, but let me try (treat it as an input to a brainstorm):

I think those two have been the actual reality of GUI apps for a long time, and they are certainly true for HTML (and thus the Web). I'm sympathetic to the notion that this concept is out of scope of our work, but I believe it is our responsibility to bother ourselves with how our work will impact the future. I think that, except for Fluent, I have not seen an attempt to build a matching localization system for that model, and instead I see repeated attempts to take i18n_printf and fit it into those two concepts, which makes l10n always feel like an afterthought in GUI UX.

I come with a worry that the vibe in this group is that we have "enough on our plate" with lower level considerations, and that we'll be eager to cast everything we can out of sight and out of mind. "It's not our problem" or "that's for the future" would be a reasonable statement if we had a good plan for making breaking changes in the future and a reasonable hope that, if needed, we could adapt our data model. But I don't think we have that luxury.

If we are successful, JS apps will be written, and many of them work with React UI, HTML etc. If we are successful some future W3C WebL10n Work Group will be kicked off to standardize HTML/DOM bindings for localization and it would be really really bad if they had to conclude that JS message formatting is not compatible with that model.

So my hope is that we will conclude here that LU model is a good candidate for the system that Message Format 2.0 should be a foundation for and thus we will consider features deemed necessary for LU to be necessary or at least highly valuable for MF 2.0.

nbouvrette commented 4 years ago

The way you are explaining triggers one question in my mind: what is the mission of this working group?

For me, it’s to build a successor of MessageFormat that offers more than the current version. So, what is the current version offering?

So, unless we decide to become very serious about solving complex linguistic problems which would require lexicons in every language (that could also be tricky to fit natively in mobile browsers), what can this group realistically tackle? My personal thought after months of discussions:

Maybe I’m being too conservative but the more we discuss, the more I realize that this problem is complex and it’s quite easy to get lost into what to tackle.

Going back to your original proposal, I love the idea of a “Localization Unit” but also agree with @stasm that this could be dealt with by the library of the technology that would also support the new syntax. Ideally, this syntax should also be agnostic of file formats, programming languages, etc. I’m not sure based on your explanation if this is option b) or c), but I think we should keep this in mind as one of the paradigms that we should easily integrate with, if this makes any sense.

mihnita commented 4 years ago

I am totally on board with the idea to have some way to "group" things. But (as usual :-) with some doubts in some areas:


  1. Not sure if "Localization Unit" is a good enough name (this is nitpicking, I agree). The term "Translation Unit" is already well established in the industry, and it means something else entirely.

  2. Bullets 9-12 feel somewhat outside the MessageFormat area, and belong more to "whatever gui framework one would build on top of MessageFormat". That's a discussion I would like to have with you (Zibi), part of the binding, because I am not sure I understand it.

  3. The access key outside the string is a terrible design. Many years ago I worked in a localization company that translated Netscape Navigator. And this was a huge source of problems. First, with this design it is possible to specify an access key that does not exist in the strings. Second, the access key will "stick" to the first instance in the string. This means that something like this:
    { "Sort by extension", "e" }
    { "Sort by size", "s" }
    { "Sort by time", "t" }

    ends up rendered like this (using _ for underlines, I can't find the markdown):
    "Sort by _extension"
    "_Sort by size"
    "Sor_t by time"

    instead of the nicer:
    "Sort by _extension"
    "Sort by _size"
    "Sort by _time"

  4. My "gut feeling" reaction is

b) Consider Localization Unit out of scope, but the right paradigm for UI localization and therefore work on having MessageFormat 2.0 be a good lower level API for it

  5. Android has something similar, but XML based (layout files). But you don't put the strings in the layout description itself, you use "references" to strings (using string IDs).

  6. I think what might help to have in the standard is the concept of a "group". I touched on it in the A MessageFormat Data Model document (slide 19), where I mention "Going “above” message level":
    • A “tree of messages” (HTML-like, allows grouping)
    • A list of messages (one after another, all equal, but the order matters)
    • A map of messages (“resource bundle” model, unique IDs)

  7. I would really encourage as many people as possible to at least scan the XLIFF 2.x spec.

There we have <group>, that we can use to build full trees.

But there are a lot of good ideas there (if you can ignore the fact that the format is XLIFF :-)

The spec not only describes the structures, but also why they are needed, how to use them, etc. Goes way beyond specifying a file format.

It will help when we get to the "map the data model to XLIFF" part :-) And it is like having a lot of localization / translators expertise at our fingertips.

  8. For links (@eemeli) I think that open-close placeholders would be a good option.

Already touched in my Localization concepts doc, slide 16, A MessageFormat Data Model, slide 17, Elango's data model proposal (PlaceholderType : OPEN / CLOSE / STANDALONE) (sorry, can't find the link right now).

And also in XLIFF, http://docs.oasis-open.org/xliff/xliff-core/v2.1/os/xliff-core-v2.1-os.html#inlineCodes

We need it anyway for other things that require open-close concepts (BiDi spans, formatting)

formatToParts can work if the text inside the link is an argument (like $user in the example), but it does not work well when it is plain text, known at translation time: "Read the {0}Privacy Policy{/0} for more info"

Placeholders should not contain localizable content, or we end up in the "deep nesting, message inside message" problem (think about some bold and some runtime argument inside a link)

Note: I've used <a> here, but I would normally use something that does not look like HTML. In other docs I've used {0} (open placeholder), {/0} (close placeholder), and {0/} for standalone placeholder.

For the data model it does not matter if that is a link, a bold, or a span with a style and "onclick" event. Which is actually a good thing...
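A minimal sketch of how such placeholders could be tokenized, using the {0} / {/0} / {0/} notation from the note above. The `parsePlaceholders` function and the part shapes are assumptions for illustration, not a proposed syntax or data model; the OPEN / CLOSE / STANDALONE names follow the PlaceholderType mentioned earlier.

```javascript
// Hypothetical tokenizer for the {0} / {/0} / {0/} notation used above.
function parsePlaceholders(message) {
  const parts = [];
  const pattern = /\{(\/?)(\d+)(\/?)\}/g;
  let last = 0;
  for (const match of message.matchAll(pattern)) {
    const [token, closing, id, standalone] = match;
    if (closing && standalone) continue; // "{/0/}" is not a placeholder; keep it as text
    if (match.index > last) {
      parts.push({ type: "TEXT", value: message.slice(last, match.index) });
    }
    const type = closing ? "CLOSE" : standalone ? "STANDALONE" : "OPEN";
    parts.push({ type, id: Number(id) });
    last = match.index + token.length;
  }
  if (last < message.length) {
    parts.push({ type: "TEXT", value: message.slice(last) });
  }
  return parts;
}

const parts = parsePlaceholders("Read the {0}Privacy Policy{/0} for more info");
console.log(parts.map((p) => p.type).join(" ")); // TEXT OPEN TEXT CLOSE TEXT
```

The translatable text "Privacy Policy" stays in the message itself, and the parts only say where placeholder 0 opens and closes; whether that becomes a link, bold text or a styled span is left to the code that consumes the parts.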

DavidFatDavidF commented 4 years ago

Just echoing @mihnita's points 6 and 7 and attempting to straighten out our terminology.

What @zbraniecki calls a localization unit in his example, really is a localization group (of units) as per the localization object model (as described in XLIFF 2) and also our agreed vocabulary.

While units can have multiple segments in a linear order (the order can be changed in the target language using @order, but the compositionality is still only linear), groups are designed to model structures as arbitrary trees, which I agree is needed for messages in modern GUIs. Grouping can have an arbitrary depth but is optional, which I think is a good idea, as it doesn't need to be forced on those who don't need it. XLIFF or LIOM (Localization Interchange Object Model) groups are designed to mimic an arbitrary source format (documentation or GUI). Here we are designing the source format, but it doesn't hurt to learn from LIOM ;-) not only because it will help later on with the mapping. See slide 7 and on here

As @mihnita hinted, the XLIFF 2 spec, when read ignoring its XMLisms, describes a generally valid object model. That's why we set up the XLIFF OMOS TC, which works on restating the LIOM independently of the traditional XML serialization (to help abstract the XML-independent business logic for wider reuse in I18n and L10n).

aphillips commented 1 year ago

In spite of the awesome amount of detail (thanks @zbraniecki for the brain dump!) I suspect this issue has (on the one hand) been superseded for MFv2 by various choices made along the way--particularly that we force complete patterns (no concatenation)--and (on the other hand) been pushed out of scope (because TU management, segmentation, and such are more applicable to resource formats wrapping around MFv2 message strings). We should do things consistent with best practices in our syntax (like "complete thought patterns"), but not introduce additional features without cause.

Note that I have championed incorporating structures helpful to the localization process, such as XLIFF or ITS markup functions or comment syntax previously, but we have, as a group, excluded these features from the syntax and (so far) default registry.

Marking resolve-candidate for discussion of closing. Consider opening specific feature requests in place of this. Also considering contributing material taken from here to an eventual user guide.

aphillips commented 1 year ago

Closing resolve-candidates per discussion in 2023-07-24 call