unicode-org / message-format-wg

Developing a standard for localizable message strings

Do we allow multiple multi-select messages to nest inside one another? #103

Closed echeran closed 3 years ago

echeran commented 4 years ago

This issue is to start (continue) the discussion. Previous discussions have occurred in working group meetings and, more recently, in https://github.com/zbraniecki/message-format-2.0-rs/issues/6

One option proposed in the above linked issue shows the style of nested multi-select messages that is currently supported in ICU MessageFormat, Fluent, etc.

Example: "10 friends from 2 countries liked her profile."

{ PLURAL($friendsNum) ->
    [one] { $friendsNum } friend
   *[other] { $friendsNum } friends
} from { PLURAL($countriesNum) ->
    [one] { $countriesNum } country
   *[other] { $countriesNum } countries
} liked { GENDER($user) ->
    [masculine] his
    [feminine] her
   *[other] their
} profile.

That issue also links to another issue describing an alternative that has also been discussed during working group meetings: https://github.com/projectfluent/fluent/issues/4. Applying the alternative to the above example might look like:

{ PLURAL($friendsNum), PLURAL($countriesNum), GENDER($user) ->
    [one, one, masculine]     { $friendsNum } friend from { $countriesNum } country liked his profile.
    [one, one, feminine]      { $friendsNum } friend from { $countriesNum } country liked her profile.
    [one, one, other]         { $friendsNum } friend from { $countriesNum } country liked their profile.
    [one, other, masculine]   { $friendsNum } friend from { $countriesNum } countries liked his profile.
    [one, other, feminine]    { $friendsNum } friend from { $countriesNum } countries liked her profile.
    [one, other, other]       { $friendsNum } friend from { $countriesNum } countries liked their profile.
    [other, one, masculine]   { $friendsNum } friends from { $countriesNum } country liked his profile.
    [other, one, feminine]    { $friendsNum } friends from { $countriesNum } country liked her profile.
    [other, one, other]       { $friendsNum } friends from { $countriesNum } country liked their profile.
    [other, other, masculine] { $friendsNum } friends from { $countriesNum } countries liked his profile.
    [other, other, feminine]  { $friendsNum } friends from { $countriesNum } countries liked her profile.
    [other, other, other]     { $friendsNum } friends from { $countriesNum } countries liked their profile.
}

Other sources, docs, and slide decks have been made for the working group in favor of these options -- feel free to include those here as part of the discussion.

asmusf commented 4 years ago

The example seems incomplete. If I understand what the [] notation implies, I would have expected [one, one, masculine] etc. and a lead-in of PLURAL($friendsNum), PLURAL($countriesNum), GENDER... ->

Was something like that intended ?

echeran commented 4 years ago

The example seems incomplete. If I understand what the [] notation implies, I would have expected [one, one, masculine] etc. and a lead-in of PLURAL($friendsNum), PLURAL($countriesNum), GENDER... ->

Was something like that intended ?

Yeah, you're right, it needed fixing in the way that you suggested. Hopefully, I amended it correctly.

aphillips commented 4 years ago

@asmusf, I agree. The top example above would be a painful syntax for us to adopt, as it's basically a giant string concatenation. Translators adore (heavy sarcasm) having to work around ICU MessageFormat's syntax. I have a heavy preference for a syntax that enforces complete strings at the cost of repetition when nested and thus prefer something like @echeran's amended example.

This is "ugly" in this case because of three levels of nesting making a lot of repetitious strings and we all get tired of typing them. CAT tools make short work of translating the variations and the results can be grammatically correct (ftw).

FWIW, Android doesn't allow nesting and the Android team when I've talked to them cited the combinatorial explosion and complexity thereof in deciding to not support it. I think that's wrong, but do think that the third level of nesting approaches unsustainability, particularly if you're working with a language with more "slots". I think the inconvenience is enough of a discouragement to deep nesting so IMHO we should support multiple nesting levels.

zbraniecki commented 4 years ago

My position is that we should be very careful about what we limit on which level. The idea that something is hard to work with is a great reason to discourage users and ecosystems from using it, but I'm very concerned about the idea of using the data model or syntax to impose limitations based on our preferences and our read of the current situation.

The set of linter rules, the set of high-level CAT tool features enabled/disabled, and the set of what each organization will allow for and disallow is a moving target. Data model is set in stone until another monumental effort like MFWG comes up, in no small part due to frustrations with the limits of the previous data model.

It is my belief that every time we impose our "best practice" by tailoring data model, we're shortening the life span of the result of our work.

eemeli commented 4 years ago

I'd rather not make concatenation illegal, but rather recommend against it.

If we follow the usually good rule of being lenient on input but strict on output, we could well support parsing either variant, but strongly recommend that any tool that's outputting MF2 source would use the second form.

Provided that we require any and all function calls to be free of side effects and make sure that we remain free of loops and other complications, transforming a message to a canonical form is a pretty easy operation, no matter what the syntax is.

nbouvrette commented 4 years ago

I was definitely in the camp of enforcing full strings until I had to deal with a real example this week which would have required 6 levels of nesting.

In these types of extreme scenarios, concatenation would most likely work better, at the cost of the linguistic issues and TMS integration that will come with it.

So I would tend to agree that supporting both approaches with clear best practices documentation might be a good approach.

mihnita commented 4 years ago

I think that allowing nesting only kicks the can down the road to the translation tools or the translators. So it does not really solve any problem.

It only makes the life of the developers easier by making the life of the translators harder. But developers are better equipped to refactor such messages than translators. First because they are more technical, second because they can also change the code, so they have more flexibility.

Worse, messages go through segmentation, leveraging, etc., way before the translators get to touch them, and through validation (and TM updates, etc.) after the translator. All of these steps would need to be changed in all popular translation tools. This will not happen, and there will be no adoption.

So allowing internal selectors all the way to translation tools will not work.


Anyway...

The two representations are 100% compatible:

You deleted {file_count, plural, =1 {one file} other {# files}} in {dir_count, plural, =1 {one folder} other {# folders}}!

vs (ignore syntax):

{ [file_count:plural, dir_count:plural],
  [   =1    =1] {You deleted one file in one folder!}
  [   =1 other] {You deleted one file in {dir_count} folders!}
  [other    =1] {You deleted {file_count} files in one folder!}
  [other other] {You deleted {file_count} files in {dir_count} folders!}
}

It is actually possible to do this algorithmically:

foo {decision, plural, =1 {singular} other {plural}} bar

Take the prefix ("foo ") and add it as prefix to each "branch" of the selection:

{decision, plural, =1 {foo singular} other {foo plural}} bar

Then take the suffix (" bar") and add it as a suffix to each "branch" of the selection:

{decision, plural, =1 {foo singular bar} other {foo plural bar}}

It is like math: a × (b + c) × d == a × b × d + a × c × d :-)

This works recursively for multiple selectors. I have code doing this on the old ICU MessageFormat syntax, and the code is not complicated (and it's old, I have it around for maybe 2 years)
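The prefix/suffix distribution described above can be sketched in a few lines. This is only an illustration, not the code mentioned in the previous paragraph: the `Select` class and the `expand` function are hypothetical names, a message is modelled as a list of literal strings and selectors, and it assumes each argument is selected on at most once per message.

```python
from dataclasses import dataclass
from itertools import product


@dataclass
class Select:
    """Hypothetical model of an internal selector such as {n, plural, ...}."""
    arg: str     # name of the argument being selected on
    cases: dict  # case key -> list of parts (str or Select)


def expand(parts):
    """Distribute literal text over every selector branch.

    Returns a list of (conditions, text) pairs, where conditions is a
    tuple of (arg, key) choices and text is one complete string.
    """
    variants = [((), "")]
    for part in parts:
        if isinstance(part, str):
            options = [((), part)]
        else:
            # Recurse so that selectors nested inside a branch are expanded too.
            options = [
                (((part.arg, key),) + conds, text)
                for key, body in part.cases.items()
                for conds, text in expand(body)
            ]
        # a x (b + c) == a x b + a x c, applied to strings.
        variants = [
            (c1 + c2, t1 + t2)
            for (c1, t1), (c2, t2) in product(variants, options)
        ]
    return variants


message = [
    "You deleted ",
    Select("file_count", {"=1": ["one file"], "other": ["# files"]}),
    " in ",
    Select("dir_count", {"=1": ["one folder"], "other": ["# folders"]}),
    "!",
]

for conditions, text in expand(message):
    print(conditions, text)
```

For the two-selector message above this produces four complete strings, one per combination of cases, in the same order as the table shown earlier.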

What I'm trying to argue for:

Options:

  1. We don't support internal selectors and force developers to use full messages. The two forms are equivalent, there is no loss of flexibility. There is nothing you can represent in "internal selector" form that you can't represent in "full message selector" form. We don't miss anything, now or in the future.
  2. We allow them in syntax, but not in the data model. So when we parse the syntax as written by developers we do the conversion. There might be some performance cost in doing that, but it is "compile time".
  3. We allow them in the data model, and we convert when going to / from translation tools. I think this is what the Facebook model does in their format.
  4. We allow them all the way to the translation tools. I think doing this jeopardizes both linguistic quality and adoption.

We should of course look at pros and cons for each.

My choice would be 2. We keep the data model simple, and we don't lose any flexibility. We can start with 1, later on move to 2, and there is no need to change the data model for that.

Mihai


Note about 6 decisions in one message

Google restricts nesting to:

  3 levels: if all levels are select
  2 levels: one plural and one select (or 2 select)
  1 level: plural or select

I've seen a few cases where 2 plurals would have been handy, and I want to make a case for allowing that. But in general the messages can be refactored to these restrictions (except for 2 plurals). And I've also seen horrible abuses, both to circumvent the restrictions and to make the code simpler (at the expense of the messages). I'll see if there is anything I can share.

So I am quite sure that 6 selectors can be refactored to be more localization friendly. (sure, I would have to see it / see something similar to be convinced that is not possible).

eemeli commented 4 years ago

In addition to the round-tripping which was discussed at today's meeting, I would like to present at least one real-world sample message that would be rather horrible to work with if selectors could not be included within messages:

Listing
{N, plural, one{one} other{{ALL} #}}
{LIVE, select, undefined{} other{current and future}}
{TAG}
{N, plural, one{item} other{items}}
{DAY, select, undefined{} other{on {DAY} {TIME, select, undefined{} other{after {TIME}}}}}
{AREA, select, undefined{} other{in {AREA}}}
{Q, select, undefined{} other{matching the query {Q}}}

That is a natural-language summary for the results of a search among an event's programme items, as presented in a minimalist UI. It is literally the message that got me looking for a tool like MessageFormat to make it bearable to work with, and internationalisable.

That single message contains seven selectors, each with two cases. As MF1, it's pretty complex, but still better than any alternative. If MF2 only supported message-level selectors, a total of 128 cases would need to be defined for it, and adding or removing a selector would become effectively impossible.
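The arithmetic behind that figure is simple enough to spell out. The sketch below assumes the selectors are independent and uses hypothetical helper names; it compares how many cases have to be written when selectors stay inside the message with how many full messages are needed when everything is hoisted to the top level.

```python
from math import prod


def internal_case_count(case_counts):
    # With internal selectors, each selector lists only its own cases.
    return sum(case_counts)


def full_message_count(case_counts):
    # With message-level selectors, every combination becomes a full string.
    return prod(case_counts)


seven_binary_selectors = [2] * 7

print(internal_case_count(seven_binary_selectors))  # 14
print(full_message_count(seven_binary_selectors))   # 128
```

Adding an eighth two-case selector would add two cases to the internal form and double the number of full messages.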

mihnita commented 4 years ago

I've added this comment by email, with colors and fancy formatting, not realizing it comes from a GitHub issue. The result here was somewhat messy. And GitHub did not allow me to fix it (because "Email replies do not support Markdown", not clear why)

I have reformatted it below (which will make Long Ho's comment look out of order). Sorry, my fault.

longlho commented 4 years ago

+1 to @mihnita Dropbox also has strict rules against nesting levels and enforces full sentences via linters as well, otherwise things got kicked back from our translation vendor as untranslatable.

mihnita commented 4 years ago

I've tried to translate it into Romanian, and I find that unreadable. And I am a developer. There are some pieces that disappear completely (for undefined); I will take the "other" cases for such selectors.

Let's take the longest possible combination:

Listing
{N, plural, one{one} other{{ALL} #}}
current and future
{TAG}
{N, plural, one{item} other{items}}
on {DAY} after {TIME}
in {AREA}
matching the query {Q}

And let's make N = 1 / 42:

Listing one current and future {TAG} item on {DAY} after {TIME} in {AREA} matching the query {Q}
Listing {ALL} 42 current and future {TAG} items on {DAY} after {TIME} in {AREA} matching the query {Q}

I have no idea what {ALL} or {TAG} mean, so it is hard to understand and translate the message.

The various substrings ("current and future", "on {DAY} after {TIME}", "in {AREA}", "matching the query {Q}") can "disappear" at will.

Using "one" is wrong, and would break for non-English ("one" in French means 0 or 1, and in Russian it means 1, 21, 531, 9281, and so on).
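To make the category mismatch concrete, here is a hand-simplified sketch of when the "one" category applies in three languages. It is an assumption-laden illustration: `is_one` is a hypothetical helper, it handles non-negative integers only, and real code should take the rules from CLDR through a library rather than hard-coding them.

```python
def is_one(locale, n):
    """Return True if the integer n falls in the 'one' plural category.

    Simplified, integer-only versions of the plural rules for three
    locales; fractions and the other categories are ignored here.
    """
    if locale == "en":
        return n == 1
    if locale == "fr":
        return n in (0, 1)
    if locale == "ru":
        return n % 10 == 1 and n % 100 != 11
    raise ValueError(f"no rule sketched for locale {locale!r}")


for n in (0, 1, 11, 21, 531, 9281):
    print(n, {loc: is_one(loc, n) for loc in ("en", "fr", "ru")})
```

So a branch keyed on "one" that hard-codes the word "one" is only safe in a language where that category contains exactly the number 1.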

It will be a major pain for languages that don't use spaces (Chinese, Japanese, Thai, etc.)

In Romanian (and other Romance languages) "current and future" should be singular / plural, depending on N, to match items. If I translate "item" as "articol" (~article), which is neutral, then I need:

"curent și viitor" (singular)
"curente și viitoare" (plural)

So that part has to be somehow "dragged" inside the selection. By the translator?

By doing this I am already forced to do some "nasty nesting", as all the text from one/{ALL} and item(s) becomes a single selector.

Also, "item" should come before "current and future".

Worse: "matching" should also match the plural of items:

... 1 articol curent și viitor ... care se potrivește interogării {Q}
... 43 articole curente și viitoare ... care se potrivesc interogării {Q}

I don't count 7 selectors, I only count 6 (N shows twice). And only N is plural, "exploding" to 6 forms in Arabic. All the others are on / off "switches". But 6 or 7, this is just nitpicking, I admit.


TLDR: this makes the life of ONE English developer easier. And makes the life of 40-60-80 translators harder. So we just passed the hardship from 1 "techie" to 80 "non-techies".

I would rather translate 64 messages than this. It is faster, and probably cheaper:

To translate this I really have to "untangle" it in my brain (or on paper), so I end up with the many combinations that we tried to avoid. Then I translate, and try to figure out how to "compress" it back (including the ugly nesting that the English does not need, but I do).

So, if anything, I see this example as an argument for NOT supporting internal selectors.

Mihai

mihnita commented 4 years ago

How would I handle the example above?

This sounds like a pretty technical message, which does not sound very natural anyway. I would sacrifice a bit of the "natural" feeling and do something like this:

{[N:plural LIVE:select],
[  one undefined] {Listing one {TAG} item}
[  one     other] {Listing one current and future {TAG} item}
[other undefined] {Listing {ALL} # {TAG} items}
[other     other] {Listing {ALL} # current and future {TAG} items}
}

And "build" the EXTRA_INFO from individual pieces, not trying to be natural. Probably using a list formatter, with the following "parts":

WHEN: when: {TIME, select, undefined {on {DAY}} other {on {DAY} after {TIME}}}
WHERE: where: in {AREA}
QUERY: query: {Q}

And then ListFormat(WHEN, WHERE, QUERY)
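A rough sketch of that assembly step, under stated assumptions: `when_part` and `build_extra_info` are hypothetical helpers, the labels are the English ones from the parts above, and a plain comma join stands in for a real locale-aware list formatter.

```python
def when_part(day, time=None):
    # Mirrors the WHEN part: "on {DAY}" with an optional "after {TIME}".
    if day is None:
        return None
    text = f"when: on {day}"
    if time is not None:
        text += f" after {time}"
    return text


def build_extra_info(day=None, time=None, area=None, query=None):
    parts = [
        when_part(day, time),
        f"where: in {area}" if area is not None else None,
        f"query: {query}" if query is not None else None,
    ]
    # Stand-in for ListFormat: keep only the parts that are present.
    return ", ".join(part for part in parts if part is not None)


print(build_extra_info(day="Friday", time="18:00", area="Hall A", query="dragons"))
print(build_extra_info(area="Hall A"))
```

Each part is a complete, separately translatable fragment, and a missing part simply drops out of the list instead of leaving a hole in a sentence.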

I know it is not ideal... but I think the result is not that bad. Probably less bad than translating the original.


Still not sure what {ALL} and {TAG} might be. I would need more context. If for example {TAG} is something like "cheap items" or "expired" or "newly arrived" (any adjective, really), then it should also match the gender and number of item(s), so you can't just take any fragment of text and put it there.

Let's take "new", for example:

... articol nou ...
... articole noi ...

("new" goes after "items"):

eemeli commented 4 years ago

I have no idea what {ALL} or {TAG} mean, so it is hard to understand and translate the message.

{TAG} indicated a category or tag assigned to a programme item, e.g. "panel discussion" etc. In English, {ALL} was either the string "all" or the empty string "", to indicate whether the results listed all or a subset of the {TAG} items. I entirely agree that my example is pretty horrible to work with, but OTOH it did successfully do just what it needed to; I wrote the Finnish & English ones myself, and worked with two other translators for other languages. I'm still not aware of any existing format that could've expressed that more effectively than MF1; maybe Fluent? I would be happy for a version of this message to be used in our test cases, as a sort of degenerate case.

@mihnita's suggestion of splitting up the parts and then using ListFormat to collate them is probably how I'd solve it now, but that wasn't possible back in 2014 when I wrote the original. It's also a good argument for allowing messages to be built out of other partial messages.

Summarising/rephrasing my points here:

  1. I strongly believe that MF2 should provide at least technical support for being a transpilation target/interchange format for a wide variety of source formats.
  2. The format itself should allow more things than what's allowed by linter rules. The range of potential users for MF2 is really wide, and we should not presume that we can grok all of their use cases or usage patterns.
  3. Given that it's actually really rather easy to transform a complex message into one that has only top-level selectors, we should allow for that transformation to happen within MF2, rather than by necessity outside it.
  4. If we define a transformation of an MF2 structure from whatever form to top-level-selectors-only, and require that to be (a) valid for all functions that might be called within the message and (b) require/define it to be equivalent to the original message, we can achieve all of the benefits of forcing the structure to always have that form.
  5. It's slightly harder to split up a message into a minimal set of nested component selectors, and it's somewhat tricky (but still possible) to go back to the original form of the message. We don't need to define those transformations, just acknowledge that they're possible, and may be interesting for optimising e.g. file size or version control diffs.

zbraniecki commented 4 years ago

@mihnita's suggestion of splitting up the parts and then using ListFormat to collate them is probably how I'd solve it now, but that wasn't possible back in 2014 when I wrote the original. It's also a good argument for allowing messages to be built out of other partial messages.

I'm a strong proponent of message references. Either via bundles like Fluent does, or via argumented contexts like Mihai is suggesting:

file1.res
hello = Open { preferencesSections }

file2.res
preferencesSections = Preferences

callsite:
let res1 = getResource("file1.res");
let res2 = getResource("file2.res");
let msg = res1.getMessage("hello");
formatMessage(msg, args, ctx = [ res1, res2 ]);

which allows hello to be resolved with a reference to a message from res2 since it is passed as a "context".
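As a toy illustration of that resolution order (not an existing API): resources are modelled as plain dictionaries, `format_message` is a hypothetical function, placeholders are looked up in `args` first and then in each resource of the context, and cyclic references are not guarded against.

```python
import re

# Matches placeholders such as "{ preferencesSections }".
PLACEHOLDER = re.compile(r"\{\s*(\w+)\s*\}")


def format_message(pattern, args, ctx):
    """Resolve placeholders from args, then from messages in ctx."""
    def resolve(match):
        name = match.group(1)
        if name in args:
            return str(args[name])
        for resource in ctx:
            if name in resource:
                # A referenced message may itself contain placeholders.
                return format_message(resource[name], args, ctx)
        return match.group(0)  # leave unknown placeholders untouched

    return PLACEHOLDER.sub(resolve, pattern)


res1 = {"hello": "Open { preferencesSections }"}
res2 = {"preferencesSections": "Preferences"}

print(format_message(res1["hello"], {}, [res1, res2]))
```

Without `res2` in the context the placeholder stays unresolved, which is the point of passing the resources explicitly at the call site.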

The format itself should allow more things than what's allowed by linter rules. The range of potential users for MF2 is really wide, and we should not presume that we can grok all of their use cases or usage patterns.

Super strong agreement. I believe that every suggestion to reject something in the data model because "it's not a good practice" shortens the longevity of our solution. Blocking it on the linter/tooling/resolver level provides the same security today while severely lowering the barrier to change if we are wrong or if someone develops a need we didn't foresee.

Given that it's actually really rather easy to transform a complex message into one that has only top-level selectors, we should allow for that transformation to happen within MF2, rather than by necessity outside it.

+1

eemeli commented 3 years ago

Given the consensuses reached at last week's meeting, should we have another task force meeting to see if we could resolve this one? As relevant context, see also the conversation in #130 for what this decision will effectively imply.