unicode-org / message-format-wg

Developing a standard for localizable message strings

**[ACTION REQUIRED]** Stack rank syntax options #493

Closed aphillips closed 9 months ago

aphillips commented 9 months ago

This issue serves to track working group members' opinions on the syntax options to use for our switch to text-mode. Please reply to this issue with a comment that shows your stack ranking of potential syntax options using a format similar to @aphillips's comment (which will be the first in this issue).

Syntax options are given here: beauty-contest

In addition, please stack rank your requirement priorities (an example shown in @aphillips's comment). No template is given for these: it's just to enable us to discuss relative priorities.

Please submit your comment before the 2023-10-16 teleconference.

aphillips commented 9 months ago

My stack rank for options is:

3a > 1a == 1b > 4 == 3 > 5a > 2 > 1 > 0 > 5 > 2a > 6

I ranked all of the items, even ones I would have a hard time accepting.

My stack rank for requirements is:

  1. start in text mode
  2. minimal nesting/embedding
  3. fewest syntactical quirks to learn
  4. ease of visual parsing
  5. internal consistency

SimonClark commented 9 months ago

3 > 3a > 1a > 1 6 > 5 > 4 > the rest

Criteria: Can I easily parse what mode I am in (output, execute, execute & return)

With 3, everything I need to unambiguously determine what mode I am in is on the line. I know I don't have to check the lines above and below.

Previous comment, now changed...
1a > 1 > 3 > 3a > 6 > 5 > 4 > 2 > 0 > 5a

Criteria were largely the same as Addison's, except that I added "simple, understandable rules for when in code mode vs text mode"

5 would have been at the top, had the rule of "even depth of braces: text mode, odd depth of braces: code mode" been met.
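
(Illustration.) The "even depth is text, odd depth is code" rule mentioned above can be sketched in a few lines. This is only a reading of that one sentence, not the behavior of any option in the beauty contest: the function name `mode_at` is hypothetical, and escaping and quoted literals are ignored.

```python
def mode_at(message: str, index: int) -> str:
    """Return "text" or "code" for the position just before message[index].

    Hypothetical rule: count unmatched "{" seen so far; an even depth means
    text mode, an odd depth means code mode. Escapes are not handled.
    """
    depth = 0
    for ch in message[:index]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
    return "text" if depth % 2 == 0 else "code"


msg = "Hello {$user}!"
print(mode_at(msg, 0))               # before "H": depth 0
print(mode_at(msg, msg.index("$")))  # inside the placeholder: depth 1
print(mode_at(msg, len(msg)))        # after the closing brace: depth 0
```

The appeal of such a rule is that the mode depends only on a running brace count, so a reader never needs to know which keyword opened a block.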

aphillips commented 9 months ago

All... @stasm has added three more variations to the "contest", which I have just now merged. Please also consider these in your voting. I am editing my response above to address the changes. Sorry for the late churn.

Crell commented 9 months ago

Are we just supposed to post our ranking here? If so:

3b > 1a > 1 > 1b > 3a > 5 > 3 > 4 > 5b > 6 > 2 > 2a > 0

It's a bit hard to rank some of these, as they have overlapping features, and there are other potential overlaps not listed here. But my main driver is approachability by non-coders. Translators are going to have to read and write this text, which means it needs to be clear and self-documenting for them. Also, I expect the vast majority of use cases to be simple: A basic string with a single replacement. ("User $foo logged in", etc.) That means the common case needs to be simple and easy, and then the fancier stuff "layers in" on top of it.

Another observation: 3a looks very close to how both Rust and PHP do attributes/annotations. (One could debate if that's good or bad, I suppose.)

eemeli commented 9 months ago

1a > 1 > 1b > 3a > 3 > 3b >> 2 > 2a > 5a > 5 > 4 > 6 > 0

I think a viable solution should be found in either the 1/1a/1b or 3/3a/3b syntax families. Beyond those, with 2 & 2a the changes needed to go from a simple pattern to something more are really rather extensive. I find the explicit blocks of 5/5a difficult to figure out, and 4 goes Perl in the places where we've tried to be rather SQL.

With the 3x family, my main concern is that they often result in lines starting with #, a comment-start character in a number of existing formats like .properties and YAML. Technically it's fine in both of those, but many syntax highlighters treat such lines incorrectly as comments. Needing to quote # in patterns is also a pretty significant burden, and the way that 3b ameliorates it means that e.g. #[local can start either a statement or a variant key, depending on where it shows up.

I also have a strong preference for a solution which follows the consensus from last week establishing {{...}} as the preferred optional pattern delimiter. Of the contestants, this is not followed by 0, 1a, and 2a.

macchiati commented 9 months ago

I didn't do a deep analysis, but I think all of the examples would benefit from including some text that requires quoting (it wouldn't have to be parallel across the options).

In particular, '#' is too common a character to use alone as a sigil, and I share the concern with # being a common comment-start character, so for me 3 is at the bottom. Also agree that "Translators are going to have to read and write this text, which means it needs to be clear and self-documenting for them."

1a = 2a > everything else

echeran commented 9 months ago

0 = 2a > everything else

My problem with the other options (1, 3, 4, 5, 6) is that, for non-simple messages, the insistence on text mode for patterns complicates things. It is also wrapped up in implicit behavior about consuming surrounding whitespace (ASCII space, tab, newline).

It feels inconsistent and error prone. Ex: Some languages use non-ASCII whitespace, so if someone absentmindedly switches into the input method for such a language before hitting the spacebar, they might end up getting significant whitespace to start their message when they thought it was just harmless visual separation.
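
(Illustration.) A minimal sketch of the scenario above, assuming a trimming rule that strips only ASCII space, tab, and newline around a pattern. The helper `trim_ascii` is hypothetical and does not correspond to any MessageFormat implementation; U+3000 IDEOGRAPHIC SPACE stands in for whitespace typed from a CJK input method.

```python
ASCII_WS = " \t\n"  # the whitespace set named in the comment above


def trim_ascii(pattern: str) -> str:
    """Strip only ASCII space, tab, and newline from both ends (assumed rule)."""
    return pattern.strip(ASCII_WS)


ascii_padded = "  Hello  "
ideographic_padded = "\u3000Hello"  # leading U+3000 IDEOGRAPHIC SPACE

# ASCII padding is removed, so it behaves as harmless visual separation.
print(repr(trim_ascii(ascii_padded)))

# The ideographic space survives and becomes part of the message.
print(trim_ascii(ideographic_padded) == ideographic_padded)

# For contrast, Python's argument-less str.strip() uses Unicode whitespace
# and would remove U+3000 as well.
print(ideographic_padded.strip() == "Hello")
```

Both strings look like "padding, then Hello" in an editor, but under the assumed rule only one of them formats to a bare "Hello".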

Options 1, 3, 4, 5, 6 introduce repeated syntax for non-simple messages that could be dropped if the whole message was in code mode. Also, the caveats they have about handling certain words that begin a message could be dropped similarly.

I'm okay with simple messages starting in text mode, which is supported by Option 2a with the least complication to non-simple messages. But Option 0 has none of the problems of Options 1 and 3-6 since it starts in code mode, so I'm equally happy with that, too.

And if we can't make up our minds, then I am happy to not overturn our existing decision, which is Option 0.


As an aside:

There's a special ugliness I find in Option 6, which tries to introduce programming-language-style syntax. We're only dealing with data that is ultimately the input for an API, right?

The other thing about Options 3-6 that struck me is that they seem to be trying too hard to shave characters. The thing is that I was advocating in mid-2022 strongly for a syntax (originally included in the EM proposal) that used delimiters rather than keywords. I thought it was clear and concise, but I relented because @stasm and others strongly preferred keywords, also on the grounds of readability/clarity.

Are we implicitly revisiting that decision, too? Because if so, then taking character code golf to its logical extreme could lead back to my syntax proposal. I played around to get character counts and curly brace counts among the options. And if char counts matter, Option 1 doesn't look great.

All I'm looking for is clarity and consistency in what we want, not a circularly moving goalpost of "your sigils are ugly goo, but my new sigil syntax is beautiful".

vdelau commented 9 months ago

Unfortunately, I lack the history of the various options and have a bit of a hard time parsing them. I would like to second @Crell's reasoning: Keep simple cases simple and consider the target audience that has to work with this syntax. They most likely will not be software engineers.

mradbourne commented 9 months ago

3a > 3 > 3b > 1a > 1 > 1b > 2a > 2 > 5 > 5a > 4 > 6 > 0

My priorities:

Readability on a single line

Syntax variety for simplicity

Text mode first

New and expert users should be catered for

markusicu commented 9 months ago

0 = 2a > everything else

My problem with the other options ...

I agree with everything that @echeran said.

In particular, some of the benefits of starting in code mode:

I also agree with @macchiati’s statement about an initial # being very confusing.

mihnita commented 9 months ago

0 = 2a,2 >> all else


Escaping with {||} or {| |} feels very much like a hack. Not intuitive at all.

With 2a, spaces are never trimmed: it is WYSIWYG. Spaces are never trimmed, in selectors or in plain text.


We don't show plain messages with leading or trailing spaces (" Hello world ").

But people will use this with all kinds of "storage" options (hopefully).

But imagine using JSON:

{ "msg" : "     Hello world   " }

or gettext

foo( _("     Hello world   ") );

Nobody expects space trimming here, but it will happen.
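
(Illustration.) A short sketch of the JSON case above. The storage layer hands the string over with its spaces intact; the surprise only appears if the message syntax then trims exterior whitespace from an unquoted pattern. The function `trim_exterior` is a hypothetical stand-in for such a rule, not an actual parser API.

```python
import json

# The JSON document from the comment above.
stored = '{ "msg" : "     Hello world   " }'

# JSON parsing preserves the leading and trailing spaces of the value.
pattern = json.loads(stored)["msg"]


def trim_exterior(p: str) -> str:
    """Hypothetical exterior-whitespace trimming applied to an unquoted pattern."""
    return p.strip(" \t\n")


print(repr(pattern))                 # what the author stored
print(repr(trim_exterior(pattern)))  # what a trimming syntax would format
```

Under a WYSIWYG policy the two printed values would be identical; under a trimming policy the author would have to quote the pattern to keep the spaces.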


There is also an assumption that spaces might be in the sources, but some languages (Chinese, Japanese, Thai) can delete them.

But I gave an example with Chinese where it needs an "honorific space" (a space in front of a person's name). This means that, for example, "Hello {user}" might need to be translated as " {user} hello". So the translator adds spaces. How would a translator know that the spaces are supposed to be escaped?

Not to mention that the source text is not always English. Some companies will translate FROM Chinese to something else. So adding spaces is a thing.


WYSIWYG is the best policy.

stasm commented 9 months ago

2a > 3b > 1b > ?

  1. Make it easy to build a mental model explaining the syntax.
  2. Few surprises.
  3. Start in text mode.
  4. Parse single-line messages easily.

aphillips commented 9 months ago

Per today's teleconference, closing this issue.

Thank you all for your contributions here (and for those on the call, to that discussion).

The group has narrowed the options down to three slightly-revised candidates based on 1a, 2a, and 3a in the horribly-named "beauty contest". By COB Monday 2023-10-16 in America/Los_Angeles time zone there will be a new comparison doc available for review. By COB Wednesday 2023-10-18 a new "voting" issue will be created to capture people's feelings about these options.

The 2023-10-23 call will be dedicated to (trying to) achieve consensus on text-mode-first syntax.

Note: some of the designs consider different character sequences ("sigils"). This requires a separate technical conversation.

sffc commented 9 months ago

I'll leave two slates of votes:

I. My initial reaction not understanding all the tradeoffs: 1b > 3b > 1a > 1 > 4 > 3a = 3 > 0 > 5 = 5a = 6 >> 2 > 2a

II. After reading comments from @echeran and others in this thread: 0 > all others

The concern about whitespace being dropped does seem to be a real concern to me since it impacts i18n correctness, which is paramount over the beauty contest. This criterion does not appear to be in the table of tradeoffs.

I do not really like 2 or 2a because it requires context-switching for both humans and machines to distinguish simple from complex messages.

If the concerns about whitespace can be mitigated, though, I quite like 1b and building off prior art in templating engines.

sffc commented 9 months ago

The group has narrowed the options down to three slightly-revised candidates based on 1a, 2a, and 3a in the horribly-named "beauty contest".

Sorry for being late to the voting. This was on my to-do list along with a pile of other things and it just got to the top.

Why was option 0 eliminated? This slate of votes puts 2a as the only option that preserves whitespace, and 2a is the option with the strongest initial distaste from me.

Also a bit sad to see 3a win out over 3b. I think 3b is much more clear since the entire statement is contained within brackets. With 3a we get into making whitespace significant for parsing of the code context which is something I would like to see us avoid.

aphillips commented 9 months ago

@sffc Thanks for coming back to us with this.

Why was option 0 eliminated?

Option 0 starts in code mode, that is, it requires every message to be quoted, although the majority of messages are likely to be simple replacements (Hello {$user}).

The concern about whitespace being dropped does seem to be a real concern to me since it impacts i18n correctness, which is paramount over the beauty contest. This criterion does not appear to be in the table of tradeoffs.

Preserving (or not) whitespace is a hot topic. All 14 of the options allow for quoting the pattern (the pattern is the subset of the message that is actually formatted in the end). Any of the options could require quoting the pattern in code mode and one set (2, 2a) already require it.

The general feeling is that, if we allow unquoted patterns that do not trim whitespace, users will incorrectly include whitespace in their patterns that they do not intend--and this will happen far more often than there are messages whose whitespace needs to be kept as significant. See the document about pattern exterior whitespace.

It's a separate consideration from the core syntax, although there is a slight overlap.

With 3a we get into making whitespace significant for parsing of the code context which is something I would like to see us avoid.

Whitespace is sometimes significant in every one of the syntaxes--for the separation of keys, options, and keywords. However, I don't see what you're seeing in 3a (noting that 3a was later modified to use %[key key] in place of %when{key key}). None of the spaces around keywords or selectors are meaningful, nor are line breaks. Here's an example with all meaningless whitespace removed:

#match{$foo}{$bar}#when{foo bar}Hello {$foo} you have a {$var}#when{* *}{$foo} hello you have a {$var}

Here it is again with the current 3a:

%match{$foo}{$bar}%[foo bar]Hello {$foo} you have a {$var}%[* *]{$foo} hello you have a {$var}

Note well: the "final contest" with three options will be about choosing a direction for the syntax. We can work on details of the syntax once we have a direction established. Watch for that "action required" issue in a few hours.

sffc commented 9 months ago

Ok. For significant whitespace, we just need to choose the lesser evil between always quoting messages and having a whitespace-stripping algorithm that is going to cause hiccups for someone at some point down the road.

For whitespace in syntax, my concern about 3a and 3b is that with 3b, I definitely know when this statement ends:

#[match {$foo} {$bar}]

I could even write it across multiple lines and there's no ambiguity:

#[
    match
    {$foo}
    {$bar}
]

But with 3a, I need to know when reading the code how many arguments #match requires. For example, what happens here?

#match {$foo}
{$bar}

If it is invalid syntax, then whitespace matters. But if it is valid syntax, then it comes with readability challenges because the reader needs to know the semantics of #match in order to make sense of it.

aphillips commented 9 months ago

@sffc noted:

But with 3a, I need to know when reading the code how many arguments #match requires.

No, you need to know how a variant starts (in 3a, it is %[). Multiple lines for the match statement are permitted by all of these syntaxes.

I tend to agree with you that enclosing the statement is kind of nice when reading, but is less nice when writing. Note well that optional whitespace is just that--in both directions. My previous comment showed it left out. It can also, per your examples, be inserted. Our syntaxes are mainly LL(1) or sometimes LL(2).
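
(Illustration.) A rough sketch of the point above: in the current 3a, the match statement ends where the first variant starts (`%[`), so a reader does not need to know how many selectors `%match` takes, and optional whitespace or line breaks between selectors do not change the result. The function `selectors_of` and its regular expression are assumptions made for this example; the sketch ignores escaping and is not a real parser for any of the candidate syntaxes.

```python
import re
from typing import List


def selectors_of(message: str) -> List[str]:
    """Return the selector variables of a 3a-style message.

    Assumption: everything before the first "%[" is the match statement.
    """
    start = message.find("%[")
    if start == -1:
        raise ValueError("no variant start '%[' found")
    statement = message[:start]
    # Collect {$name} placeholders; whitespace between them is irrelevant.
    return re.findall(r"\{(\$\w+)\}", statement)


# The compact example from earlier in the thread.
compact = (
    "%match{$foo}{$bar}%[foo bar]Hello {$foo} you have a {$var}"
    "%[* *]{$foo} hello you have a {$var}"
)

# The same message with optional whitespace and line breaks inserted.
spaced = (
    "%match {$foo}\n{$bar}\n"
    "%[foo bar] Hello {$foo} you have a {$var}\n"
    "%[* *] {$foo} hello you have a {$var}"
)

print(selectors_of(compact))
print(selectors_of(spaced))
```

Both calls yield the same two selectors, which is the sense in which the whitespace in the match statement is optional in both directions.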

(chair 🎩 on)

Let's make this the end of this thread. All of your observations would be better posed on the thread in 499 where we're trying to puzzle out the consensus syntax.