[* ACTION REQUIRED *] Choosing a Core Syntax

unicode-org / message-format-wg

Developing a standard for localizable message strings


[* ACTION REQUIRED *] Choosing a Core Syntax #499

Closed aphillips closed 1 year ago

aphillips commented 1 year ago

Per our discussion in the 2023-10-16 teleconference, we have narrowed the candidates for the syntax to three. These are described in this document.

Following @aphillips's comment below as a general template, please stack rank your choice for the syntax to use in modifying the ABNF. Please respond before the group teleconference on Monday, 23 October 2023. Responses after that time will be ignored.

Any syntax that we choose is still subject to specific modifications using the normal group consensus process.

Of particular note, option 2a might be changed to have a non-enclosing "starter" sigil or starter character sequence instead of the enclosing sequence shown in the document. Similarly, option 3a uses a sigil %/%[ which is subject to change.

[!IMPORTANT] "Voting" or stack ranking will be used to inform a rough consensus discussion. This is not a "winner take all" type of exercise. We are most interested in making a good technical choice. Spend the time to elucidate your reasoning.

aphillips commented 1 year ago

(chair hat off)

My stack rank: 3a > 1a > 2a

I would accept any of these syntaxes.
I would prefer 2a to use a starter sigil instead of enclosing markers because the closing marker never has any meaning (nothing can come after the last pattern). The use of {#/#} helps alleviate the problem with up to four closing } in a row, though.
I would prefer auto-trimming whitespace (using current consensus)

I prefer 3a because:

I prefer to avoid enclosing syntax that requires users to remember and keep track of open/close markers over extended spans of message text. In 3a, all of the open/close syntax is directly around local constructs and there is no nesting.
Unlike other syntaxes, {/} is used for exactly one thing (expressions) with exactly one syntax for expressions.
- 1a uses {/} to delimit statements (such as input and match) and interior to these does not always preserve the expression syntax (in local)
- 2a uses consistent expression syntax, but also uses { for multiple things (code mode, pattern quote)
The omission of the when keyword feels like a win for authoring.

(chair hat on)

Please generally follow the above format in your own responses, although you are welcome to write as long or short as needed in the "I prefer"/"I hate" section or to comment on others' responses.

Crell commented 1 year ago

Commentary in no special order:

Re Whitespace, I don't feel I have a good enough grasp of the nuance to express an opinion.
I don't hate any of them; I think all three are viable ways forward, and none would doom the spec.
An advantage of 2a is that multi-line "codes" are naturally grouped. With 1a and 3a, every line of a code directive requires separate escaping into code mode.
I don't feel 3a's %[foo] is a meaningful improvement over when[foo]. Given a non-coder audience, spelling out the word seems more user-friendly. (People who actually work as translators, please correct me if you disagree.)
I think the main distinction between the options is
- Are code blocks single line (1a, 3a) or multi-line (2a)?
- Are code blocks beginner-delimited (3a) or wrapped (1a, 2a)?
(Where multi-line implies wrapping)

So I think my ranking would be:

3a == 2a > 1a

I'm torn on multi-line code blocks vs single-line code blocks. However, I see the advantage of single-line blocks being only beginner-delimited, so if we go single-line code, that seems reasonable.

Also, regarding the sigil to use... Another option could be //. Yes it means "comment" in almost every programming language, but that's also true of #, and // at least is extremely rare in human-text so the odds of it needing to be escaped are very small. (This would only really work as a begin-only delimiter in option 3a.)

eemeli commented 1 year ago

1a > 3a > 2a

I really like the first-order simplicity of 1a, where all the text is outside curly braces, and all the special stuff is inside them. I honestly think that this is the best or least-worst option when considering translators who end up working directly with messages in MF2 syntax, rather than through any purpose-built tooling. The only place where 1a nests braces is within the match, and that we could get rid of by requiring multiple separate statements for the rather rare case of multiple selectors.

I think 2a is two syntaxes in a trenchcoat. I think it'll lull many developers into getting familiar-ish with the first single-pattern syntax while ignoring the second, and make it that much more likely that they won't pluralize messages that really ought to be pluralized, i.e. use Message count: {$n} rather than {$n} messages because the former doesn't need code mode. Because single-pattern messages are way more common, it'll mean that on variant messages a reader of the syntax needs to realise that:

with {# or {{ we're entering this special "code mode" that changes everything,
after the curly braces of the match {…}, curly braces aren't wrapping code, but patterns,
then we're back to the single-pattern syntax, except that here it ends in another closing curly brace,
and then after the final #} or }} the message must end, rather than continuing like it does in e.g. MF1 and Fluent.

A little bit of 2a's excessive complexity could be trimmed off by dropping the terminal #}, but that won't get rid of the conceptual syntax levels that @stasm illustrated rather well. It also relies quite heavily on the code-first approach we've been looking at for a while, so I rather doubt that it'll let us get rid of the doubled or tripled meanings that get assigned to curly braces, given that we'd not managed to do so previously.

I think 3a is better than 1a in differentiating statements from placeholders by using the % prefix for them. This does introduce a few drawbacks, however:

It's harder to spot that statements are mostly delimited by %…}, and that doesn't work for multi-selector variants. So to keep track of "code" vs "text" you need to remember the syntax for each statement, unless they're each on a separate line.
Because % is so common, we need something rare like the %[ sequence to start the "when", and that means we need some sigil to communicate what the other syntaxes use the word when for, and that's a bit unfortunate.

Otherwise, conceptually, 1a and 3a are pretty close to each other.

When comparing the formats, the message I've mostly been staring at is the simple-selector message, because I figure that'll be the most common not-simple message, and therefore it matters the most. And for that, the minified single-line syntax provides the most differentiation:

{#match{$foo :plural}}{#when foo}Hello {$foo} you have a {$var}{#when *}{$foo} hello you have a {$var}
{#match{$foo :plural}when foo{Hello {$foo} you have a {$var}}when *{{$foo} hello you have a {$var}}#}
%match{$foo :plural}%[foo]Hello {$foo} you have a {$var}%[*]{$foo} hello you have a {$var}

My greatest stumbling block when reading the above is catching that for 2a the first when foo is code and not text. Next up, it's separating out {$var}%[*]{$foo} into its constituent parts, and noticing in particular that the % is code and not text, attaching to the [*] and not the {$var}.
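To make the "which part does % attach to" point concrete, here is a minimal Python sketch that splits the minified 3a line above into its preamble and variants. The `KEY` pattern and `split_variants` helper are hypothetical names, and the sketch assumes a key list is simply `%[` ... `]` with no escapes or nesting, which the real 3a rules may not guarantee.

```python
import re

# Minified 3a message copied from the comparison above.
MESSAGE = (
    "%match{$foo :plural}%[foo]Hello {$foo} you have a {$var}"
    "%[*]{$foo} hello you have a {$var}"
)

# Assumption: a variant key list is "%[" ... "]" with no escapes or nesting.
KEY = re.compile(r"%\[([^\]]*)\]")


def split_variants(message):
    """Return (preamble, [(keys, pattern), ...]) for a minified 3a message."""
    # re.split with one capturing group yields: text, keys, text, keys, text
    parts = KEY.split(message)
    preamble, rest = parts[0], parts[1:]
    variants = list(zip(rest[0::2], rest[1::2]))
    return preamble, variants


preamble, variants = split_variants(MESSAGE)
print(preamble)
for keys, pattern in variants:
    print(f"[{keys}] -> {pattern!r}")
```

A parser has no trouble binding each `%` to the `[` that follows it; the concern raised here is about a human reader doing the same at a glance.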

I don't really mind the character length of 1a being greater than the others. What's most important to me is the amount of stuff I need to mentally remember and track, and that to me is minimal in 1a.

One "ha ha, only serious" possibility for 3a would be to use the already escaped \ as the statement sigil, so we'd have e.g. \input and \when. This would not add any new restrictions on patterns, and would more clearly than any other choice conceptually attach to whatever follows. I mean, as it works for LaTeX, it could work for us as well? 😇

\match{$foo :plural}\when[foo]Hello {$foo} you have a {$var}\when[*]{$foo} hello you have a {$var}

My understanding is that we're now picking a general syntax direction so that we have a baseline from which to consider further questions, such as:

How to treat external whitespace, and whether we need to revisit the previously established consensus on this.
Whether statements other than when should be combined into a single preamble code block.

In addition to the negatives I've listed above, I dislike 2a because it forces us to decide about all these things at the same time, rather than allowing us to make stepwise progress. Both of the above concerns may be addressed later within a syntax based on either 1a or 3a. If we try to pick 2a, we'll need to resolve not just the general syntax direction but also at least the above issues before being able to make that decision.

vdelau commented 1 year ago

1a > 3a > 2a

For me 1a makes a lot of sense, although I'd revisit and simplify the whitespace quoting:

Either quote the whole pattern, preserving all whitespace inside, or have a 'bare' pattern where leading and trailing whitespace is trimmed. I find the 'quoted' literal syntax ({| / |}) to be very noisy and hard to parse.
When trimming, treat all whitespace equal, not only ASCII whitespace.
Personally, I'd also preserve whitespace in simple patterns that do not contain code besides placeholders, but for consistency I can imagine that whitespace would be trimmed. The reason for not trimming is that automatically collected patterns, which would in that case be simple, would then keep their whitespace.
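As a rough illustration of the "all whitespace, not only ASCII whitespace" point, the Python sketch below compares the two trimming policies. The helper names are hypothetical, and Python's own notion of whitespace (`str.isspace`) stands in for whatever set the spec would define.

```python
# Assumption: "ASCII whitespace" here means space, tab, CR and LF only.
ASCII_WS = " \t\r\n"


def trim_ascii(pattern):
    """Trim only ASCII whitespace from both ends of a pattern."""
    return pattern.strip(ASCII_WS)


def trim_unicode(pattern):
    """Trim every character Python treats as whitespace (str.isspace)."""
    return pattern.strip()


# Pattern padded with IDEOGRAPHIC SPACE (U+3000) and EM SPACE (U+2003).
sample = "\u3000 Hello {$name} \u2003"

print(repr(trim_ascii(sample)))    # outer non-ASCII spaces block the trim
print(repr(trim_unicode(sample)))  # all surrounding whitespace is removed
```

With ASCII-only trimming the outermost non-ASCII spaces stop the trim, so the pattern comes back unchanged, which is the kind of inconsistency this suggestion would avoid.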

I do prefer the concept of 2a over 3a, explicitly entering code mode. However, I dislike the code mode syntax as per Eemeli's reasoning above. I also have a hard time parsing that as a human, especially the when syntax. I think it was better with the sigil prefixes in one of the previous proposals, in combination with enclosing the when parameters in some form of brackets. I would prefer a marker only at the start of the message, which should only be optionally preceded by whitespace.

I do like option 3a in the sense that using a sigil instead of braces would work for me and could result in a cleaner look. However, I'd suggest to not abbreviate away the when. The same whitespace quoting concerns as mentioned at 1a apply.

stasm commented 1 year ago

My understanding is that we're now picking a general syntax direction so that we have a baseline from which to consider further questions, such as:

How to treat external whitespace, and whether we need to revisit the previously established consensus on this.

Whether statements other than when should be combined into a single preamble code block.

I agree that these are the two currently most impactful axes of decisions. (With the first one being a proxy for the decision between the triple-layer and dual-layer models.) And I think they are, in fact, the direction that we're looking for. It seems, however, that you'd like to consider them afterwards, which leaves me puzzled. What is the direction that we're picking now in this case?

I only thought about it today, but I wonder if it would be more helpful to vote on a matrix representing the above two questions. 1a and 3a cover one cell (trim whitespace, separate statements), 2a covers another (don't trim whitespace, group statements in a preamble), and we'd need to define what the other cells entail.

In addition to the negatives I've listed above, I dislike 2a because it forces us to decide about all these things at the same time, rather than allowing us to make stepwise progress. Both of the above concerns may be addressed later within a syntax based on either 1a or 3a. If we try to pick 2a, we'll need to resolve not just the general syntax direction but also at least the above issues before being able to make that decision.

Arguably, the same point can be made about 2a. It can also evolve to incorporate ideas from 1a/3a, most notably, the dual-layer model.

(A longer comment is coming soon.)

aphillips commented 1 year ago

@eemeli

I find it amusing that you're unhappy with 3a because it removes when--which was your idea. If we go back to 3a with when we get:

%match{$foo}%when{foo}Hello {$foo} you have a {$var}%when{*}{$foo} hello you have a {$var}

@stasm noted:

I wonder if it would be more helpful to vote on a matrix representing the above two questions.

I don't think it would. We need a syntax. The various whitespace and organizational hiccups are inherently entangled with the syntax. We can pretend to discuss them separately, but the choices we make depend, fundamentally, on details of the message grammar.

I'll point out again what I've said elsewhere: the whitespace problem can be done away with by requiring the code-internal pattern to be quoted. So:

{#match{$foo :plural}}{#when foo}{Hello {$foo} you have a {$var}}{#when *}{{$foo} hello you have a {$var}}
{#match{$foo :plural}when foo{Hello {$foo} you have a {$var}}when *{{$foo} hello you have a {$var}}#}
%match{$foo}%when{foo}{Hello {$foo} you have a {$var}}%when{*}{{$foo} hello you have a {$var}}

Some syntaxes can have code preambles (blocks) and others won't make sense with them. But each is a variation of a given syntax. Let's pick one so that we can go back to what works in this WG: concrete changes to the ABNF.

stasm commented 1 year ago

I struggle to cast a definitive vote because there are many latent decisions in each of the proposals. Voting is polarizing, and I tend to think that the final solution should instead try to combine multiple good ideas from each of the proposals. Furthermore, I'm optimistic in that I think that the final syntax can be derived from any of the 3 currently discussed ones.

My priorities are:

Build on a sound mental model, which can be explained to and deduced by users.
Optimize for the single line representation.
Prefer keywords and avoid the proliferation of sigils, as they reduce the discoverability of the syntax and are difficult to even search for.
Allow unquoted variant patterns.
Discourage the idea of putting text around the match statement, or putting more than one match in a single message (both are errors).

If we're voting on the general approach, or the mental model for what happens to variant patterns (see the illustration in https://github.com/unicode-org/message-format-wg/pull/496#issuecomment-1768382908), then 1a/3a > 2a.

The triple-layer model should be optional (via {{...}}), and the dual-layer model should be the default.
Simple unquoted messages should not be trimmed, while unquoted variant patterns should. This is inconsistent on purpose -- in my talking to people outside the WG this seemed to be the least surprising behavior.
As they currently stand, I think both 1a and 3a can be improved when it comes to their specific choices for sigils and otherwise.

If we're voting on specific look & feel of the syntax, then 2a > 3a > 1a.

Keywords should not be prefixed with new sigils.
Statements should be grouped together.
As an extra benefit, 2a requires the least amount of changes to the current ABNF, which makes it a good choice for the baseline of further syntax considerations.

If I could extend the comparison table with subjective reasons, I'd list the following pros and cons:

In 1a:

(+) It looks like a templating language.
(+) No new escape sequences.
(-) {...} is used for multiple concepts.
(-) Keywords are introduced with a new sigil, making the input statement (hopefully relatively common in messages with placeholders) be spelled as #input $var :func -- that's 3 words and 3 different sigils, which I'd like to avoid as much as possible.

In 2a:

(+) Keywords are not prefixed with extra sigils.
(+) Statements (input, local, match) form a coherent block at the beginning of the message body.
(+) The match ... when sequence is not broken up by delimiters.
(-) It's two syntaxes within one.
(-) The fact that the code-mode marker requires the closing part may suggest that {# #} is a kind of placeholder which can have text around it.

In 3a:

(+) {...} is only used for expressions.
(+) Keys stand out better between patterns thanks to the use of square brackets.
(+) Variant patterns can, but don't need to be quoted.
(-) Introduces a new sigil for keywords.
(-) Statements are not delimited.

I think the final solution should combine most of the pluses from the above list. In essence, this boils down to combining the code-mode block from 2a (in form of a preamble) with the dual-layer model for variants from 3a. For instance:

{# input {$date :datetime dateStyle=long} #} Today is {$date}.

{# input {$count :number} match {$count :plural} #} {[1]} One thing. {[*]} {$count} things.

Crell commented 1 year ago

@stasm I'm unclear in your final example, where does the match block end? Can anything come after "things.", and how do we know which is which? It looks like you're out of code mode there, but still within the conceptual match block, which is confusing to me.

samdark commented 1 year ago

Either 1a or 3a is good. I think there will be an issue with forced multiline translation strings, as @eemeli pointed out.

eemeli commented 1 year ago

@stasm: I agree that these are the two currently most impactful axes of decisions. (With the first one being a proxy for the decision between the triple-layer and dual-layer models.) And I think they are, in fact, the direction that we're looking for. It seems, however, that you'd like to consider them afterwards, which leaves me puzzled. What is the direction that we're picking now in this case?

I think we're picking a general shape for the syntax, with a specific form to start, from which we may iterate further. I would classify the choices as:

Everything non-text is in curly braces.
Separate "simple" and "complex" syntaxes, with some toggle or special wrapper for the latter.
Curly braces are always expressions; statements use a different code-y syntax.

@aphillips: I find it amusing that you're unhappy with 3a because it removes when--which was your idea. If we go back to 3a with when we get:
%match{$foo}%when{foo}Hello {$foo} you have a {$var}%when{*}{$foo} hello you have a {$var}

That is a better syntax in many ways, but it does require escaping errant % signs in patterns due to the %when being special. As I put it earlier, it is unfortunate that we can't really do that. I'm pretty sure that \ is the only single-character sigil we could use for that, and that's why I did mention it in my previous comment, with the \when.

Crell commented 1 year ago

The problem with \ is that it's also the escape character in many languages' own string syntax, so could easily get interpreted as \w followed by hen. Which is likely not the intent.

aphillips commented 1 year ago

@eemeli

That is a better syntax in many ways, but it does require escaping errant % signs in patterns due to the %when being special. As I put it earlier, it is unfortunate that we can't really do that. I'm pretty sure that \ is the only single-character sigil we could use for that, and that's why I did mention it in my previous comment, with the \when.

\ is hopeless as a sigil because it has meaning in so many of the formats that will contain our syntax (and leads directly back to the double-escape peril).
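The double-escape concern can be shown with a small Python sketch that embeds the `\match`/`\when` message from earlier in the thread in JSON. JSON is used only as one example of a container format; other hosts have their own escaping rules.

```python
import json

# The "\when" variant of the message quoted earlier in this thread.
# A raw string is used so that Python itself does not interpret the backslashes.
message = (
    r"\match{$foo :plural}\when[foo]Hello {$foo} you have a {$var}"
    r"\when[*]{$foo} hello you have a {$var}"
)

encoded = json.dumps(message)

print(message.count("\\"))  # backslashes in the message itself
print(encoded.count("\\"))  # backslashes once embedded in JSON
print(encoded)

# Round-tripping works, but every author has to write the doubled form.
assert json.loads(encoded) == message
```

Every sigil in the message is doubled in the file that a translator or developer would actually edit, and any `\{`-style escape inside a pattern would be doubled again in the same way.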

I agree that % is not an ideal sigil. I disagree that "we can't really do that" with regard to escaping one more character in the syntax. We might do anything. The question is whether we should. 😁

I think double-sigils is a better guard against the need to do elaborate escapes.

I don't personally agree that including when as a keyword is all that helpful. Yes, it's self-explanatory now. But once you have like 20 messages it's not really adding anything. The gnarlier (and more important) bit is explaining to users what the variant keys mean, because that's what developers and translators need to really understand. This is why I've tried to find fairly compact syntax that is clear about the key set.

Perhaps:

@input {$invites :number maxFracDigits=0}
@match {$invites :number} {$responses :number}
[[  0   *]] You sent no invites.
[[one   0]] You sent {$invites} invite and received no responses.
[[one one]] You sent {$invites} invite and received {$responses} response.
[[one   *]] You sent {$invites} invite and received {$responses} responses.
[[  *   0]] You sent {$invites} invites and received no responses.
[[  * one]] You sent {$invites} invites and received {$responses} response.
[[  *   *]] You sent {$invites} invites and received {$responses} responses.

1a's version:

{#input {$invites :number maxFracDigits=0}}
{#match {$invites :number} {$responses :number}}
{#when   0   *} You sent no invites.
{#when one   0} You sent {$invites} invite and received no responses.
{#when one one} You sent {$invites} invite and received {$responses} response.
{#when one   *} You sent {$invites} invite and received {$responses} responses.
{#when   *   0} You sent {$invites} invites and received no responses.
{#when   * one} You sent {$invites} invites and received {$responses} response.
{#when   *   *} You sent {$invites} invites and received {$responses} responses.

2a's version:

{#input {$invites :number maxFracDigits=0}
match {$invites :number} {$responses :number}
when   0   * {You sent no invites.}
when one   0 {You sent {$invites} invite and received no responses.}
when one one {You sent {$invites} invite and received {$responses} response.}
when one   * {You sent {$invites} invite and received {$responses} responses.}
when   *   0 {You sent {$invites} invites and received no responses.}
when   * one {You sent {$invites} invites and received {$responses} response.}
when   *   * {You sent {$invites} invites and received {$responses} responses.}
#} <- I almost forgot this
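Whichever spelling is used, the key set describes the same lookup table. The Python sketch below is a simplified illustration of how those keys might select a variant. It assumes first-match-in-source-order and a toy English plural rule; `category`, `key_matches` and `select` are hypothetical helpers, and the actual MF2 selection algorithm may rank variants differently.

```python
# Variant keys and patterns copied from the example above.
VARIANTS = [
    (("0", "*"), "You sent no invites."),
    (("one", "0"), "You sent {$invites} invite and received no responses."),
    (("one", "one"), "You sent {$invites} invite and received {$responses} response."),
    (("one", "*"), "You sent {$invites} invite and received {$responses} responses."),
    (("*", "0"), "You sent {$invites} invites and received no responses."),
    (("*", "one"), "You sent {$invites} invites and received {$responses} response."),
    (("*", "*"), "You sent {$invites} invites and received {$responses} responses."),
]


def category(n):
    """Toy English plural category (assumption, not CLDR data)."""
    return "one" if n == 1 else "other"


def key_matches(key, n):
    """A key matches as the wildcard, an exact number, or a plural category."""
    return key == "*" or key == str(n) or key == category(n)


def select(invites, responses):
    """Return the first variant pattern whose keys match both selectors."""
    for (k_invites, k_responses), pattern in VARIANTS:
        if key_matches(k_invites, invites) and key_matches(k_responses, responses):
            return pattern
    raise LookupError("no variant matched")


print(select(0, 5))
print(select(1, 0))
print(select(3, 1))
```

Placeholder formatting is deliberately left out; the sketch only shows how a reader has to interpret the key columns.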

sffc commented 1 year ago

1a > 3a > 2a

I like how statements are fully enclosed in 1a, like {#input $user :person type=informal} as opposed to broken up like in %input {$var :function option=value}. It makes reading a little easier and feels a little like Lisp. When I see a { I know that I'm exiting text mode and entering something different. Simple and straightforward.

3a requires more context when reading the message. For example, I don't like that %match {$foo} takes one argument and %match {$foo} {$bar} takes two. It's unclear exactly where the statement ends and the match cases begin.

Not a fan of 2a because there are even more contexts to keep track of. I can be in text mode, code mode, or text-embedded-in-code mode. In addition, it has the same problem as 3a where I need to keep track of where statements start and end.

stasm commented 1 year ago

I don't like that %match {$foo} takes one argument and %match {$foo} {$bar} takes two.

This is a good observation. It is less pronounced in the current main syntax, where the match statement can be thought of as a block that includes the variants: match <selectors> <variants>.

Once we break up statements and variant keys visually, like in {#match ..} {#when ..}, the match statement starts to look like just another declaration, similar to input and local. I actually like it! We're saying: here are the inputs, here are the locals, and here are the selectors.

To fix the different arity of match compared to input and local, in syntaxes that wrap expressions in curlies (input {$var :func}) we could allow more than one expression to follow the keyword, or we could allow more than one match to declare multiple selectors (or allow both).

{#input {$foo :func} {$bar :func}}
{#match {$foo :func} {$bar :func}}

{#input {$foo :func}}
{#input {$bar :func}}
{#match {$foo :func}}
{#match {$bar :func}}

flodolo commented 1 year ago

(following @eemeli's suggestion to comment directly in the issue)

For me 3a > 1a > 2a

3a and 1a are more or less on the same level. I find the match syntax more readable in 3a, hence the slight preference.

My main concern with 2a is the amount of syntax keywords that could be interpreted as localizable text, either by human or machine.

One note about alternative sigils: some are surprisingly painful to use on international keyboard layouts (e.g. ~ and `). It would be good to account for that among the criteria for selection.

aphillips commented 1 year ago

Still a couple of days to share your opinion before the teleconference.

So far I have a matrix that looks like the table below. This doesn't include @stasm's input, because his comment has two different stack rankings. Recall that we are not formally voting: the stack rank merely informs our discussion. However, if you don't put in your ranking and/or comments and you aren't able to join the teleconference, we'll have to proceed without your input!

| Place | 1a | 2a | 3a |
|-------|----|----|----|
| 1st   | 6  | 3  | 6  |
| 2nd   | 5  | 2  | 4  |
| 3rd   | 2  | 7  | 3  |
| NV    |    | 1  |    |

(NV == no vote) (Note: the rows don't add up because @Crell and @samdark each gave two items equal weight, which I counted as first place votes)

stasm commented 1 year ago

Please use 3a > 2a = 1a as my stackrank. The direction of both 1a and 3a is the right one when it comes to allowing unquoted patterns in variants. At the same time, I don't think they are good enough in their current form.

I'd prefer to visually distinguish statements and variant keys, rather than use {} for all three.
I'm against dropping curlies around the expression in input and local. This is particularly unfortunate in the case of local where the variable assignment = looks exactly the same as the one used in options:

{#local $foo=$bar :func opt=value}

This is why I'd favor a syntax in which the RHS is wrapped in curlies, which leads me to prefer a preamble block for all statements, to avoid too many brackets in general.

local $foo={$bar :func opt=value}
We already have too many sigils; I'd like to avoid adding new ones (and consider removing some of the existing ones...)

I also have a strong preference towards evolving the current main syntax in the direction that we pick here, rather than ta-dah!-landing the entire complete new syntax, which will be a lot of effort, especially to review. I don't see a point in oscillating between two sub-optimal designs. Instead, we should break down the necessary changes into a series of smaller PRs, and review and land them separately. If 1a or 3a wins, this will most likely lead through 2a anyways.

aphillips commented 1 year ago

@stasm Thanks! I've updated the summary table.

I also have a strong preference towards evolving the current main syntax in the direction that we pick here, rather than ta-dah!-landing the entire complete new syntax, which will be a lot of effort, especially to review.

I think that it would be hard to do changes piecemeal. But it's also important to recognize that the proposals (or their variants) pretty much recycle the lower-level parts of our current ABNF, e.g. variable and literal and such. Most of the changes will be in lines 1-14. The real effort will be the syntax spec.

I agree that our syntax is slightly sigil happy. I think this is partly an outgrowth of allowing operand-free expressions:

{$var :function}
{|quoted literal| :function}
{unquoted :function}
{:function} <- this needs : because otherwise function could be a literal

@eemeli has suggested that we could drop the : if we got rid of unquoted, but unquoted is extremely useful, particularly in variant keys. An alternative would be to provide a built-in empty value. This would allow function to always be positionally determined.

expression = "{" [s] operand [s annotation] [s] "}"
operand    = variable / literal / blank
blank      = "$_"
annotation = (function *(s option)) / reserved / private-use
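As a rough sketch of what "positionally determined" could mean in practice, the Python below classifies the operand of an expression under the ABNF fragment above. The regular expression is a loose approximation made up for illustration (simple names, no escapes, no options), and it keeps the `:` function prefix, which is an assumption rather than part of the proposal.

```python
import re

# Loose approximation of: "{" [s] operand [s annotation] [s] "}"
EXPRESSION = re.compile(
    r"""\{\s*
        (?P<operand>
            \$_                        # blank, the proposed built-in empty value
          | \$[A-Za-z_][\w-]*          # variable
          | \|[^|]*\|                  # quoted literal
          | [^\s{}|:$][^\s{}|]*        # unquoted literal
        )
        (?:\s+:(?P<function>[A-Za-z_][\w-]*))?
        \s*\}""",
    re.VERBOSE,
)


def classify(expression):
    """Return (operand kind, function name), or None if there is no operand."""
    match = EXPRESSION.fullmatch(expression)
    if match is None:
        return None
    operand = match["operand"]
    if operand == "$_":
        kind = "blank"
    elif operand.startswith("$"):
        kind = "variable"
    elif operand.startswith("|"):
        kind = "quoted literal"
    else:
        kind = "unquoted literal"
    return kind, match["function"]


for text in ["{$var :function}", "{|quoted literal| :function}",
             "{unquoted :function}", "{$_ :function}", "{:function}"]:
    print(text, "->", classify(text))
```

Because the operand slot is always filled, the operand-free form `{:function}` is rejected in this sketch and would be written as `{$_ :function}` instead.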

gibson042 commented 1 year ago

1a >> 3a > 2a

I agree with @eemeli's reasoning in https://github.com/unicode-org/message-format-wg/issues/499#issuecomment-1770252825 , which strengthened my preexisting preferences. Enclosing all special semantics like {…} is a good thing, and using the first part of the contents to identify what kind of special semantics apply sufficiently addresses the "overloading" concern from my perspective. This itself is ground that has been well-trodden by templating engines, and in fact mirrors Jinja specifically to such an extent that it could even align more by changing only spelling to represent statements like {%…%} and placeholders/expressions like {{…}}.

3a is dragged down by too many special syntax forms (%<keyword>, %[…], {…}, {{…}}), and 2a is dragged down by having too many contextual layers and too much special syntax that looks like ordinary text (input, local, match, when), especially when that's exactly what it would be in the other of its secretly-distinct grammars as discussed above.

gibson042 commented 1 year ago

I agree that our syntax is slightly sigil happy. I think this is partly an outgrowth of allowing operand-free expressions:
{$var :function}
{|quoted literal| :function}
{unquoted :function}
{:function} <- this needs : because otherwise function could be a literal

Actually, that doesn't look bad to me (modulo #483 etc.)—intuitively, unnamed input to a function can be either a variable, literal text (quoted or unquoted), or absent.

echeran commented 1 year ago

Stack rank: 2a > why are we reopening multiple longstanding decisions to solve one problem, seeing as we can avoid it?

I would prefer 2a to use a starter sigil instead of enclosing markers because the closing marker never has any meaning (nothing can come after the last pattern). The use of {#/#} helps alleviate the problem with up to four closing } in a row, though.

Notice that 1a wraps every when case in a pair of curlies, whereas 2a adds only 1 pair of enclosing markers. This is stated in relation to our current syntax, which was the result of our discussions a year ago, which was partially influenced by character shaving concerns. Option 3a solves the problem by introducing %, [, and ], which we have to worry about escaping in patterns because of its optional pattern delimiting. We should be very careful in introducing extra sigils that need escaping, given that we need to embed these message strings in different types of places.

I think 2a is two syntaxes in a trenchcoat. ... It also relies quite heavily on the code-first approach we've been looking at for a while, ..

I disagree, and this is overstated. 2a is basically our current syntax, but non-simple messages are wrapped in {#...#}. Our current syntax starts in code mode for all messages, which kept sigil escaping concerns to the minimum of { and }, and then we optimized non-simple messages by removing enclosing delimiters. All 2a does is put enclosing delimiters back, and that is all that is necessary to solve the original problem that surfaced last month.

If 2a is so terribly bad, then it means that our current syntax is bad, and we should blame the authors of our current syntax. That means us. That also invalidates our decisions and reasoning for the last 1.5 years. I think our current syntax is fine, and not terribly bad.

I dislike 2a because it forces us to decide about all these things at the same time, rather than allowing us to make stepwise progress.

No, actually, it's the opposite. 2a is only about adding {#...#} around non-simple messages in our current syntax so that we can solve the original problem, which is to remove {...} from simple messages. All other options are introducing many of the following topics: text-mode-first everywhere, optional pattern delimiting, whitespace handling around patterns, possible i18n concerns related to whitespace, do we still like keywords?, new sigils, and importantly, having to escape those new sigils in patterns.

I do prefer the concept of 2a over 3a, explicitly entering code mode. ... I think it was better with the sigil prefixes in one of the previous proposals, in combination with enclosing the when parameters in some form of brackets.

@vdelau Agreed. This is my personal ideal preference, too. However, I have only been discussing 2a as it starts from where we as a group have arrived so far, solves the problem at hand, and doesn't reopen any further decisions.

However, if reopening any number of decisions is fair game, then yes, the EM proposal (ca. Jan 2022) provides elegant syntax that I quite like (and Annex 3 offers slight twists for people who like character shaving). It predates the concepts of input and local, but here is match as multi- & single-line:

{[{$foo :function option=value} {$bar :function option=value}]
[a b] {  {$foo} is {$bar}  }
[x y] {  {$foo} is {$bar}  }
[* *] {  {$foo} is {$bar}  } 
}

{[{$foo :function option=value} {$bar :function option=value}][a b]{  {$foo} is {$bar}  }[x y]{  {$foo} is {$bar}  }[* *]{  {$foo} is {$bar}  }}

Also, [...] is a familiar way to represent sequential data like a match case tuple.

However, the implication of reopening more than 1 decision is what worries me the most about the discussions here and over the last month. It took us months to go from EM/EZ/SM proposals of Jan 2022 to a somewhat stable syntax in July 2022 in time for an ICU4J preview implementation. We started with 3/3 proposals all using sigils and 2/3 starting all messages in code mode and 2/3 delimiting patterns in non-simple messages, and arrived at our current syntax, and were okay for a year. Why the surge in interest to reconsider everything, and all at once? And why just to solve simple messages in text mode? I know our process requires unanimous consent in order to overturn previously made decisions, and attempting to do it all at once is a tall order. Only a few people are acknowledging that, and I have yet to hear a satisfying explanation of why this all is happening so fast and furious.

Regardless of the outcome, this does call into question our group's ability to understand and stick to its own decisions months or years down the line without the urge to reconsider it all. Why can't it happen again -- that we want to redesign everything based on a simple requirement change request -- if it has already happened before? What were our reasons for our previous decisions? Guiding principles inspiring those reasons? What has changed about our thinking now? Are we able to precisely describe principles guiding our thoughts so that we are clear in the future, or do we just rinse & repeat some number of months down the road?

I'll point out again what I've said elsewhere: the whitespace problem can be done away with by requiring the code-internal pattern to be quoted.

Thanks @aphillips. I personally think that requiring code-internal patterns to be quoted would solve a lot, though not all, of problems. I don't discern consensus there, either.

My priorities are: ...

Discourage the idea of putting text around the match statement, or putting more than one match in a single message (both are errors).

This concern for non-simple messages is the consequence of wanting to have simple messages start in text mode. Options 1a and 3a may reduce the concern somewhat because they introduce sigils, but the consequence of that is that users need to worry about escaping more sigils.

If the decision is between declaring that any text around non-simple messages is invalid (and having implementations reject their naive attempts to do so) vs. giving more sigils for users to worry about escaping, I think more sigil escaping is a much worse problem. It is an error-prone user experience, and it forces users to think about the relationship of sigils to any host syntax they embed messages within.
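To make the escaping cost concrete, here is a minimal Python sketch. It assumes a backslash escape character and two hypothetical sigil sets ("braces only" versus "braces plus %, [ and ]"); neither the escape character nor the exact sets are settled in this thread, and `escape_pattern` is an illustrative helper, not part of any implementation.

```python
def escape_pattern(text: str, sigils: str, escape_char: str = "\\") -> str:
    """Prefix every sigil (and the escape character itself) with escape_char.

    The backslash default is an assumption made for illustration.
    """
    special = set(sigils) | {escape_char}
    return "".join(escape_char + ch if ch in special else ch for ch in text)


# Hypothetical sigil sets, for illustration only.
BRACES_ONLY = "{}"
BRACES_PLUS_EXTRA = "{}%[]"

text = "Save 50% on [selected] items {today}"

# With braces only, the percent sign and square brackets pass through untouched.
print(escape_pattern(text, BRACES_ONLY))

# With extra sigils, ordinary punctuation in the text now needs escaping too.
print(escape_pattern(text, BRACES_PLUS_EXTRA))
```

Every character added to the sigil set is one more character that authors and translators must remember to escape, and one more that can collide with the escaping rules of whatever host format carries the message.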

@stasm I appreciate you defining your values, because as a group, we need to do that, both for evaluating options and for long-term logical consistency. I wonder if stack ranking values would be something better than scrutinizing each side of every syntax tradeoff?

It makes reading a little easier and feels a little like Lisp. When I see a { I know that I'm exiting text mode and entering something different. Simple and straightforward. ... Not a fan of 2a because there are even more contexts to keep track of. I can be in text mode, code mode, or text-embedded-in-code mode.

@sffc Yep, Lisps are known for minimal, unambiguous syntax, and some dialects reduce the noise well. In cherry-picked small examples, Lisps might seem verbose, ex: 5 * PI / 2 vs. (/ (* 5 PI) 2), but in the large, it can make complicated logic straightforward to parse, ex: boolean expressions:

(or (= shifted-epact 0)
    (and (= shifted-epact 1)
         (< 10 (mod g-year 19))))

And you end up cleaning up syntactic clutter by combining things without loss of clarity, ex: a series of let definitions:

(let [year       (gregorian-year-from-fixed date)
      prior-days (- date (gregorian-new-year year))
      correction ...]
    ...)

Among things previously discussed, the EM proposal syntax (above) comes closest to this set of design principles, followed by our current syntax, followed by 2a, and then 1a and 3a are furthest. In this regard, I'm not excited by 2a, but compared to 1a/3a, 2a makes me facepalm fewer times.

I also have a strong preference towards evolving the current main syntax in the direction that we pick here, rather than ta-dah!-landing the entire complete new syntax, which will be a lot of effort, especially to review. I don't see a point in oscillating between two sub-optimal designs.

@stasm I like your sentiment behind this, and how about instead: making the minimal amount of change possible and avoiding dragging in other topics, if we can avoid it? Because I think we can avoid it. And also solidifying our decisions with clear guiding principles? We've almost gone full circle on some topics in the last 1.5 years, it feels Ouija board-esque. I'm worried about unwittingly ending up designing a Homermobile. Beyond designing a Homermobile, the thing that keeps me up at night is the thought of having to support it for the many developers potentially making the same design-induced mistakes across a very large company, and the many more orders of magnitude of end users that would deal with poorer experiences as a result.

mihnita commented 1 year ago

My stack rank: 2a > 1a > 3a

I don't feel strongly about the {# ... #} or {{ ... }} in 2a to start code mode. But just dropping the close feels wrong (as a developer unpaired brackets scream "error")

I can live with 3 or 1 if they didn't have the "magic space trimming". So overall it is not the syntax that I care about, it is the way it works.

I know we talk syntax now, but things are related. By making small decisions on bits and pieces we will end up with something that does not work well together.

Not wrapping the message part in selectors also forces us to do more "gymnastics" to try to detect the end of the message. In the current syntax once we are in text mode we only care about { (starting placeholders) and } (end placeholder / message). And have to escape them if we need them to show "as is". With 3 we are also forced to escape [[ or %

So I can probably live with 1, if it is mandatory to "quote" the pattern (in the complex case only).

I find 3 very hard to read, especially once it gets on one line. Looks like Perl :-)

mihnita commented 1 year ago

Something I commented on PR https://github.com/unicode-org/message-format-wg/pull/496 but too late, so it probably went under the radar.

I've been trying to think more like an HTML developer, also checked again the dom localization proposal, the Google soy format (which is kind of a templating language).

And I think that the "automatic trimming of spaces" will also hurt people used to html.

Let's say I do this:

<style>
  .foo { white-space: pre; }
  #bar { white-space: pre-wrap; }
</style>
...
<p>
   Hello world one!
</p>
<p space="preserve">           Hello world two!      </p>
<p class="foo">           Hello world three!      </p>
<p id="bar">           Hello world four!      </p>

This will render with a space in front of the first message, and preserves all spaces for messages 2, 3 and 4.

Now I am asked to internationalize this and prepare for translation. Using DOM localization.

So I do:

<style>
  .foo { white-space: pre; }
  #bar { white-space: pre-wrap; }
</style>
...
<p l10n="msg1">
   Hello world one!
</p>
<p l10n="msg2" space="preserve">           Hello world two!</p>
<p l10n="msg3" class="foo">           Hello world three!</p>
<p l10n="msg4" id="bar">           Hello world four!</p>

and the "message catalog" (might even be extracted automatically, gettext-like):

{
"msg1": "Hello world one!",
"msg2": "           Hello world two!",
"msg3": "           Hello world three!",
"msg4": "           Hello world four!"
}

One would expect everything to render 100% the same. All I did was move the strings in a "string bundle".

But IF the messages automatically go through MF2, the spaces in msg2, 3, and 4 are trimmed (by MF2). And things don't work like before, where I had one or more leading spaces rendered.

So it is one of those where "ah, this looks familiar", but then I am hurt by it because it really isn't the same. Our trimming of spaces interacts (negatively) with the way the browser treats spaces.

Yes, the answer is "if you want your spaces, wrap the message in {...}, it is allowed (and optional)"

But why should I be hurt by that and forced to fix it? I already control what happens with the spaces somewhere else (in html or css). Every time you try to control one single behavior with several switches we are asking for trouble, because they interfere. And as a translator (sometimes even as a developer) I have no idea what the css says about the spaces. They are there. Should I escape them, or not?

That is the reason why I am arguing for WYSIWYG, both in simple mode and in complex mode. So in 2a the simple message Hello world! does not trim the spaces. The storage file might do that. Or the rendering engine (HTML?) might do that. But the string that the MF2 API sees should not do that. If it does, it hinders more than helps.

Note: I chose json to store the strings instead of the properties-like format in the proposal to not introduce another layer of unknown behavior with the message catalog (I don't know if the proposed .messages trims the spaces or not)

TLDR: trimming will actually hurt people familiar with the HTML behavior.
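The effect described above can be sketched in a few lines of Python. This is only an illustration: `load_messages` and its `trim` flag are hypothetical stand-ins for "a formatter that auto-trims" versus "a formatter that passes the string through verbatim", and the catalog is a shortened version of the JSON example.

```python
import json

# A shortened message catalog; msg2 carries intentional leading spaces.
catalog_json = """
{
  "msg1": "Hello world one!",
  "msg2": "      Hello world two!"
}
"""


def load_messages(raw: str, trim: bool) -> dict[str, str]:
    """Parse a JSON catalog, optionally trimming each message.

    trim=True imitates a formatter that auto-trims simple messages;
    trim=False imitates WYSIWYG handling.
    """
    messages = json.loads(raw)
    return {key: (value.strip() if trim else value) for key, value in messages.items()}


verbatim = load_messages(catalog_json, trim=False)
trimmed = load_messages(catalog_json, trim=True)

for key in verbatim:
    lost = len(verbatim[key]) - len(trimmed[key])
    print(f"{key}: {lost} whitespace characters lost to trimming")
```

JSON itself preserves the leading spaces, so any difference between the two dictionaries comes from the trimming step alone. That step is invisible to the CSS rules that were supposed to control whitespace rendering.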

mradbourne commented 1 year ago

1a = 2a = 3a There are elements from each that I like:

English keywords over sigils (1a, 2a)

These statements allow recognition over recall. I think the sigil characters rely more on the user memorizing the syntax, which might make it less accessible to new users.

Variety of enclosing characters (3a)

3a's varied syntax makes it easy to distinguish [case] from {{string}} from {$var} inside a match block, which is important in single-line patterns. In contrast, I think 2a's repeated use of curly braces requires more cognitive load to work out the meaning of a particular set of them. Additional nesting of curly braces is likely to exacerbate this.

All-encompassing code mode rather than code statement (2a)

3a repeats % inside a match block for each case statement. I feel that this looks clean in multi-line patterns because of the start-of-line position - it is easier to visually parse and "ignore" the sigil. For single-line patterns however, it creates more visual noise.

I prefer 2a's all-encompassing code mode if I make two assumptions:

The majority of complex patterns will use a single match block as their basis.
Many systems use single-line representations of patterns (e.g. one JSON file per locale)

If this user is working with patterns that start and end with "{# / #}", they can be easily recognized as the terminators of a code-mode string and effectively ignored, leaving just the match... when... when... content to think about. This would likely be more accessible for non-developer users (e.g. translation specialists). A syntax like "{#match ...}{#when ...}{#when ...}" doesn't offer this.

Linked with @stasm's point on mental models, the use-case above would see 2a treated like a 2-layer model in the majority of cases - i.e. start in code mode using "{# / #}" with text as layer 2. My preference for this aspect of 2a relies on users not needing to think about a more nested model for the majority of their use cases.

Other thoughts:

Have we discussed unwrapped variable references - "This string allows $var as well as {$var}"?
What about using a sigil for a code-mode line, rather than a statement? This would allow 1a and 3a to keep their current multi-line representation, but possibly simplify the single-line representation.
We have different expectations around the trimming of unquoted whitespace so, for this reason, it feels like explicit is better than implicit (i.e. text mode is WYSIWYG and it's always explicitly wrapped inside code-mode). Perhaps unquoted whitespace could be an optional shorthand for multi-line patterns only?

stasm commented 1 year ago

@Crell

@stasm I'm unclear in your final example, where does the match block end? Can anything come after "things.", and how do we know which is which? It looks like you're out of code mode there, but still within the conceptual match block, which is confusing to me.

If we introduce a "preamble" for all statements to live in, then I think the match should just declare the selectors (similar to how input and local are declarations) rather than introduce a block composed of selectors and variants.

Here's another way of spelling my final example (all sigils TBD):

{# input {$count :number}    ← Declare an input variable
   match {$count :plural}    ← Declare a selector.
#}                           ← The preamble ends here.
{[1]} One thing.
{[*]} {$count} things.

@aphillips

@eemeli has suggested that we could drop the : if we got rid of unquoted, but unquoted is extremely useful, particularly in variant keys. An alternative would be to provide a built-in empty value. This would allow function to always be positionally determined.

I agree that unquoted is useful in variant keys and as option values. However, I don't see much added utility when it's also allowed as operands. In fact, I think it hurts readability because it may look like a keyword. I'm going to file a new issue about this once the general syntax direction is set, so that we don't distract this thread too much.

@mihnita

I don't feel strongly about the {# ... #} or {{ ... }} in 2a to start code mode. But just dropping the close feels wrong (as a developer unpaired brackets scream "error")

I agree that leaving an unbalanced {# or {{ would be surprising. I think the suggestion was to instead use a non-bracket marker, which doesn't suggest that it should be closed. Hence the >>, but it also could be spelled as {>>} to avoid issues with escaping.

@mihnita

And I think that the "automatic trimming of spaces" will also hurt people used to html.

I agree with you that trimming the spaces in case of simple messages is a tripping hazard. @eemeli observed that we could delegate the exact handling to the host format. I.e. Java properties would trim, while JSON wouldn't. If a translator puts a space in front of the translation in key = \ Hello or in {"key": " Hello"} then preserving it is aligned with the original intent.

OTOH, I think trimming in variant patterns is similarly aligned with the original intent, because the syntax itself suggests that we put a space after the variant key. I realize that I'm advocating for an inconsistent behavior between simple patterns and variant patterns, but in my talking to people outside the WG this seemed to be the least surprising behavior. In fact, here's what I heard (rephrased):

(From a developer) "It's OK, it looks like you're trimming around statements rather than text."
(From a non-developer) "It's OK, you preserve whitespace where I chose to put it vs. you trim when the syntax encourages it for clarity."

eemeli commented 1 year ago

@echeran: 2a is only about adding {#...#} around non-simple messages in our current syntax so that we can solve the original problem, which is to remove {...} from simple messages. All other options are introducing many of the following topics: text-mode-first everywhere, optional pattern delimiting, whitespace handling around patterns, possible i18n concerns related to whitespace, do we still like keywords?, new sigils, and importantly, having to escape those new sigils in patterns.

Consideration of those topics is on the table irrespective of the general syntax choice we're currently making, just as they've always been. Were we to choose 2a, we'd still need to consider each of the above as we're starting in text mode, not delimiting simple patterns, and introducing new sigils for entering and exiting code mode.

Why the surge in interest to reconsider everything, and all at once? And why just to solve simple messages in text mode?

The short answer here is "paradigm shift". To allow for simple messages without delimiters, we need to start in text rather than code. Previously, our wrapping syntax was built on the expectation of starting in code, and now we're not doing that. We've changed a key premise, and now we need to build up the structures around patterns again.

Thankfully, we do not currently need to look at what's happening within patterns, the data model, or the message formatting, as each of those is kept constant: It's only the syntax wrapping the patterns that we're reconsidering. So a vast majority of the work we've done and the choices we've made so far continue to be fully valid and supported.

I know our process requires unanimous consent in order to overturn previously made decisions, and attempting to do it all at once is a tall order.

Actually, our process does not require unanimous consent. If opposition to a choice is sustained, our chair may call for a ballot to resolve the deadlock.

aphillips commented 1 year ago

I have updated the ranking table for comments up to here.

ryzokuken commented 1 year ago

3a > 1a >> 2a

I prefer 3a over 1a due to what feels like unnecessary verbosity in the syntax but don't feel too strongly about this. 2a I feel is just overly complex from a DX perspective overall.

markusicu commented 1 year ago

0 > 2a >> 1a > 3a

I know that 0 is off the table. I don't understand why, since we arrived at that last year after significant discussion. It wasn't my original favorite, but I came around to it after listening to the arguments.

I strongly prefer enclosing user-visible text with visible syntax, ideally always and consistently, from my experience with ICU MessageFormat. I have extended that format, or worked with contributors to extend it, several times. I have reimplemented its parser and formatter. And then I got to work for years with developers, localization product managers, and translators at and for Google to document it, explain it, and trouble-shoot messages written in it.

The simpler and the more consistent the better. A sense of "messages have always been mostly text with a sprinkle of placeholders" needs to take a step behind making it work reliably.

One of the problems has of course been inconsistent use and trimming of white space. Always enclosing user-visible text eliminates that completely and elegantly.

Also, when you consider white space, don't limit yourselves to ASCII. Ideographic space and no-break spaces can sneak in but may or may not be just as intentional as ASCII space and line feed.
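A small Python sketch shows why the definition of "white space" matters. The two trimming helpers are hypothetical; the thread does not decide which definition a formatter would use. The sample string is wrapped in an ideographic space (U+3000) and a no-break space (U+00A0).

```python
# ASCII-only definition of whitespace: space, tab, carriage return, line feed.
ASCII_WHITESPACE = " \t\r\n"


def trim_ascii(text: str) -> str:
    """Trim only ASCII whitespace from both ends."""
    return text.strip(ASCII_WHITESPACE)


def trim_unicode(text: str) -> str:
    """Trim everything Python's str.strip() treats as whitespace,
    which includes U+3000 and U+00A0."""
    return text.strip()


# Leading ideographic space, trailing no-break space.
sample = "\u3000Hello\u00a0"

print(repr(trim_ascii(sample)))    # both non-ASCII spaces survive
print(repr(trim_unicode(sample)))  # both non-ASCII spaces are removed
```

Neither result is obviously right: the no-break space may have been typed deliberately, or it may have been pasted in by accident. An always-enclosed pattern avoids having to guess.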

echeran commented 1 year ago

The short answer here is "paradigm shift". To allow for simple messages without delimiters, we need to...

This doesn't answer much, but it raises questions. It also doesn't address higher level technical issues of whether pulling in multiple other issues is truly necessary, or the higher level group question of why are we doing this now, and when will overturning our decisions happen yet again?

It is... unsettling. And disappointing. To put it mildly.

Actually, our process does not require unanimous consent. If opposition to a choice is sustained, our chair may call for a ballot to resolve the deadlock.

As far as the "paradigm shift", don't consider me included in deciding that because I feel excluded from this recent push. And if it will cause the amount of complexity in other parts of the syntax & user experience, like it seems it would for all the reasons above, then I definitely don't agree that the benefits outweigh the costs.

Look, I get that optional delimiters & the other topics all being brought into discussion lead to options that start to look like the EZ proposal that @eemeli coauthored. If the way we use our process is to repeat, lather & rinse until we end up with that, something seems broken. It doesn't make sense, and the implications for usage concerns and mistakes have me worried.

aphillips commented 1 year ago

Closing this issue per the discussion in the 2023-10-23 teleconference.

The consensus was to adopt "2a with additional ugliness" to be followed immediately by a discussion of PEWS. That discussion will produce changes which might include removing the ugliness foisted on 2a or adopting one of the other syntaxes in an iterative way.

sffc commented 1 year ago

I agree with @markusicu that it makes things cleaner to just start in code mode. I also agree with @eemeli that it's a paradigm shift to start in text mode instead of code mode. It sounds like the committee is moving in that direction, though, so I'd advise some patience and humility when arriving at the new syntax.

sffc commented 1 year ago

In other words: my overall feeling is that we should start in code mode (option 0); it looks a little strange at first but it's an easy mental model to grok and everything flows elegantly from there. However, if the committee wishes to start in text mode, in my opinion, I think it's worth fully embracing it and designing an equally elegant syntax (something in the direction of 1a) rather than "well it's text mode except when it's not" (option 2a). That is, "0 > 1a > 2a".

I don't see a path toward a good outcome of starting in text mode without setting the committee back several more quarters.

sffc commented 1 year ago

One more observation: No one as far as I can tell really loves 2a, and many people strongly dislike it. In the first vote, most people who preferred 2a actually preferred 0. Option 2a is just an ugly middle ground. Speaking personally, I don't want to see this committee land on such a solution. I would rather have an elegant solution with tradeoffs than a grotesque solution that nobody loves.

(sorry for the multiple replies)

aphillips commented 1 year ago

@sffc (and @markusicu). Thank you for your comments. Please note that this thread is closed. Comments about syntax should be directed to #474 or to the pending update of the pattern-exterior-whitespace design doc.

Option 0 is cleaner if we only consider the world of message format messages, not the world of localizable messages that formattable messages live within or the feedback from the larger community. The group has fairly solid consensus to start in text mode.

2a-with-additional-ugliness was "chosen" to enable this group to make progress on the "elegant solution with tradeoffs". The primary problem is that there is a schism in the group.

One group feels that variant patterns must always preserve whitespace (either this requires that patterns always be quoted or that whitespace between syntax and the pattern always be significant i.e. part of the pattern)
The other group feels that unquoted variant patterns should be trimmed

Note that it is always possible to quote the pattern or the whitespace. Note that non-variant patterns (i.e. simple messages such as Hello {$user}) are currently treated as space-significant.

Different syntax options can be applied to any of these in a quest for elegance. Generally speaking, options like 1a, 3a and the like are having to deal with the need to describe how best to distinguish code from text in cases where the pattern is unquoted.

This group's next step is to directly tackle the problem of pattern exterior whitespace. If we can achieve a consensus that allows unquoted patterns, we are likely to adopt a syntax designed for that (i.e. based on 1a or 3a or @stasm's predicate block proposal). If we achieve a consensus that says all variant patterns must be quoted, we'll likely move to beautify 2a (which is designed for that).

I don't see a path toward a good outcome of starting in text mode without setting the committee back several more quarters.

I disagree. If we can resolve the code/pattern boundary issue (either by quoting patterns, or choosing a trimming strategy for unquoted patterns [which includes an option of never trimming]) then we are well positioned to deliver all of the remaining details.

If this group cannot compromise on this issue, I will be forced to use the official voting mechanism in our process. I am most strenuously seeking to avoid that, as I think such a step would produce undesirable outcomes.