unicode-org / message-format-wg

Developing a standard for localizable message strings

[FEEDBACK] syntax: two ambiguities in the reserved-statement rule #721

Closed bhaible closed 1 month ago

bhaible commented 8 months ago

The rule for reserved-statement in https://github.com/unicode-org/message-format-wg/blob/main/spec/syntax.md and https://github.com/unicode-org/message-format-wg/blob/main/spec/message.abnf

reserved-statement = reserved-keyword [s reserved-body] 1*([s] expression)

contains two ambiguities:

1) If more than one whitespace character follows the reserved-keyword, it is ambiguous how many of them belong to the s rule and how many belong to the start of the reserved-body.
2) U+3000 characters before the expression can be parsed either as the end of the reserved-body or as part of the optional [s] rule.

Example (using \u escapes for legibility): The input string

.regex   /foo/\u3000\u3000{xyz}{{hello}}

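contains a reserved-statement for .regex   /foo/\u3000\u3000{xyz} and a complex-body for {{hello}}. Inside this reserved-statement, there are 3 * 3 = 9 possible parses (three ways to split the leading whitespace times three ways to split the trailing U+3000 characters):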

  '   ' parsed as s
  '/foo/\u3000\u3000' parsed as reserved-body
  '' parsed as [s]

  '  ' parsed as s
  ' /foo/\u3000\u3000' parsed as reserved-body
  '' parsed as [s]

  ' ' parsed as s
  '  /foo/\u3000\u3000' parsed as reserved-body
  '' parsed as [s]

  '   ' parsed as s
  '/foo/\u3000' parsed as reserved-body
  '\u3000' parsed as [s]

  '  ' parsed as s
  ' /foo/\u3000' parsed as reserved-body
  '\u3000' parsed as [s]

  ' ' parsed as s
  '  /foo/\u3000' parsed as reserved-body
  '\u3000' parsed as [s]

  '   ' parsed as s
  '/foo/' parsed as reserved-body
  '\u3000\u3000' parsed as [s]

  '  ' parsed as s
  ' /foo/' parsed as reserved-body
  '\u3000\u3000' parsed as [s]

  ' ' parsed as s
  '  /foo/' parsed as reserved-body
  '\u3000\u3000' parsed as [s]
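
The nine splits listed above can be reproduced mechanically. The following is a minimal sketch, not a parser for message.abnf: the function name `enumerate_splits` and its three-argument shape are hypothetical, and it assumes only what the list shows, namely that s takes at least one leading whitespace character, that leftover leading whitespace falls to the start of reserved-body, and that each trailing U+3000 goes either to the end of reserved-body or to the optional [s].

```python
# Sketch only: models the whitespace split, not the full message.abnf grammar.

def enumerate_splits(leading_ws, core, trailing_ws):
    """Return every (s, reserved_body, optional_s) split of the input.

    Assumptions (a hypothetical simplification of the ABNF):
    - s must match at least one leading whitespace character;
    - any remaining leading whitespace goes to the start of reserved-body;
    - each trailing whitespace character goes either to the end of
      reserved-body or to the optional [s] before the expression.
    """
    splits = []
    # Outer loop: how many trailing characters reserved-body keeps.
    for kept in range(len(trailing_ws), -1, -1):
        # Inner loop: how many leading characters s consumes (at least one).
        for s_len in range(len(leading_ws), 0, -1):
            s = leading_ws[:s_len]
            body = leading_ws[s_len:] + core + trailing_ws[:kept]
            optional_s = trailing_ws[kept:]
            splits.append((s, body, optional_s))
    return splits


splits = enumerate_splits("   ", "/foo/", "\u3000\u3000")
for s, body, optional_s in splits:
    # ascii() shows U+3000 as a \u escape, as in the example above.
    print(ascii(s), ascii(body), ascii(optional_s))
```

For the example input this yields nine triples, in the same order as the list, and every triple concatenates back to the original text between the keyword and the expression.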

It appears that the contents of the reserved-body are meant to appear as the body field of an UnsupportedStatement element in the data model (cf. https://github.com/unicode-org/message-format-wg/blob/main/spec/data-model/README.md ). Therefore it matters which of these 9 possibilities the parser chooses.

Please specify how these two ambiguities should be resolved.

aphillips commented 8 months ago

You're right, although note that the trailing expression does not use any whitespace it captures. Presumably the reserved body should capture all trailing whitespace, in case it wants it for something.

The production for reserved-body is meant to allow an arbitrary blob of tokens, with minimal constraint on the structure, to appear before at least one expression. The optional s production at the start allows spaces to be "stirred in".

In general our syntax treats whitespace as exterior to the meaningful portions. Required whitespace exists to keep tokens apart (for example, between keys in a variant). Optional whitespace can be removed--except, probably, in reserved-body, where it might be meaningful (or necessary at the start, in some cases).

reserved-body should be set up not to have meaningful trailing whitespace.

It's tempting to say that a reserved keyword must be followed by space, but statements like .keyword(body){expression} or .keyword{expression} should be possible. At the same time, we only have to look at .local to see a required space.

bhaible commented 8 months ago

> In general our syntax treats whitespace as exterior to the meaningful portions. Required whitespace exists to keep tokens apart ...

OK, so if I understand it correctly, the ambiguity resolution, in the example above, would be:

  '   ' parsed as s
  '/foo/' parsed as reserved-body
  '\u3000\u3000' parsed as [s]

Did I understand you correctly?
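
The "whitespace is exterior" reading can be sketched as a single deterministic split. This is only an illustration of that reading, not anything taken from the spec: the helper name `split_exterior` is hypothetical, and the whitespace set is an assumption based on the s rule (space, tab, CR, LF, and U+3000).

```python
# Assumed whitespace set for the s rule: SP, HTAB, CR, LF, U+3000.
WHITESPACE = " \t\r\n\u3000"


def split_exterior(text):
    """Split text into (s, reserved_body, optional_s).

    All leading whitespace is assigned to s and all trailing whitespace
    to the optional [s], so reserved-body never starts or ends with
    whitespace. Whitespace inside the body is left untouched.
    """
    body = text.strip(WHITESPACE)
    start = len(text) - len(text.lstrip(WHITESPACE))
    end = start + len(body)
    return text[:start], body, text[end:]


s, body, optional_s = split_exterior("   /foo/\u3000\u3000")
print(ascii(s), ascii(body), ascii(optional_s))
```

Under these assumptions the example input has exactly one parse, with '/foo/' as the reserved-body, which is the split proposed above.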

bhaible commented 8 months ago

> It's tempting to say that a reserved keyword must be followed by space

This would be hard to understand for users. The mental model users generally have is "spaces are needed to separate tokens which would otherwise combine to a single token".

> At the same time, we only have to look at .local to see a required space.

Yes, but that is because .local ends with an alphabetic character and the next token, a variable, starts with a $ which we consider to act like an alphabetic character. Without the space, .local$foo would be confusing to many users.

aphillips commented 1 month ago

We removed reserved-statement, so closing as out of scope.